Friday 14 December 2018

ORA-07286: sksagdi: cannot obtain device information

Lately we had an issue regarding a Dataguard configuration - quite myesterious.
In dgmgrl there were reported standard errors about broken transport and inability to solve the gap and ORA-16810, which is actually the error, which just tell You, something is wrong (and not what is wrong).
The behaviour was as follows - the apply processes worked ok, so if I restored the archivelogs on the standby, they were applied even up to the current log. But then nothing - so it was clear there was a problem with the transport. But resetting connections was in vain, we also changed some parameters in the dataguard configuration, still nothing.
Finally the right solution was to restart the standby, which was a bit counterintuitive as the transport is mainly the feature of the primary. This was also fortunate, as there was no problem to perform that at once without service disruption.

Thursday 22 November 2018

Tuesday 2 October 2018

Work with EMCLI

Starting the work

Initially one have to login - without that the commands are not recognized as valid and even help is reduced
[oracle@host ~]$ emcli help
Summary of commands:


  Basic Administration Verbs
    argfile    -- Execute emcli verbs from a file
    help       -- Get help for emcli verbs (Usage: emcli help [verb_name])
    invoke_ws     -- Invoke EM web service.
    login      -- Login to the EM Management Server (OMS)
    logout     -- Logout from the EM Management Server
    setup      -- Setup emcli to work with an EM Management Server
    status     -- List emcli configuration details
    sync       -- Synchronize with the EM Management Server
    version    -- List emcli verb versions or the emcli client version

  To get the help for a single verb, execute - emcli help  
So we start
[oracle@host ~]$ emcli login -username=sysman
Enter password 

Login successful
Now help looks way more elaborate:
[oracle@host ~]$ emcli help
Summary of commands:


  Basic Administration Verbs
    argfile    -- Execute emcli verbs from a file
    help       -- Get help for emcli verbs (Usage: emcli help [verb_name])
    invoke_ws     -- Invoke EM web service.
    login      -- Login to the EM Management Server (OMS)
    logout     -- Logout from the EM Management Server
    setup      -- Setup emcli to work with an EM Management Server
    status     -- List emcli configuration details
    sync       -- Synchronize with the EM Management Server
    version    -- List emcli verb versions or the emcli client version

  ADM Verbs
    associate_target_to_adm     -- Associate a target to an application data model.
    export_adm     -- Export Application Data Model to a specified .xml file.
    import_adm     -- Import Application Data Model from a specified .xml file.
[..]

  WebSphere Target Management Verbs
    discover_was    -- Discover a IBM WebSphere Cell or a IBM Websphere Standalone Server
    refresh_was    -- Refresh IBM WebSphere Cells
argus
  To get the help for a single verb, execute - emcli help  

Adding a new blackout reason

## to add:
[oracle@host ~]$ emcli add_blackout_reason -name="Target disabled intentionally"
Added Blackout Reason "Target disabled intentionally" successfully

## to list
[oracle@host ~]$ emcli get_blackout_reasons
AS: Application Server Bounce
AS: Application Server Configuration Change
AS: Application Server Emergency
[..]

Tuesday 11 September 2018

ASM: moving disk between groups

SQL> select name||':'||path from v$asm_disk;

[..]
DATA_0008:/dev/oracleasm/disks/ASMDISK9

-- the operation
SQL> alter diskgroup DATA drop disk {chosen name from previous query} [rebalance power {speed between 0 and 11}];

-- the operation progress
SQL> select operation, state, est_minutes from v$asm_operation;

OPERA STAT EST_MINUTES
----- ---- -----------
REBAL RUN      7
REBAL WAIT      0

SQL> select GROUP_NUMBER,name,STATE from v$asm_disk where name='DATA_0008';

GROUP_NUMBER NAME       STATE
------------ ------------------------------ --------
    1 DATA_0008       DROPPING

-- ASM alert
Tue Sep 11 10:40:51 2018
NOTE: requesting all-instance membership refresh for group=1
Tue Sep 11 10:40:51 2018
GMON updating for reconfiguration, group 1 at 14 for pid 30, osid 19142
Tue Sep 11 10:40:51 2018
NOTE: group 1 PST updated.
SUCCESS: grp 1 disk DATA_0008 emptied
NOTE: erasing header (replicated) on grp 1 disk DATA_0008
NOTE: erasing header on grp 1 disk DATA_0008
NOTE: process _x000_+asm (19142) initiating offline of disk 8.3915952393 (DATA_0008) with mask 0x7e in group 1 (DATA) without client assisting
NOTE: initiating PST update: grp 1 (DATA), dsk = 8/0xe968b109, mask = 0x6a, op = clear
Tue Sep 11 10:40:51 2018
GMON updating disk modes for group 1 at 15 for pid 30, osid 19142
Tue Sep 11 10:40:51 2018
NOTE: PST update grp = 1 completed successfully 
NOTE: initiating PST update: grp 1 (DATA), dsk = 8/0xe968b109, mask = 0x7e, op = clear
Tue Sep 11 10:40:51 2018
GMON updating disk modes for group 1 at 16 for pid 30, osid 19142
Tue Sep 11 10:40:51 2018
NOTE: cache closing disk 8 of grp 1: DATA_0008
Tue Sep 11 10:40:51 2018
NOTE: cache closing disk 8 of grp 1: (not open) DATA_0008
Tue Sep 11 10:40:51 2018
NOTE: PST update grp = 1 completed successfully 
Tue Sep 11 10:40:51 2018
GMON updating for reconfiguration, group 1 at 17 for pid 30, osid 19142
Tue Sep 11 10:40:51 2018
NOTE: cache closing disk 8 of grp 1: (not open) DATA_0008
Tue Sep 11 10:40:51 2018
NOTE: group 1 PST updated.
Tue Sep 11 10:40:51 2018
NOTE: membership refresh pending for group 1/0xb68400e (DATA)
Tue Sep 11 10:40:51 2018
GMON querying group 1 at 18 for pid 16, osid 10242
GMON querying group 1 at 19 for pid 16, osid 10242
Tue Sep 11 10:40:51 2018
NOTE: Disk DATA_0008 in mode 0x0 marked for de-assignment
SUCCESS: refreshed membership for 1/0xb68400e (DATA)
NOTE: stopping process ARB0
NOTE: Attempting voting file refresh on diskgroup DATA
Tue Sep 11 10:40:54 2018
SUCCESS: rebalance completed for group 1/0xb68400e (DATA)

SQL> select operation, state, est_minutes from v$asm_operation;

no rows selected

SQL> select GROUP_NUMBER,name,STATE from v$asm_disk where name='DATA_0008';

no rows selected

SQL> alter diskgroup reco  add disk '/dev/oracleasm/disks/ASMDISK9';

Diskgroup altered

-- ASM alert entry
Tue Sep 11 10:45:43 2018
NOTE: Assigning number (2,1) to disk (/dev/oracleasm/disks/ASMDISK9)
NOTE: requesting all-instance membership refresh for group=2
NOTE: Disk 1 in group 2 is assigned fgnum=2
NOTE: discarding redo for group 2 disk 1
NOTE: initializing header (replicated) on grp 2 disk RECO_0001
NOTE: initializing header on grp 2 disk RECO_0001
NOTE: requesting all-instance disk validation for group=2
Tue Sep 11 10:45:44 2018
NOTE: skipping rediscovery for group 2/0xb78400f (RECO) on local instance.
Tue Sep 11 10:45:44 2018
NOTE: requesting all-instance disk validation for group=2
Tue Sep 11 10:45:44 2018
NOTE: skipping rediscovery for group 2/0xb78400f (RECO) on local instance.
Tue Sep 11 10:45:44 2018
GMON updating for reconfiguration, group 2 at 20 for pid 28, osid 20206
Tue Sep 11 10:45:44 2018
NOTE: group 2 PST updated.
Tue Sep 11 10:45:44 2018
NOTE: membership refresh pending for group 2/0xb78400f (RECO)
Tue Sep 11 10:45:44 2018
GMON querying group 2 at 21 for pid 16, osid 10242
NOTE: cache opening disk 1 of grp 2: RECO_0001 path:/dev/oracleasm/disks/ASMDISK9
GMON querying group 2 at 22 for pid 16, osid 10242
Tue Sep 11 10:45:44 2018
SUCCESS: refreshed membership for 2/0xb78400f (RECO)
Tue Sep 11 10:45:44 2018
SUCCESS: alter diskgroup reco  add disk '/dev/oracleasm/disks/ASMDISK9'
NOTE: Attempting voting file refresh on diskgroup RECO
Tue Sep 11 10:45:45 2018
NOTE: starting rebalance of group 2/0xb78400f (RECO) at power 1
Starting background process ARB0
Tue Sep 11 10:45:45 2018
ARB0 started with pid=31, OS id=20232 
NOTE: assigning ARB0 to group 2/0xb78400f (RECO) with 1 parallel I/O



Wednesday 11 July 2018

Broken DG configuration

Few days ago I 've run into the following situation on a DG configuration:
  • the primary shipped logs to the standby properly, but reported errors if asked for status in dgmgrl
    DGMGRL> show configuration
    
    Configuration - dbd
    
      Protection Mode: MaxPerformance
      Databases:
        db1d - Primary database
        db2d - Physical standby database
    
    Fast-Start Failover: DISABLED
    
    Configuration Status:
    ORA-16610: command "Broker operation" in progress
    DGM-17017: unable to determine configuration status
    
  • the standby initially did not process archivelogs, but after configuration reset (disable and enable operations) it started to process them, but also reported errors in dgmgrl
  • it was possible to log in from the primary to the standby
  • it was possible to log in from the standby to the primary
  • I copied password file from the primary to the standby to ensure this is the same file
  • when calling show database on the primary the errors were reported for the standby
  • when calling show database on the standby the errors were reported for the primary
What was the cause?
It showed that the problem was generated by the TNS entries on the primary - apparently the TNS alias definitions for the primary and the standby there included some blank characters, which caused the alias to be read improperly in case of dgmgrl - at the same time simple call to those aliases through sqlplus went ok.
Once I removed all the formatting in the tnsnames.ora, whole thing started to work properly. What is a bit interesting is that the errors in the TNS were present only on the primary, while the dgmgrl calls malfunctioned on both ends.

Role order

Lately a colleague of mine run into the problem with too many roles - some account have them 150, while maximally allowed are 148. He wanted to establish the order in which the roles are assigned to the user as obviously only the excess over 148 was thrown away.
He has found that the roles are loaded with the following query:
select /*+ connect_by_filtering */ privilege#,level from sysauth$ connect by grantee#=prior privilege# and privilege#>0 start with grantee#=:1 and privilege#>0; 
Thus it seems as the the order may be random, though it is possible that usually the data is returned in the order of insertion.

Friday 29 June 2018

ORA-01554: transaction concurrency limit reached reason:no undo segment found with available slot params:0, 0 :

Some time ago I found such error on one of our databases. It is a bit mysterious as there are not much about it on Metalink (few bugs) or on the internet (at least when searching by Google). But indeed the database hit a limit of concurrent transactions in some very short period of time. On the performance chart in OEM there is a high peak with an INSERT to some custom table, with waits on TX - index contention, buffer busy, enq: SQ contention and undo segment tx slot.
The database is working with UNDO_MANAGEMENT=AUTO, which is a default setting and since 10g used very widely, so the administrator may simply forget about rollback segments management. According to the docs with this setting some work parameters relevant to the rollback management are simply ignored (TRANSACTIONS, TRANSACTIONS_PER_ROLLBACK_SEGMENT), while the number of transactions is limited by the undo tablespace size.
So it seems we should not experience this error, but rather session suspension and the error about lack of space in the undo tablespace.
Yet here are are. It is clear from the numbers below, that still there is a space in the undo (up to 4598 undo blocks used
What I could check is this:
select BEGIN_TIME, UNDOTSN, UNDOBLKS, TXNCOUNT, MAXQUERYLEN, MAXQUERYID, MAXCONCURRENCY, 
ACTIVEBLKS, UNEXPIREDBLKS, EXPIREDBLKS, TUNED_UNDORETENTION 
from v$undostat 
where begin_time between to_date('2018-06-25 14', 'yyyy-mm-dd hh24') and to_date('2018-06-25 16', 'yyyy-mm-dd hh24') 
order by begin_time;

BEGIN_TIMEUNDOBLKSTXNCOUNTMAXQUERYLENMAXCONCURRENCYACTIVEBLKSUNEXPIREDBLKSEXPIREDBLKSTUNED_UNDORETENTION
2018-06-25 14:3907816520160071682372
2018-06-25 14:4914119116532160071682373
2018-06-25 14:5945981042251653131160473658882373
2018-06-25 15:092169103321933311160665643521653
2018-06-25 15:19228217341160806428162454
Since the midnight the number of undo blocks consumed was 63475 with 524416 available blocks (so ~12%) and maximum query length was below 1h. Thus definitely we were far below the undo space limit. From this figures it is clear that still it is possible to hit an error typical for MANUAL undo management scenario, while on AUTO - as here we are.

And here few words of wisdom from Jonathan Lewis:
Taking a quick guess:

An (automatic) undo segment has 34 slots in the transaction table (the thing in the undo segment header).
If you and up with a small number N of undo segments then you can't have more than 34 * N concurrent transactions active because each one needs a transaction table slot and the next attempt to start a transaction would see ORA-01554

It seems a little unlikely that this would happan in typical system since Oracle would usually end up stealing an extent from an existing undo segment to use it as the first extent of a new undo segment. But it wouldn't be hard to set up a mechanism that started with a small number of undo segments in a relatively small undo tablespace - started some transactions (then leave them idle forever after the first change - e.g. they might have come in from a remote database) that made it impossible to free any extents, then did a lot of work to allocate all the extents (but still left a little space in allocated extents for continuting transactions), then started lots of transactions until you hit the error.

And Julian Dyke's presentation:
Undo segments are allocated at instance startup.
Undo segments can be added dynamically.

Each undo segment header contains
  • Pool of free undo extents
  • Set of undo slots

One undo slot is allocated to each transaction.
Undo slot contains list of undo extents.
Extents can migrate from one undo segment to another.
Undo slots are used cyclically:
  • remain in header as long as possible
  • reduces probability of ORA-01555: Snapshot too old

With all this in mind I would say that 1000 times higher transaction rate is in this particular case the main reason for the error. Apparently the db is not able to extend its concurrent transactions rate capability fast enough to respond to extended demand. Possible action here could be initial setting of _rollback_segment_count to a higher value than 12, which are set here.