Remigium

Friday, 14 December 2018

ORA-07286: sksagdi: cannot obtain device information

Lately we had an issue regarding a Dataguard configuration - quite myesterious.
In dgmgrl there were reported standard errors about broken transport and inability to solve the gap and ORA-16810, which is actually the error, which just tell You, something is wrong (and not what is wrong).
The behaviour was as follows - the apply processes worked ok, so if I restored the archivelogs on the standby, they were applied even up to the current log. But then nothing - so it was clear there was a problem with the transport. But resetting connections was in vain, we also changed some parameters in the dataguard configuration, still nothing.
Finally the right solution was to restart the standby, which was a bit counterintuitive as the transport is mainly the feature of the primary. This was also fortunate, as there was no problem to perform that at once without service disruption.

Thursday, 22 November 2018

Logon storms

https://www.slideshare.net/tanelp/tanel-poder-troubleshooting-complex-oracle-performance-issues-part-1

RATE_LIMIT parameter in listener.ora
_logout_storm_rate - instance parameter

Tuesday, 2 October 2018

Work with EMCLI

Starting the work

Initially one have to login - without that the commands are not recognized as valid and even help is reduced

[oracle@host ~]$ emcli help
Summary of commands:


  Basic Administration Verbs
    argfile    -- Execute emcli verbs from a file
    help       -- Get help for emcli verbs (Usage: emcli help [verb_name])
    invoke_ws     -- Invoke EM web service.
    login      -- Login to the EM Management Server (OMS)
    logout     -- Logout from the EM Management Server
    setup      -- Setup emcli to work with an EM Management Server
    status     -- List emcli configuration details
    sync       -- Synchronize with the EM Management Server
    version    -- List emcli verb versions or the emcli client version

  To get the help for a single verb, execute - emcli help

So we start

[oracle@host ~]$ emcli login -username=sysman
Enter password 

Login successful

Now help looks way more elaborate:

[oracle@host ~]$ emcli help
Summary of commands:


  Basic Administration Verbs
    argfile    -- Execute emcli verbs from a file
    help       -- Get help for emcli verbs (Usage: emcli help [verb_name])
    invoke_ws     -- Invoke EM web service.
    login      -- Login to the EM Management Server (OMS)
    logout     -- Logout from the EM Management Server
    setup      -- Setup emcli to work with an EM Management Server
    status     -- List emcli configuration details
    sync       -- Synchronize with the EM Management Server
    version    -- List emcli verb versions or the emcli client version

  ADM Verbs
    associate_target_to_adm     -- Associate a target to an application data model.
    export_adm     -- Export Application Data Model to a specified .xml file.
    import_adm     -- Import Application Data Model from a specified .xml file.
[..]

  WebSphere Target Management Verbs
    discover_was    -- Discover a IBM WebSphere Cell or a IBM Websphere Standalone Server
    refresh_was    -- Refresh IBM WebSphere Cells
argus
  To get the help for a single verb, execute - emcli help

Adding a new blackout reason

## to add:
[oracle@host ~]$ emcli add_blackout_reason -name="Target disabled intentionally"
Added Blackout Reason "Target disabled intentionally" successfully

## to list
[oracle@host ~]$ emcli get_blackout_reasons
AS: Application Server Bounce
AS: Application Server Configuration Change
AS: Application Server Emergency
[..]

Tuesday, 11 September 2018

ASM: moving disk between groups

SQL> select name||':'||path from v$asm_disk;

[..]
DATA_0008:/dev/oracleasm/disks/ASMDISK9

-- the operation
SQL> alter diskgroup DATA drop disk {chosen name from previous query} [rebalance power {speed between 0 and 11}];

-- the operation progress
SQL> select operation, state, est_minutes from v$asm_operation;

OPERA STAT EST_MINUTES
----- ---- -----------
REBAL RUN      7
REBAL WAIT      0

SQL> select GROUP_NUMBER,name,STATE from v$asm_disk where name='DATA_0008';

GROUP_NUMBER NAME       STATE
------------ ------------------------------ --------
    1 DATA_0008       DROPPING

-- ASM alert
Tue Sep 11 10:40:51 2018
NOTE: requesting all-instance membership refresh for group=1
Tue Sep 11 10:40:51 2018
GMON updating for reconfiguration, group 1 at 14 for pid 30, osid 19142
Tue Sep 11 10:40:51 2018
NOTE: group 1 PST updated.
SUCCESS: grp 1 disk DATA_0008 emptied
NOTE: erasing header (replicated) on grp 1 disk DATA_0008
NOTE: erasing header on grp 1 disk DATA_0008
NOTE: process _x000_+asm (19142) initiating offline of disk 8.3915952393 (DATA_0008) with mask 0x7e in group 1 (DATA) without client assisting
NOTE: initiating PST update: grp 1 (DATA), dsk = 8/0xe968b109, mask = 0x6a, op = clear
Tue Sep 11 10:40:51 2018
GMON updating disk modes for group 1 at 15 for pid 30, osid 19142
Tue Sep 11 10:40:51 2018
NOTE: PST update grp = 1 completed successfully 
NOTE: initiating PST update: grp 1 (DATA), dsk = 8/0xe968b109, mask = 0x7e, op = clear
Tue Sep 11 10:40:51 2018
GMON updating disk modes for group 1 at 16 for pid 30, osid 19142
Tue Sep 11 10:40:51 2018
NOTE: cache closing disk 8 of grp 1: DATA_0008
Tue Sep 11 10:40:51 2018
NOTE: cache closing disk 8 of grp 1: (not open) DATA_0008
Tue Sep 11 10:40:51 2018
NOTE: PST update grp = 1 completed successfully 
Tue Sep 11 10:40:51 2018
GMON updating for reconfiguration, group 1 at 17 for pid 30, osid 19142
Tue Sep 11 10:40:51 2018
NOTE: cache closing disk 8 of grp 1: (not open) DATA_0008
Tue Sep 11 10:40:51 2018
NOTE: group 1 PST updated.
Tue Sep 11 10:40:51 2018
NOTE: membership refresh pending for group 1/0xb68400e (DATA)
Tue Sep 11 10:40:51 2018
GMON querying group 1 at 18 for pid 16, osid 10242
GMON querying group 1 at 19 for pid 16, osid 10242
Tue Sep 11 10:40:51 2018
NOTE: Disk DATA_0008 in mode 0x0 marked for de-assignment
SUCCESS: refreshed membership for 1/0xb68400e (DATA)
NOTE: stopping process ARB0
NOTE: Attempting voting file refresh on diskgroup DATA
Tue Sep 11 10:40:54 2018
SUCCESS: rebalance completed for group 1/0xb68400e (DATA)

SQL> select operation, state, est_minutes from v$asm_operation;

no rows selected

SQL> select GROUP_NUMBER,name,STATE from v$asm_disk where name='DATA_0008';

no rows selected

SQL> alter diskgroup reco  add disk '/dev/oracleasm/disks/ASMDISK9';

Diskgroup altered

-- ASM alert entry
Tue Sep 11 10:45:43 2018
NOTE: Assigning number (2,1) to disk (/dev/oracleasm/disks/ASMDISK9)
NOTE: requesting all-instance membership refresh for group=2
NOTE: Disk 1 in group 2 is assigned fgnum=2
NOTE: discarding redo for group 2 disk 1
NOTE: initializing header (replicated) on grp 2 disk RECO_0001
NOTE: initializing header on grp 2 disk RECO_0001
NOTE: requesting all-instance disk validation for group=2
Tue Sep 11 10:45:44 2018
NOTE: skipping rediscovery for group 2/0xb78400f (RECO) on local instance.
Tue Sep 11 10:45:44 2018
NOTE: requesting all-instance disk validation for group=2
Tue Sep 11 10:45:44 2018
NOTE: skipping rediscovery for group 2/0xb78400f (RECO) on local instance.
Tue Sep 11 10:45:44 2018
GMON updating for reconfiguration, group 2 at 20 for pid 28, osid 20206
Tue Sep 11 10:45:44 2018
NOTE: group 2 PST updated.
Tue Sep 11 10:45:44 2018
NOTE: membership refresh pending for group 2/0xb78400f (RECO)
Tue Sep 11 10:45:44 2018
GMON querying group 2 at 21 for pid 16, osid 10242
NOTE: cache opening disk 1 of grp 2: RECO_0001 path:/dev/oracleasm/disks/ASMDISK9
GMON querying group 2 at 22 for pid 16, osid 10242
Tue Sep 11 10:45:44 2018
SUCCESS: refreshed membership for 2/0xb78400f (RECO)
Tue Sep 11 10:45:44 2018
SUCCESS: alter diskgroup reco  add disk '/dev/oracleasm/disks/ASMDISK9'
NOTE: Attempting voting file refresh on diskgroup RECO
Tue Sep 11 10:45:45 2018
NOTE: starting rebalance of group 2/0xb78400f (RECO) at power 1
Starting background process ARB0
Tue Sep 11 10:45:45 2018
ARB0 started with pid=31, OS id=20232 
NOTE: assigning ARB0 to group 2/0xb78400f (RECO) with 1 parallel I/O

Wednesday, 11 July 2018

Broken DG configuration

Few days ago I 've run into the following situation on a DG configuration:

the primary shipped logs to the standby properly, but reported errors if asked for status in dgmgrl

DGMGRL> show configuration

Configuration - dbd

  Protection Mode: MaxPerformance
  Databases:
    db1d - Primary database
    db2d - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
ORA-16610: command "Broker operation" in progress
DGM-17017: unable to determine configuration status

the standby initially did not process archivelogs, but after configuration reset (disable and enable operations) it started to process them, but also reported errors in dgmgrl
it was possible to log in from the primary to the standby
it was possible to log in from the standby to the primary
I copied password file from the primary to the standby to ensure this is the same file
when calling show database on the primary the errors were reported for the standby
when calling show database on the standby the errors were reported for the primary

What was the cause?
It showed that the problem was generated by the TNS entries on the primary - apparently the TNS alias definitions for the primary and the standby there included some blank characters, which caused the alias to be read improperly in case of dgmgrl - at the same time simple call to those aliases through sqlplus went ok.
Once I removed all the formatting in the tnsnames.ora, whole thing started to work properly. What is a bit interesting is that the errors in the TNS were present only on the primary, while the dgmgrl calls malfunctioned on both ends.

Role order

Lately a colleague of mine run into the problem with too many roles - some account have them 150, while maximally allowed are 148. He wanted to establish the order in which the roles are assigned to the user as obviously only the excess over 148 was thrown away.
He has found that the roles are loaded with the following query:

select /*+ connect_by_filtering */ privilege#,level from sysauth$ connect by grantee#=prior privilege# and privilege#>0 start with grantee#=:1 and privilege#>0;

Thus it seems as the the order may be random, though it is possible that usually the data is returned in the order of insertion.

Friday, 29 June 2018

ORA-01554: transaction concurrency limit reached reason:no undo segment found with available slot params:0, 0 :

Some time ago I found such error on one of our databases. It is a bit mysterious as there are not much about it on Metalink (few bugs) or on the internet (at least when searching by Google). But indeed the database hit a limit of concurrent transactions in some very short period of time. On the performance chart in OEM there is a high peak with an INSERT to some custom table, with waits on TX - index contention, buffer busy, enq: SQ contention and undo segment tx slot.
The database is working with UNDO_MANAGEMENT=AUTO, which is a default setting and since 10g used very widely, so the administrator may simply forget about rollback segments management. According to the docs with this setting some work parameters relevant to the rollback management are simply ignored (TRANSACTIONS, TRANSACTIONS_PER_ROLLBACK_SEGMENT), while the number of transactions is limited by the undo tablespace size.
So it seems we should not experience this error, but rather session suspension and the error about lack of space in the undo tablespace.
Yet here are are. It is clear from the numbers below, that still there is a space in the undo (up to 4598 undo blocks used
What I could check is this:

select BEGIN_TIME, UNDOTSN, UNDOBLKS, TXNCOUNT, MAXQUERYLEN, MAXQUERYID, MAXCONCURRENCY, 
ACTIVEBLKS, UNEXPIREDBLKS, EXPIREDBLKS, TUNED_UNDORETENTION 
from v$undostat 
where begin_time between to_date('2018-06-25 14', 'yyyy-mm-dd hh24') and to_date('2018-06-25 16', 'yyyy-mm-dd hh24') 
order by begin_time;

BEGIN_TIME	UNDOBLKS	TXNCOUNT	MAXQUERYLEN	MAXCONCURRENCY	ACTIVEBLKS	UNEXPIREDBLKS	EXPIREDBLKS	TUNED_UNDORETENTION
2018-06-25 14:39	0	78	1652	0	160	0	7168	2372
2018-06-25 14:49	1411	91	1653	2	160	0	7168	2373
2018-06-25 14:59	4598	104225	1653	131	160	4736	5888	2373
2018-06-25 15:09	2169	103321	933	311	160	6656	4352	1653
2018-06-25 15:19	2	282	1734	1	160	8064	2816	2454

Since the midnight the number of undo blocks consumed was 63475 with 524416 available blocks (so ~12%) and maximum query length was below 1h. Thus definitely we were far below the undo space limit. From this figures it is clear that still it is possible to hit an error typical for MANUAL undo management scenario, while on AUTO - as here we are.

And here few words of wisdom from Jonathan Lewis:

Taking a quick guess:

An (automatic) undo segment has 34 slots in the transaction table (the thing in the undo segment header).
If you and up with a small number N of undo segments then you can't have more than 34 * N concurrent transactions active because each one needs a transaction table slot and the next attempt to start a transaction would see ORA-01554

It seems a little unlikely that this would happan in typical system since Oracle would usually end up stealing an extent from an existing undo segment to use it as the first extent of a new undo segment. But it wouldn't be hard to set up a mechanism that started with a small number of undo segments in a relatively small undo tablespace - started some transactions (then leave them idle forever after the first change - e.g. they might have come in from a remote database) that made it impossible to free any extents, then did a lot of work to allocate all the extents (but still left a little space in allocated extents for continuting transactions), then started lots of transactions until you hit the error.

And Julian Dyke's presentation:

Undo segments are allocated at instance startup.
Undo segments can be added dynamically.

Each undo segment header contains

Pool of free undo extents
Set of undo slots

One undo slot is allocated to each transaction.
Undo slot contains list of undo extents.
Extents can migrate from one undo segment to another.
Undo slots are used cyclically:

remain in header as long as possible
reduces probability of ORA-01555: Snapshot too old

With all this in mind I would say that 1000 times higher transaction rate is in this particular case the main reason for the error. Apparently the db is not able to extend its concurrent transactions rate capability fast enough to respond to extended demand. Possible action here could be initial setting of _rollback_segment_count to a higher value than 12, which are set here.