Lately we had an issue regarding a Dataguard configuration - quite myesterious.
In dgmgrl there were reported standard errors about broken transport and inability to solve the gap and ORA-16810, which is actually the error, which just tell You, something is wrong (and not what is wrong).
The behaviour was as follows - the apply processes worked ok, so if I restored the archivelogs on the standby, they were applied even up to the current log. But then nothing - so it was clear there was a problem with the transport. But resetting connections was in vain, we also changed some parameters in the dataguard configuration, still nothing.
Finally the right solution was to restart the standby, which was a bit counterintuitive as the transport is mainly the feature of the primary. This was also fortunate, as there was no problem to perform that at once without service disruption.
Remigium
... about my professional issues
Friday 14 December 2018
Thursday 22 November 2018
Logon storms
https://www.slideshare.net/tanelp/tanel-poder-troubleshooting-complex-oracle-performance-issues-part-1
RATE_LIMIT parameter in listener.ora
_logout_storm_rate - instance parameter
RATE_LIMIT parameter in listener.ora
_logout_storm_rate - instance parameter
Tuesday 2 October 2018
Work with EMCLI
Starting the work
Initially one have to login - without that the commands are not recognized as valid and even help is reduced[oracle@host ~]$ emcli help Summary of commands: Basic Administration Verbs argfile -- Execute emcli verbs from a file help -- Get help for emcli verbs (Usage: emcli help [verb_name]) invoke_ws -- Invoke EM web service. login -- Login to the EM Management Server (OMS) logout -- Logout from the EM Management Server setup -- Setup emcli to work with an EM Management Server status -- List emcli configuration details sync -- Synchronize with the EM Management Server version -- List emcli verb versions or the emcli client version To get the help for a single verb, execute - emcli helpSo we start
[oracle@host ~]$ emcli login -username=sysman Enter password Login successfulNow help looks way more elaborate:
[oracle@host ~]$ emcli help Summary of commands: Basic Administration Verbs argfile -- Execute emcli verbs from a file help -- Get help for emcli verbs (Usage: emcli help [verb_name]) invoke_ws -- Invoke EM web service. login -- Login to the EM Management Server (OMS) logout -- Logout from the EM Management Server setup -- Setup emcli to work with an EM Management Server status -- List emcli configuration details sync -- Synchronize with the EM Management Server version -- List emcli verb versions or the emcli client version ADM Verbs associate_target_to_adm -- Associate a target to an application data model. export_adm -- Export Application Data Model to a specified .xml file. import_adm -- Import Application Data Model from a specified .xml file. [..] WebSphere Target Management Verbs discover_was -- Discover a IBM WebSphere Cell or a IBM Websphere Standalone Server refresh_was -- Refresh IBM WebSphere Cells argus To get the help for a single verb, execute - emcli help
Adding a new blackout reason
## to add: [oracle@host ~]$ emcli add_blackout_reason -name="Target disabled intentionally" Added Blackout Reason "Target disabled intentionally" successfully ## to list [oracle@host ~]$ emcli get_blackout_reasons AS: Application Server Bounce AS: Application Server Configuration Change AS: Application Server Emergency [..]
Tuesday 11 September 2018
ASM: moving disk between groups
SQL> select name||':'||path from v$asm_disk; [..] DATA_0008:/dev/oracleasm/disks/ASMDISK9 -- the operation SQL> alter diskgroup DATA drop disk {chosen name from previous query} [rebalance power {speed between 0 and 11}]; -- the operation progress SQL> select operation, state, est_minutes from v$asm_operation; OPERA STAT EST_MINUTES ----- ---- ----------- REBAL RUN 7 REBAL WAIT 0 SQL> select GROUP_NUMBER,name,STATE from v$asm_disk where name='DATA_0008'; GROUP_NUMBER NAME STATE ------------ ------------------------------ -------- 1 DATA_0008 DROPPING -- ASM alert Tue Sep 11 10:40:51 2018 NOTE: requesting all-instance membership refresh for group=1 Tue Sep 11 10:40:51 2018 GMON updating for reconfiguration, group 1 at 14 for pid 30, osid 19142 Tue Sep 11 10:40:51 2018 NOTE: group 1 PST updated. SUCCESS: grp 1 disk DATA_0008 emptied NOTE: erasing header (replicated) on grp 1 disk DATA_0008 NOTE: erasing header on grp 1 disk DATA_0008 NOTE: process _x000_+asm (19142) initiating offline of disk 8.3915952393 (DATA_0008) with mask 0x7e in group 1 (DATA) without client assisting NOTE: initiating PST update: grp 1 (DATA), dsk = 8/0xe968b109, mask = 0x6a, op = clear Tue Sep 11 10:40:51 2018 GMON updating disk modes for group 1 at 15 for pid 30, osid 19142 Tue Sep 11 10:40:51 2018 NOTE: PST update grp = 1 completed successfully NOTE: initiating PST update: grp 1 (DATA), dsk = 8/0xe968b109, mask = 0x7e, op = clear Tue Sep 11 10:40:51 2018 GMON updating disk modes for group 1 at 16 for pid 30, osid 19142 Tue Sep 11 10:40:51 2018 NOTE: cache closing disk 8 of grp 1: DATA_0008 Tue Sep 11 10:40:51 2018 NOTE: cache closing disk 8 of grp 1: (not open) DATA_0008 Tue Sep 11 10:40:51 2018 NOTE: PST update grp = 1 completed successfully Tue Sep 11 10:40:51 2018 GMON updating for reconfiguration, group 1 at 17 for pid 30, osid 19142 Tue Sep 11 10:40:51 2018 NOTE: cache closing disk 8 of grp 1: (not open) DATA_0008 Tue Sep 11 10:40:51 2018 NOTE: group 1 PST updated. Tue Sep 11 10:40:51 2018 NOTE: membership refresh pending for group 1/0xb68400e (DATA) Tue Sep 11 10:40:51 2018 GMON querying group 1 at 18 for pid 16, osid 10242 GMON querying group 1 at 19 for pid 16, osid 10242 Tue Sep 11 10:40:51 2018 NOTE: Disk DATA_0008 in mode 0x0 marked for de-assignment SUCCESS: refreshed membership for 1/0xb68400e (DATA) NOTE: stopping process ARB0 NOTE: Attempting voting file refresh on diskgroup DATA Tue Sep 11 10:40:54 2018 SUCCESS: rebalance completed for group 1/0xb68400e (DATA) SQL> select operation, state, est_minutes from v$asm_operation; no rows selected SQL> select GROUP_NUMBER,name,STATE from v$asm_disk where name='DATA_0008'; no rows selected SQL> alter diskgroup reco add disk '/dev/oracleasm/disks/ASMDISK9'; Diskgroup altered -- ASM alert entry Tue Sep 11 10:45:43 2018 NOTE: Assigning number (2,1) to disk (/dev/oracleasm/disks/ASMDISK9) NOTE: requesting all-instance membership refresh for group=2 NOTE: Disk 1 in group 2 is assigned fgnum=2 NOTE: discarding redo for group 2 disk 1 NOTE: initializing header (replicated) on grp 2 disk RECO_0001 NOTE: initializing header on grp 2 disk RECO_0001 NOTE: requesting all-instance disk validation for group=2 Tue Sep 11 10:45:44 2018 NOTE: skipping rediscovery for group 2/0xb78400f (RECO) on local instance. Tue Sep 11 10:45:44 2018 NOTE: requesting all-instance disk validation for group=2 Tue Sep 11 10:45:44 2018 NOTE: skipping rediscovery for group 2/0xb78400f (RECO) on local instance. Tue Sep 11 10:45:44 2018 GMON updating for reconfiguration, group 2 at 20 for pid 28, osid 20206 Tue Sep 11 10:45:44 2018 NOTE: group 2 PST updated. Tue Sep 11 10:45:44 2018 NOTE: membership refresh pending for group 2/0xb78400f (RECO) Tue Sep 11 10:45:44 2018 GMON querying group 2 at 21 for pid 16, osid 10242 NOTE: cache opening disk 1 of grp 2: RECO_0001 path:/dev/oracleasm/disks/ASMDISK9 GMON querying group 2 at 22 for pid 16, osid 10242 Tue Sep 11 10:45:44 2018 SUCCESS: refreshed membership for 2/0xb78400f (RECO) Tue Sep 11 10:45:44 2018 SUCCESS: alter diskgroup reco add disk '/dev/oracleasm/disks/ASMDISK9' NOTE: Attempting voting file refresh on diskgroup RECO Tue Sep 11 10:45:45 2018 NOTE: starting rebalance of group 2/0xb78400f (RECO) at power 1 Starting background process ARB0 Tue Sep 11 10:45:45 2018 ARB0 started with pid=31, OS id=20232 NOTE: assigning ARB0 to group 2/0xb78400f (RECO) with 1 parallel I/O
Wednesday 11 July 2018
Broken DG configuration
Few days ago I 've run into the following situation on a DG configuration:
It showed that the problem was generated by the TNS entries on the primary - apparently the TNS alias definitions for the primary and the standby there included some blank characters, which caused the alias to be read improperly in case of dgmgrl - at the same time simple call to those aliases through sqlplus went ok.
Once I removed all the formatting in the tnsnames.ora, whole thing started to work properly. What is a bit interesting is that the errors in the TNS were present only on the primary, while the dgmgrl calls malfunctioned on both ends.
- the primary shipped logs to the standby properly, but reported errors if asked for status in dgmgrl
DGMGRL> show configuration Configuration - dbd Protection Mode: MaxPerformance Databases: db1d - Primary database db2d - Physical standby database Fast-Start Failover: DISABLED Configuration Status: ORA-16610: command "Broker operation" in progress DGM-17017: unable to determine configuration status
- the standby initially did not process archivelogs, but after configuration reset (disable and enable operations) it started to process them, but also reported errors in dgmgrl
- it was possible to log in from the primary to the standby
- it was possible to log in from the standby to the primary
- I copied password file from the primary to the standby to ensure this is the same file
- when calling show database on the primary the errors were reported for the standby
- when calling show database on the standby the errors were reported for the primary
It showed that the problem was generated by the TNS entries on the primary - apparently the TNS alias definitions for the primary and the standby there included some blank characters, which caused the alias to be read improperly in case of dgmgrl - at the same time simple call to those aliases through sqlplus went ok.
Once I removed all the formatting in the tnsnames.ora, whole thing started to work properly. What is a bit interesting is that the errors in the TNS were present only on the primary, while the dgmgrl calls malfunctioned on both ends.
Role order
Lately a colleague of mine run into the problem with too many roles - some account have them 150, while maximally allowed are 148. He wanted to establish the order in which the roles are assigned to the user as obviously only the excess over 148 was thrown away.
He has found that the roles are loaded with the following query:
He has found that the roles are loaded with the following query:
select /*+ connect_by_filtering */ privilege#,level from sysauth$ connect by grantee#=prior privilege# and privilege#>0 start with grantee#=:1 and privilege#>0;Thus it seems as the the order may be random, though it is possible that usually the data is returned in the order of insertion.
Friday 29 June 2018
ORA-01554: transaction concurrency limit reached reason:no undo segment found with available slot params:0, 0 :
Some time ago I found such error on one of our databases. It is a bit mysterious as there are not much about it on Metalink (few bugs) or on the internet (at least when searching by Google). But indeed the database hit a limit of concurrent transactions in some very short period of time. On the performance chart in OEM there is a high peak with an INSERT to some custom table, with waits on TX - index contention, buffer busy, enq: SQ contention and undo segment tx slot.
The database is working with UNDO_MANAGEMENT=AUTO, which is a default setting and since 10g used very widely, so the administrator may simply forget about rollback segments management. According to the docs with this setting some work parameters relevant to the rollback management are simply ignored (TRANSACTIONS, TRANSACTIONS_PER_ROLLBACK_SEGMENT), while the number of transactions is limited by the undo tablespace size.
So it seems we should not experience this error, but rather session suspension and the error about lack of space in the undo tablespace.
Yet here are are. It is clear from the numbers below, that still there is a space in the undo (up to 4598 undo blocks used
What I could check is this:
Since the midnight the number of undo blocks consumed was 63475 with 524416 available blocks (so ~12%) and maximum query length was below 1h. Thus definitely we were far below the undo space limit. From this figures it is clear that still it is possible to hit an error typical for MANUAL undo management scenario, while on AUTO - as here we are.
And here few words of wisdom from Jonathan Lewis:
And Julian Dyke's presentation:
With all this in mind I would say that 1000 times higher transaction rate is in this particular case the main reason for the error. Apparently the db is not able to extend its concurrent transactions rate capability fast enough to respond to extended demand. Possible action here could be initial setting of _rollback_segment_count to a higher value than 12, which are set here.
The database is working with UNDO_MANAGEMENT=AUTO, which is a default setting and since 10g used very widely, so the administrator may simply forget about rollback segments management. According to the docs with this setting some work parameters relevant to the rollback management are simply ignored (TRANSACTIONS, TRANSACTIONS_PER_ROLLBACK_SEGMENT), while the number of transactions is limited by the undo tablespace size.
So it seems we should not experience this error, but rather session suspension and the error about lack of space in the undo tablespace.
Yet here are are. It is clear from the numbers below, that still there is a space in the undo (up to 4598 undo blocks used
What I could check is this:
select BEGIN_TIME, UNDOTSN, UNDOBLKS, TXNCOUNT, MAXQUERYLEN, MAXQUERYID, MAXCONCURRENCY, ACTIVEBLKS, UNEXPIREDBLKS, EXPIREDBLKS, TUNED_UNDORETENTION from v$undostat where begin_time between to_date('2018-06-25 14', 'yyyy-mm-dd hh24') and to_date('2018-06-25 16', 'yyyy-mm-dd hh24') order by begin_time;
BEGIN_TIME | UNDOBLKS | TXNCOUNT | MAXQUERYLEN | MAXCONCURRENCY | ACTIVEBLKS | UNEXPIREDBLKS | EXPIREDBLKS | TUNED_UNDORETENTION |
2018-06-25 14:39 | 0 | 78 | 1652 | 0 | 160 | 0 | 7168 | 2372 |
2018-06-25 14:49 | 1411 | 91 | 1653 | 2 | 160 | 0 | 7168 | 2373 |
2018-06-25 14:59 | 4598 | 104225 | 1653 | 131 | 160 | 4736 | 5888 | 2373 |
2018-06-25 15:09 | 2169 | 103321 | 933 | 311 | 160 | 6656 | 4352 | 1653 |
2018-06-25 15:19 | 2 | 282 | 1734 | 1 | 160 | 8064 | 2816 | 2454 |
And here few words of wisdom from Jonathan Lewis:
Taking a quick guess:
An (automatic) undo segment has 34 slots in the transaction table (the thing in the undo segment header).
If you and up with a small number N of undo segments then you can't have more than 34 * N concurrent transactions active because each one needs a transaction table slot and the next attempt to start a transaction would see ORA-01554
It seems a little unlikely that this would happan in typical system since Oracle would usually end up stealing an extent from an existing undo segment to use it as the first extent of a new undo segment. But it wouldn't be hard to set up a mechanism that started with a small number of undo segments in a relatively small undo tablespace - started some transactions (then leave them idle forever after the first change - e.g. they might have come in from a remote database) that made it impossible to free any extents, then did a lot of work to allocate all the extents (but still left a little space in allocated extents for continuting transactions), then started lots of transactions until you hit the error.
And Julian Dyke's presentation:
Undo segments are allocated at instance startup.
Undo segments can be added dynamically.
Each undo segment header contains
- Pool of free undo extents
- Set of undo slots
One undo slot is allocated to each transaction.
Undo slot contains list of undo extents.
Extents can migrate from one undo segment to another.
Undo slots are used cyclically:
- remain in header as long as possible
- reduces probability of ORA-01555: Snapshot too old
With all this in mind I would say that 1000 times higher transaction rate is in this particular case the main reason for the error. Apparently the db is not able to extend its concurrent transactions rate capability fast enough to respond to extended demand. Possible action here could be initial setting of _rollback_segment_count to a higher value than 12, which are set here.
Subscribe to:
Posts (Atom)