Monday, 12 May 2014

Zabbix on Oracle RDBMS and ORA-00001

Scenario

We performed today a small experiment with Zabbix - fast database restart with hope to keep the application alive based on the memory buffer and data kept in it during the database outage. This approach has failed even if sometimes we see Zabbix working through some time even with the database disabled.
Just after immediate database restart we observed a whole bunch of errors like the one below:
18264:20140512:134056.833 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX.EVENTS_PK) violated [insert into 
events (eventid,source,object,objectid,clock,ns,value,value_changed) values (610764154,1,1,157,1399894856,0,1,0)]

Diagnosis

I set the diagnosis completely based on the symptoms, so it may seem not exact, but I suppose it is proper enough. In general it is bound with the mechanism for creating the ids for rows in Zabbix tables. The mechanism looks like the following:
  • an id is built from the value selected from the ids table for a row designated for a particular table increased by 1
  • the new value is used with a new row
  • the nextid value in the ids table is increased by one
Thus the ids table is used as an equivalent of sequence functionality. It seems however, that the values in this table not necessarily are chosen and updated within one transaction. Thus in the scenario like the one below we end up with the nextid value lower than the maximal value in tables to which Zabbix tries to insert new rows and those inserts fail with the error presented above.
In particular the frequent table with which we have trouble is the events table. Nothing strange though

Solution

UPDATE ZABBIX.IDS SET NEXTID=(select max(eventid) from events)+1 WHERE TABLE_NAME='events' AND FIELD_NAME='eventid';
commit;

Compatibility on active-passive configuration and ORA-600

Today we experienced some small issues on one of the production environments. And btw. the following story I would like to share with.

Situation overview

The database in question works in active-passive configuration with managed standby recovery enabled. From time to time it has some problems with memory settings. As this issue repeats for some time we are already quite experienced with it and act appropriately. And because of nature of this database traffic we are able to run some experiments. And today was also such day.

Compatible parameter

Among other adjustments and settings we've changed the compatible parameter from 11.2.0.0.0 to 11.2.0.3.0. But the change was done only on the active node.

Output

The first what we've got was the ORA-600 error.
[.. from alert.log ..]
Mon May 12 08:48:10 2014
RFS[13]: Assigned to RFS process 7418
RFS[13]: Selected log 7 for thread 1 sequence 140861 dbid -1879016249 branch 793971529
Errors in file /oracle/diag/rdbms/zabbix_node2/zabbix/trace/zabbix_rfs_7412.trc  (incident=192281):
ORA-00600: internal error code, arguments: [2730], [331], [1], [5], [140862], [140862], [512], [512], [], [], [], []

[.. from one of trace files ..]
*** ACTION NAME:() 2014-05-12 08:48:09.667
 
Identified standby redo log 5 for implicit mid-log reconnect
DDE: Problem Key 'ORA 600 [2730]' was flood controlled (0x2) (incident: 192281)
ORA-00600: internal error code, arguments: [2730], [331], [1], [5], [140862], [140862], [512], [512], [], [], [], []
Exception 600 received while writing lno 5 thread 1 seq 140862
*** 2014-05-12 08:48:10.517 4638 krsb.c
krsb_stream_write: Error 600 while attempting to write buffer
krsv_dsga: Dispatching RFS shutdown notification
Of course as all the ORA-600 quite misterious.
As I was not sure at which archivelog the problem appeared and was too lazy to look through the alert log, I simply run in SQL*Plus (nota bene 5th and 6th arguments of the ORA-600 indicate the archivelog sequence on which the problem emerges, we had already few such archivelogs, so one had to find the first occurence of the problem anyway)
recover standby database until cancel
. Now things were much more clear - at one of the archivelogs I have simply got an info, that the server can not process the archivelog with compatibility set to higher than currently set. It was enough to set the compatibility accordingly and restart the standby.