Remigium: 07/06/2015

Wednesday, 10 June 2015

Today's action

Event

During the night we've got some messages from one of the databases' hosts with warning with message about storage space threshold has been crossed. This was a resource on which we put archivelogs, so quite important.

Action

One of my colleagues started with launching manually standard script for backup. Normally such script is called by cron in scheduled moments. Unfortunately and unexpectedly the database has produced a massive number of archivelogs, on which the system has not been prepared at all - thus we saw that apparently there is something wrong with application processing. But then it went to the point the backup script has not been working fast enough to free storage space for further logs and due to some other circumstances jammed itself and backup system administrator has to kill it.

The diagnosis was done with the following tools:

OEM - with this we were able to find the malicious application and call for its restart

an ad hoc script showed that the problematic transactions frequently finished with rollback, which was the direct cause of massive archivelogs growth. The script allows for realizing the status of a transaction and its progress in the form of used_ublk (i.e. undo blocks used in the transaction). Subsequent calls show the change in number of used blocks - in case of rollbacks the value diminishes of course.

set linesize 300 pagesize 1000
col rollback for a25

SELECT s.username,
       s.sid,
       s.serial#,
       t.used_ublk,
       t.used_urec,
       rs.segment_name,
       round(r.rssize/1024/1024) rs_sz_mb,
       r.status,
       case 
       when bitand(t.flag,power(2,7)) > 0 
       then 'rollback in progress'
       else 'other'
       end as "F Status"
FROM   v$transaction t,
       v$session s,
       v$rollstat r,
       dba_rollback_segs rs
WHERE  s.saddr = t.ses_addr
AND    t.xidusn = r.usn
AND   rs.segment_id = t.xidusn
ORDER BY t.used_ublk DESC

Finally the application has been restarted, and the application behaviour returned to its normal pace.

Clean-up

Yet during the diagnosis phase I have started again the backup script to free the space, but the script finished with errors due to earlier move of some archivelogs to another location. I did not know at that time what happened with those logs and unable to contact my colleague, so crosschecked the archivelogs and run backup again. This time all has been working properly. In the meantime sysadmins added some 20G to the archivelog storage resource because still the archivelog production has outpaced the backup process. After the application restart there has been some 20 minutes when some transactions still has been rolled back, but after this time the used storage space has dropped to usual state.
The last part of clean-up was to synchronize rman catalog, register the moved archivelogs and backup them as well.

resync catalog ;
catalog start with '[path]';
backup archivelog sequence between 40556 and 40999 delete all input;