Tuesday, 16 June 2015

Restore points

Below You may find a 'session' from enabling restore point technology.
-- first try old a hacker method - just start do it ;-)
SQL> create restore point rp1 guarantee flashback database;
create restore point rp1 guarantee flashback database
*
ERROR at line 1:
ORA-38784: Cannot create restore point 'RP1'.
ORA-38785: Media recovery must be enabled for guaranteed restore point.
-------------------------
-- thus I see one needs to set db into ARCHIVELOG mode
-- despite that we need also settings for db_recovery_file_dest
-- though we do not need to enable flashback mode

-- creating pfile in order to edit settings
SQL> create pfile from spfile;

File created.

-- create directory for fast recovery area (FRA)
SQL> !mkdir [path]

-- add settings for FRA
SQL> !vim $ORACLE_HOME/dbs/initcdw.ora
--*.db_recovery_file_dest_size=10G
--*.db_recovery_file_dest=[path]'
SQL> create spfile from pfile;

File created.

-- now enabling ARCHIVELOG mode
SQL> startup nomount

ORACLE instance started.

Total System Global Area 2137886720 bytes
Fixed Size      2228200 bytes
Variable Size    503316504 bytes
Database Buffers  1627389952 bytes
Redo Buffers      4952064 bytes

-- create directory for archive logs
SQL> !mkdir /ora_csdb/oradata/CDW/arch

-- set local destination for archive logs to newly created directory
-- of course it was possible to add it to pfile when editing
-- in case of not setting a custom location the archive logs are stored
-- in $ORACLE_HOME/dbs
SQL> alter system set 
SQL> log_archive_dest_1= 'LOCATION=/ora_csdb/oradata/CDW/arch' 
SQL> ;

System altered.

-- switching to ARCHIVELOG mode
SQL> alter database mount;

Database altered.

SQL> alter database archivelog;

Database altered.

SQL> alter database open;

Database altered.

-- there is no switch of flashback on - so as You see this is quite minimal scenario
-- enabling flashback is not required but at a price - one cannot restore to
-- point in time between snapshots
-- recovery writer RVWR is started
-- as You may see, the technology makes use of flashback fascilites

-- creating a restore point
-- as You see now all works properly 
SQL> create  restore point rp1 guarantee flashback database;
[..alert.log..]
Wed Jun 10 12:44:47 2015
Starting background process RVWR
Wed Jun 10 12:44:47 2015
RVWR started with pid=28, OS id=17240 
Allocated 8388608 bytes in shared pool for flashback generation buffer
Wed Jun 10 12:45:27 2015
Created guaranteed restore point RP1
[..alert.log..]
Restore point created.

-- now preparations for a short test
SQL> 
SQL> conn / as sysdba
Connected.
SQL> alter user rems identified by simple1;

User altered.

SQL> conn rems/simple1
Connected.
SQL> create table t1 (a number)
  2  ;

Table created.

SQL> insert into t1 select object_id from dba_objects;

473175 rows created.

SQL> commit;

Commit complete.

-- now we return to the state marked by RP1
SQL> conn / as sysdba
Connected.

-- we need to stop the database and mount it
SQL> flashback database to restore point rp1;
flashback database to restore point rp1
*
ERROR at line 1:
ORA-38757: Database must be mounted and not open to FLASHBACK.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount
ORACLE instance started.

Total System Global Area 2137886720 bytes
Fixed Size      2228200 bytes
Variable Size    503316504 bytes
Database Buffers  1627389952 bytes
Redo Buffers      4952064 bytes
Database mounted.
SQL> flashback database to restore point rp1;
[..alert.log..]
Wed Jun 10 13:12:29 2015
Flashback Restore Start
Wed Jun 10 13:13:05 2015
Flashback Restore Complete
Flashback Media Recovery Start
 started logmerger process
Wed Jun 10 13:15:29 2015
db_recovery_file_dest_size of 10240 MB is 0.30% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Wed Jun 10 13:17:42 2015
Parallel Media Recovery started with 2 slaves
Wed Jun 10 13:17:43 2015
Recovery of Online Redo Log: Thread 1 Group 7 Seq 20393 Reading mem 0
  Mem# 0: /ora_csdb/oradata/CDW/onlinelog/o1_mf_7_97grvlnx_.log
  Mem# 1: /ora_csdb/oradata/CDW/onlinelog/o1_mf_7_97grvp66_.log
Incomplete Recovery applied until change 117062103544 time 06/10/2015 12:45:28
Flashback Media Recovery Complete
Wed Jun 10 13:20:53 2015
Completed: flashback database to restore point rp1
[..]
Flashback complete.

-- after flashback we have to open with RESETLOGS
SQL> alter database open
alter database open
*
ERROR at line 1:
ORA-01589: must use RESETLOGS or NORESETLOGS option for database open

SQL> alter database open resetlogs;    


SQL> alter system set log_archive_format='cdw_%d_%r_%T_%S.arc';

-- %r, %T/%t, %S/%s have to be contained within the archive log format



Wednesday, 10 June 2015

Today's action

Event

During the night we've got some messages from one of the databases' hosts with warning with message about storage space threshold has been crossed. This was a resource on which we put archivelogs, so quite important.

Action

One of my colleagues started with launching manually standard script for backup. Normally such script is called by cron in scheduled moments. Unfortunately and unexpectedly the database has produced a massive number of archivelogs, on which the system has not been prepared at all - thus we saw that apparently there is something wrong with application processing. But then it went to the point the backup script has not been working fast enough to free storage space for further logs and due to some other circumstances jammed itself and backup system administrator has to kill it.

The diagnosis was done with the following tools:
  • OEM - with this we were able to find the malicious application and call for its restart
  • an ad hoc script showed that the problematic transactions frequently finished with rollback, which was the direct cause of massive archivelogs growth. The script allows for realizing the status of a transaction and its progress in the form of used_ublk (i.e. undo blocks used in the transaction). Subsequent calls show the change in number of used blocks - in case of rollbacks the value diminishes of course.
    set linesize 300 pagesize 1000
    col rollback for a25
    
    SELECT s.username,
           s.sid,
           s.serial#,
           t.used_ublk,
           t.used_urec,
           rs.segment_name,
           round(r.rssize/1024/1024) rs_sz_mb,
           r.status,
           case 
           when bitand(t.flag,power(2,7)) > 0 
           then 'rollback in progress'
           else 'other'
           end as "F Status"
    FROM   v$transaction t,
           v$session s,
           v$rollstat r,
           dba_rollback_segs rs
    WHERE  s.saddr = t.ses_addr
    AND    t.xidusn = r.usn
    AND   rs.segment_id = t.xidusn
    ORDER BY t.used_ublk DESC
    
Finally the application has been restarted, and the application behaviour returned to its normal pace.

Clean-up

Yet during the diagnosis phase I have started again the backup script to free the space, but the script finished with errors due to earlier move of some archivelogs to another location. I did not know at that time what happened with those logs and unable to contact my colleague, so crosschecked the archivelogs and run backup again. This time all has been working properly. In the meantime sysadmins added some 20G to the archivelog storage resource because still the archivelog production has outpaced the backup process. After the application restart there has been some 20 minutes when some transactions still has been rolled back, but after this time the used storage space has dropped to usual state.
The last part of clean-up was to synchronize rman catalog, register the moved archivelogs and backup them as well.
resync catalog ;
catalog start with '[path]';
backup archivelog sequence between 40556 and 40999 delete all input;

Thursday, 7 May 2015

Source of great tech papers (external link)

Cary Millsap's company Method-R published some tech papers - really interesting thing.

Wednesday, 6 May 2015

Asynch descriptor resize (external link)

Excellent analysis of the event in cases it is seen high in query time by Tanel Poder.
We have experienced this behaviour on some databases, but usually it is bound with a particular session, so we used to kill and repeat the action.

Thursday, 2 April 2015

Zabbix and Oracle DRCP

Here short note on possible usage of database resident connection pool feature with Zabbix. This feature allows for pooling database server processes (so they stay ready to use in opposition to standard dedicated processes, which are created at the connection start and destroyed at the end). While at the beginning it seemed to be a interesting issue, it has limited impact on the Zabbix performance.
The main reason for such situation is the fact the Zabbix server keeps persistent connections to a database - in such case DRCP would be a source of problems rather than a helper. Thus only the GUI part may be a subject to such configuration. Furthermore in future versions of Zabbix the database logic is planned to be located in the API, which in turn will be a part of the Zabbix server as well (while the GUI will call the API), and with such architecture the DRCP will be no longer an option.
Anyway this is still interesting feature assuming one makes an extensive use of the GUI, while on version below 3.X
-- this call creates a pool on a database side - first argument will 
-- become a pool name in the future (currently there is only 1 pool)
begin
  dbms_connection_pool.configure_pool(
    null, minsize=5, maxsize=20, inactivity_timeout=>300, max_think_time=>600);
end;
/
-- this call starts the pool
exec dbms_connection_pool.start_pool();
-- this call stops the pool
exec dbms_connection_pool.stop_pool();
-- calls like the one below allows pool parameters change
begin
  dbms_connection_pool.ALTER_PARAM(
    null, param_name=>'session_cached_cursors', param_value=>'50');
end;
/

Monday, 9 March 2015

Jammed AQ notification

Problem definition

One of frequently performed activities in my job is preparing a test environment based on data from production. This is quite lengthy task as the main goal is to create a database as close to a productional one as possible. There are of course some differences - cheaper platform is one of them, another one is limited data volume, yet another limited number of external systems - in turn they implicate further differences.
Thus the most effective way to achieve such goal is to leverage the Data Pump technology and number of home made scripts for partially "manual".
From time to time, while creating such environment, shows up a problem with the automatic AQ notification feature, which is used quite extensively, so we have to get it working properly. One of scenarios, when it comes to the AQ notification jam is the following:
  • we try to check the AQ notification health
  • we run a script, which creates a demo queue and put there a message
  • however during load we implicitly get the situation described here
  • the notification job fails
  • we (i.e. AQ) are stuck
After such sequence of events further enqueues to the AQ queue, which should run automatic callback, fail. I am unable to provide exact reason as there may be a plenty of them - old messages left in notification queue (SYS.AQ_SRVNTFN_TABLE_1), error messages in corresponding error queue (SYS.AQ$_AQ_SRVNTFN_TABLE_1_E), automatic AQ callback job failure due to ORA-28031 and some internal reminiscences after that and so on, but I provide here a sequence of activities to fix it.

Solution

The main goal of the solution is to get rid of the old messages in the queue. First one may look for a post like this one, which tells us, what to do with messages, which are moved to the error queue. In general it is necessary to enable the error queue for dequeue and get all the messages from there.
Next step is to run very similar piece of code to dequeue "manually" all the messages from notification queue - after that the AQ notification works like a new. Additionally I restarted the database, but not sure if that was important. It may be possible You find a locking problem, when dealing with queues - in my case there was a job, which was fired on the notification queue and I was unable e.g. to enable the error queue for dequeue - I managed with it by killing the offending session.