Friday, 11 December 2015

A note about Data pump behaviour while blocked

Lately I have been managing several db environments on which I perform frequent loads with the Data Pump tools. As those loads are sometimes quite large (up to 0.5T), I regularly run into the situation where the space designated for archivelogs fills up completely, which in turn causes the archival process to stop.
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance YYYYYY - Archival Error
ORA-16038: log 1 sequence# 3208 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 1 thread 1: '+RECO/YYYYYY/onlinelog/group_1.257.895415461'
ORA-00312: online log 1 thread 1: '+DATA/YYYYYY/onlinelog/group_1.257.895415461'
Fri Dec 11 11:03:35 2015
Archiver process freed from errors. No longer stopped
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance YYYYYY - Archival Error
ORA-16014: log 1 sequence# 3208 not archived, no available destinations
I used to load data into databases in NOARCHIVELOG mode, where this issue does not occur at all. In many cases the archival break simply means that a Data Pump job pauses until the space for archivelogs is freed (usually by a backup), so it does not matter much when exactly that happens.
There are scenarios, however, in which I have faced a different behaviour. When a Data Pump job hits the archival error, it fails with the following:
ORA-39065: unexpected master process exception in RECEIVE
ORA-39078: unable to dequeue message for agent MCP from 
  queue "KUPC$C_1_20151210094644"
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 95
ORA-06512: at "SYS.KUPC$QUE_INT", line 610
ORA-04021: timeout occurred while waiting to lock object
Job "ZZZZZZ"."SYS_IMPORT_SCHEMA_01" stopped due to fatal error 
  at Thu Dec 10 11:54:39 2015 elapsed 0 02:07:55
It seems to me that such situations are possible when parallel job slaves are running. They die waiting for an object lock, and possibly it is simply a communication problem between the slaves and the master process, as the database freeze breaks this communication. However, this is not a fatal error for the job as a whole. Facing it, one may attach to the Data Pump session; the job appears frozen and does not resume automatically once work is possible again. In such a situation the user has to issue the CONTINUE_CLIENT command - the job then recreates the slaves and moves on.
Job SYS_IMPORT_SCHEMA_01 has been reopened at Thu Dec 10 12:23:51 2015
Restarting "ZZZZZZ"."SYS_IMPORT_SCHEMA_01":  ZZZZZZ/********
  parfile=ZZZZZZ.par
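For the record, re-attaching and resuming is trivial (the credentials and the job name below are simply the ones from this very log):
impdp ZZZZZZ/******** attach=SYS_IMPORT_SCHEMA_01
Import> status
Import> continue_client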

Monday, 19 October 2015

Silent installation of client

There are actually 2 components of such an installation: a response file, which contains all the arguments for the installation, and the shell call itself.

Response file

Below is an example of a response file:
oracle.install.responseFileVersion=/oracle/install/rspfmt_clientinstall_response_schema_v11_2_0

SELECTED_LANGUAGES=en
UNIX_GROUP_NAME=oinstall
INVENTORY_LOCATION=/u01/app/oracle/oraInventory
ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/client_1
ORACLE_BASE=/u01/app/oracle

oracle.install.client.installType=Custom
oracle.install.client.upgrading=false
## I cut the line below for better display
oracle.install.client.customComponents="oracle.sqlj:11.2.0.4.0","oracle.rdbms.util:11.2.0.4.0","oracle.javavm.client:11.2.0.4.0","oracle.sqlplus:11.2.0.4.0","oracle.dbjava.jdbc:11.2.0.4.0","oracle.ldap.client:11.2.0.4.0","oracle.rdbms.oci:11.2.0.4.0","oracle.precomp:11.2.0.4.0","oracle.xdk:11.2.0.4.0","oracle.network.aso:11.2.0.4.0","oracle.assistants.oemlt:11.2.0.4.0","oracle.oraolap.mgmt:11.2.0.4.0","oracle.network.client:11.2.0.4.0","oracle.network.cman:11.2.0.4.0","oracle.network.listener:11.2.0.4.0","oracle.ordim.client:11.2.0.4.0","oracle.odbc:11.2.0.4.0","oracle.has.client:11.2.0.4.0","oracle.dbdev:11.2.0.4.0","oracle.rdbms.scheduler:11.2.0.4.0"

oracle.install.client.schedulerAgentHostName=10.0.54.2
oracle.install.client.schedulerAgentPortNumber=1025
oracle.installer.autoupdates.option=

oracle.installer.autoupdates.downloadUpdatesLoc=

AUTOUPDATES_MYORACLESUPPORT_USERNAME=
AUTOUPDATES_MYORACLESUPPORT_PASSWORD=

PROXY_HOST=
PROXY_PORT=
PROXY_USER=
PROXY_PWD=
PROXY_REALM=
The example above is an excerpt from the template file provided by Oracle, stripped of comments and a number of other options. The most important entries here are the Oracle base and home locations and the inventory location.
The inventory should be placed in an independent directory, where one may change permissions without affecting the binary trees of other products. This is especially important if one plans to install more products, as all of them require access to the inventory and may be owned by different users.

Shell call

There are several options to the runInstaller executable. The simplest call looks like this:
./runInstaller -silent -responseFile <path_to_response_file>
Sometimes -ignorePrereq is useful as well, provided one knows what one is doing.
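For completeness, a full call as I tend to run it might look like the one below (the response file location and the -waitforcompletion flag are just my habit, nothing mandatory; the path is of course only an example):
./runInstaller -silent -waitforcompletion -responseFile /home/oracle/client_install.rsp
## on a host with a freshly created inventory the installer asks at the end to run
## /u01/app/oracle/oraInventory/orainstRoot.sh as root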

Pros and cons

Preparing the response file may be a bit of trouble, but it is worth the effort - after all, it is good to apply some convention, and such a file may become a kind of policy document for client installations.
Most importantly, the installation is much faster than with the GUI.

Tuesday, 22 September 2015

Trouble with an installation on AIX

This time I got quite a simple task - to install Oracle database binaries on an AIX host. Business as usual.
But after copying the files the installer displayed an error. Below is an excerpt from the log.
ld: 0711-866 INTERNAL ERROR: Output symbol table size miscalculated.

INFO: ld: 0711-759 INTERNAL ERROR: Source file xoutxx.c, line 574.
        Depending on where this product was acquired, contact your service
        representative or the approved supplier.

INFO: make: 1254-004 The error code from the last command is 16.


Stop.

INFO: make: 1254-004 The error code from the last command is 2.


Stop.

INFO: End output from spawned process.
INFO: ----------------------------------
INFO: Exception thrown from action: make
Exception Name: MakefileException
Exception String: Error in invoking target 'links proc gen_pcscfg' of makefile 
      '/oracle/app/product/11.2.0.2/precomp/lib/ins_precomp.mk'. See 
      '/oracle/oraInventory/logs/installActions2015-09-22_08-11-37AM.log' for details.
Exception Severity: 1


As this is version 11.2.0.2 and the AIX is 7.1, I was afraid this might be caused by some missing system filesets or the like. It turned out not to be the case - all the details can be found in "AIX: ERROR IN INVOKING TARGET 'LINKS PROC GEN_PCSCFG' OF MAKEFILE, ld: 0711-593 SEVERE ERROR: Symbol C_BSTAT (Doc ID 985563.1)" on Metalink.

Wednesday, 9 September 2015

ORA-32598

Today, while creating a test environment, we hit ORA-32598: user names cannot be changed in ALTER DATABASE LINK command. Part of creating a test environment is changing the users' passwords and, accordingly, the passwords stored in db links.
As this is an 11g database, we are able to leverage the ALTER DATABASE LINK command.
We run something like:
ALTER DATABASE LINK link1 CONNECT TO user1 identified by "user1test";
And the db engine complains about changing the user name, yet the name seems perfectly fine. We know the definition was copied from a production environment where this link works properly. But wait...
What about the eternal problem with double quotes and names?
That is the diagnosis, and the solution is clear then - the db link creator used lower-case letters in the command and enclosed the name in double quotes. On production it works even if the user on the remote db is named traditionally (i.e. its name is stored in upper case), as apparently the connect string on the remote side is resolved in the usual, case-insensitive way.
But in the local db link definition the user name is stored in lower case, so if one does not use double quotes, the engine does not find a match between the user already set for the link and the name given in the command. Thus the proper command should look like the following:
ALTER DATABASE LINK link1 CONNECT TO "user1" identified by "user1test";
And now all works fine...
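By the way, to see how the user name is actually stored in the link definition it is enough to look into the data dictionary, e.g.:
select owner, db_link, username from dba_db_links where db_link like 'LINK1%';
-- a lower-case value in the USERNAME column means the link was created with a quoted identifier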

After OS upgrade ...

The only really important thing to do after an OS upgrade on a host with an Oracle db is to relink the binaries.
## before it is good to find all the files owned by root
## after the relink operation they would become again owned by oracle
[oracle@host1 ~]$ find $ORACLE_HOME -user root -exec ls -l {} \;
-rw-r-----   1 root     oinstall    1534 Dec 21  2005 /oracle/product/11.2.0.3/rdbms/admin/externaljob.ora
-rwsr-x---   1 root     oinstall 1340408 Sep 25  2011 /oracle/product/11.2.0.3/bin/oradism
-rws--x---   1 root     oinstall   29216 Jun 21  2013 /oracle/product/11.2.0.3/bin/nmo
-rws--x---   1 root     oinstall   21624 Jun 21  2013 /oracle/product/11.2.0.3/bin/nmb
-rwsr-x---   1 root     oinstall 1277904 Jun 21  2013 /oracle/product/11.2.0.3/bin/extjob
-rws--x---   1 root     oinstall   93296 Jun 21  2013 /oracle/product/11.2.0.3/bin/nmhs
-rwsr-x---   1 root     oinstall   26824 Jun 21  2013 /oracle/product/11.2.0.3/bin/jssu


[oracle@host1 ~]$ cd $ORACLE_HOME
[oracle@host1 ~]$ ./bin/relink all

## after the relink operation some binaries, which have to be owned by root, are
## owned by oracle again; during an installation these permissions are set by root.sh,
## but here it is enough to run the commands below as root (root.sh does more than we need)

[root@host1 ~]# chown root $ORACLE_HOME/rdbms/admin/externaljob.ora
[root@host1 ~]# chmod 640 $ORACLE_HOME/rdbms/admin/externaljob.ora
[root@host1 ~]# chown root $ORACLE_HOME/bin/oradism
[root@host1 ~]# chmod 4750 $ORACLE_HOME/bin/oradism
[root@host1 ~]# chown root $ORACLE_HOME/bin/nmo
[root@host1 ~]# chmod 4710 $ORACLE_HOME/bin/nmo
[root@host1 ~]# chown root $ORACLE_HOME/bin/nmb
[root@host1 ~]# chmod 4710 $ORACLE_HOME/bin/nmb
[root@host1 ~]# chown root $ORACLE_HOME/bin/extjob
[root@host1 ~]# chmod 4750 $ORACLE_HOME/bin/extjob
[root@host1 ~]# chown root $ORACLE_HOME/bin/nmhs
[root@host1 ~]# chmod 4710 $ORACLE_HOME/bin/nmhs
[root@host1 ~]# chown root $ORACLE_HOME/bin/jssu
[root@host1 ~]# chmod 4750 $ORACLE_HOME/bin/jssu
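## once ownership and permissions are set back, a quick re-check may be done
## with the same find command as at the beginning
[oracle@host1 ~]$ find $ORACLE_HOME -user root -exec ls -l {} \;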

Monday, 31 August 2015

Error on auto execute of job SYS.ORA$AT_SA_SPC_SY (external link)

Today I got a ticket to handle, with content more or less like the following:
ORA-12012: error on auto execute of job "SYS"."ORA$AT_SA_SPC_SY_1202"
ORA-20000: ORU-10027: buffer overflow, limit of 20000 bytes
ORA-06512: at "SYS.DBMS_ADVISOR", line 201
ORA-06512: at "SYS.DBMS_SPACE", line 2480
ORA-06512: at "SYS.DBMS_SPACE", line 2553

The issue is not a big deal for me - a problem with the space advisor is not a critical thing IMHO - but along the way I found a short, nice note on the subject: http://qdosmsq.dunbar-it.co.uk/blog/2013/01/oraat_sa_spc_sy-jobs-failing/
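By the way, if such failures become a nuisance, one possible (though radical) option is to disable the auto space advisor task altogether - just a sketch, I did not do it in this case:
begin
  dbms_auto_task_admin.disable(
    client_name => 'auto space advisor',
    operation   => null,
    window_name => null);
end;
/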

Tuesday, 16 June 2015

Restore points

Below You may find a 'session' from enabling the restore point technology.
-- first try the old hacker method - just go ahead and do it ;-)
SQL> create restore point rp1 guarantee flashback database;
create restore point rp1 guarantee flashback database
*
ERROR at line 1:
ORA-38784: Cannot create restore point 'RP1'.
ORA-38785: Media recovery must be enabled for guaranteed restore point.
-------------------------
-- thus I see one needs to put the db into ARCHIVELOG mode
-- besides that we also need to set db_recovery_file_dest
-- though we do not need to enable the flashback mode

-- creating pfile in order to edit settings
SQL> create pfile from spfile;

File created.

-- create directory for fast recovery area (FRA)
SQL> !mkdir [path]

-- add settings for FRA
SQL> !vim $ORACLE_HOME/dbs/initcdw.ora
--*.db_recovery_file_dest_size=10G
--*.db_recovery_file_dest=[path]
SQL> create spfile from pfile;

File created.

-- now enabling ARCHIVELOG mode
SQL> startup nomount

ORACLE instance started.

Total System Global Area 2137886720 bytes
Fixed Size      2228200 bytes
Variable Size    503316504 bytes
Database Buffers  1627389952 bytes
Redo Buffers      4952064 bytes

-- create directory for archive logs
SQL> !mkdir /ora_csdb/oradata/CDW/arch

-- set local destination for archive logs to newly created directory
-- of course it was possible to add it to the pfile while editing it
-- if no custom location is set, the archive logs are stored
-- in $ORACLE_HOME/dbs
SQL> alter system set
  2  log_archive_dest_1='LOCATION=/ora_csdb/oradata/CDW/arch';

System altered.

-- switching to ARCHIVELOG mode
SQL> alter database mount;

Database altered.

SQL> alter database archivelog;

Database altered.

SQL> alter database open;

Database altered.

-- there is no switching of flashback on - so as You see this is quite a minimal scenario
-- enabling flashback is not required, but at a price - one cannot flash back to an
-- arbitrary point in time, only to the guaranteed restore points themselves
-- as You may see there is a special state provided for such a situation
-- in the FLASHBACK_ON column of V$DATABASE
SQL> select flashback_on from v$database;

FLASHBACK_ON
------------------
RESTORE POINT ONLY


-- the recovery writer (RVWR) gets started
-- as You may see, the technology makes use of the flashback facilities

-- creating a restore point
-- as You see, now all works properly
SQL> create  restore point rp1 guarantee flashback database;
[..alert.log..]
Wed Jun 10 12:44:47 2015
Starting background process RVWR
Wed Jun 10 12:44:47 2015
RVWR started with pid=28, OS id=17240 
Allocated 8388608 bytes in shared pool for flashback generation buffer
Wed Jun 10 12:45:27 2015
Created guaranteed restore point RP1
[..alert.log..]
Restore point created.

-- now preparations for a short test
SQL> 
SQL> conn / as sysdba
Connected.
SQL> alter user rems identified by simple1;

User altered.

SQL> conn rems/simple1
Connected.
SQL> create table t1 (a number)
  2  ;

Table created.

SQL> insert into t1 select object_id from dba_objects;

473175 rows created.

SQL> commit;

Commit complete.

-- now we return to the state marked by RP1
SQL> conn / as sysdba
Connected.

-- we need to stop the database and mount it
SQL> flashback database to restore point rp1;
flashback database to restore point rp1
*
ERROR at line 1:
ORA-38757: Database must be mounted and not open to FLASHBACK.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount
ORACLE instance started.

Total System Global Area 2137886720 bytes
Fixed Size      2228200 bytes
Variable Size    503316504 bytes
Database Buffers  1627389952 bytes
Redo Buffers      4952064 bytes
Database mounted.
SQL> flashback database to restore point rp1;
[..alert.log..]
Wed Jun 10 13:12:29 2015
Flashback Restore Start
Wed Jun 10 13:13:05 2015
Flashback Restore Complete
Flashback Media Recovery Start
 started logmerger process
Wed Jun 10 13:15:29 2015
db_recovery_file_dest_size of 10240 MB is 0.30% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Wed Jun 10 13:17:42 2015
Parallel Media Recovery started with 2 slaves
Wed Jun 10 13:17:43 2015
Recovery of Online Redo Log: Thread 1 Group 7 Seq 20393 Reading mem 0
  Mem# 0: /ora_csdb/oradata/CDW/onlinelog/o1_mf_7_97grvlnx_.log
  Mem# 1: /ora_csdb/oradata/CDW/onlinelog/o1_mf_7_97grvp66_.log
Incomplete Recovery applied until change 117062103544 time 06/10/2015 12:45:28
Flashback Media Recovery Complete
Wed Jun 10 13:20:53 2015
Completed: flashback database to restore point rp1
[..]
Flashback complete.

-- after flashback we have to open with RESETLOGS
SQL> alter database open
alter database open
*
ERROR at line 1:
ORA-01589: must use RESETLOGS or NORESETLOGS option for database open

SQL> alter database open resetlogs;    


SQL> alter system set log_archive_format='cdw_%d_%r_%T_%S.arc';

-- %r, %T/%t and %S/%s have to be included in the archive log format
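
-- one more thing worth remembering: a guaranteed restore point keeps flashback logs
-- in the FRA until it is dropped, so after the tests it is good to check and remove it
SQL> select name, guarantee_flashback_database, storage_size from v$restore_point;
SQL> drop restore point rp1;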



Wednesday, 10 June 2015

Today's action

Event

During the night we got some messages from one of the database hosts warning that a storage space threshold had been crossed. This was the resource on which we keep archivelogs, so quite an important one.

Action

One of my colleagues started by manually launching the standard backup script, which is normally called by cron at scheduled times. Unfortunately and unexpectedly, the database had produced a massive number of archivelogs, for which the system was not prepared at all - so apparently something was wrong with the application processing. Then it got to the point where the backup script could not free storage space fast enough for further logs; due to some other circumstances it jammed itself, and the backup system administrator had to kill it.

The diagnosis was done with the following tools:
  • OEM - with this we were able to find the malicious application and call for its restart
  • an ad hoc script (shown below) showed that the problematic transactions frequently finished with a rollback, which was the direct cause of the massive archivelog growth. The script reveals the status of a transaction and its progress in the form of used_ublk (i.e. undo blocks used by the transaction). Subsequent calls show the change in the number of used blocks - in the case of a rollback the value of course diminishes.
    set linesize 300 pagesize 1000
    col rollback for a25
    
    SELECT s.username,
           s.sid,
           s.serial#,
           t.used_ublk,
           t.used_urec,
           rs.segment_name,
           round(r.rssize/1024/1024) rs_sz_mb,
           r.status,
           case 
           when bitand(t.flag,power(2,7)) > 0 
           then 'rollback in progress'
           else 'other'
           end as "F Status"
    FROM   v$transaction t,
           v$session s,
           v$rollstat r,
           dba_rollback_segs rs
    WHERE  s.saddr = t.ses_addr
    AND    t.xidusn = r.usn
    AND   rs.segment_id = t.xidusn
    ORDER BY t.used_ublk DESC;
    
Finally the application was restarted, and its behaviour returned to the normal pace.

Clean-up

Still during the diagnosis phase I started the backup script again to free the space, but it finished with errors because some archivelogs had earlier been moved to another location. At that time I did not know what had happened to those logs and was unable to contact my colleague, so I crosschecked the archivelogs and ran the backup again. This time everything worked properly. In the meantime the sysadmins added some 20G to the archivelog storage resource, because archivelog production was still outpacing the backup process. After the application restart there were some 20 minutes during which transactions were still being rolled back, but after that the used storage space dropped back to its usual level.
The last part of the clean-up was to synchronize the RMAN catalog, register the moved archivelogs and back them up as well:
resync catalog ;
catalog start with '[path]';
backup archivelog sequence between 40556 and 40999 delete all input;

Thursday, 7 May 2015

Source of great tech papers (external link)

Cary Millsap's company Method-R published some tech papers - really interesting stuff.

Wednesday, 6 May 2015

Asynch descriptor resize (external link)

An excellent analysis by Tanel Poder of this event for cases where it shows up high in query time.
We have experienced this behaviour on some databases, but usually it is bound to a particular session, so we used to kill the session and repeat the action.

Thursday, 2 April 2015

Zabbix and Oracle DRCP

Here is a short note on the possible usage of the database resident connection pool (DRCP) feature with Zabbix. This feature allows for pooling database server processes (so they stay ready for use, as opposed to standard dedicated processes, which are created when a connection starts and destroyed when it ends). While at first it seemed an interesting option, it has limited impact on Zabbix performance.
The main reason is the fact that the Zabbix server keeps persistent connections to the database - in such a case DRCP would be a source of problems rather than a help. Thus only the GUI part may be subject to such a configuration. Furthermore, in future versions of Zabbix the database logic is planned to be moved into the API, which in turn will be part of the Zabbix server as well (while the GUI will call the API), and with such an architecture DRCP will no longer be an option.
Anyway, this is still an interesting feature, assuming one makes extensive use of the GUI and runs a version below 3.x.
-- this call creates a pool on a database side - first argument will 
-- become a pool name in the future (currently there is only 1 pool)
begin
  dbms_connection_pool.configure_pool(
    null, minsize=>5, maxsize=>20, inactivity_timeout=>300, max_think_time=>600);
end;
/
-- this call starts the pool
exec dbms_connection_pool.start_pool();
-- this call stops the pool
exec dbms_connection_pool.stop_pool();
-- calls like the one below allow changing the pool parameters
begin
  dbms_connection_pool.ALTER_PARAM(
    null, param_name=>'session_cached_cursors', param_value=>'50');
end;
/
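On the client side the only change is to request a pooled server in the connect string. Below is a sketch for the Zabbix GUI - the alias, host and service names are made up, and it assumes the frontend connects through a regular Oracle client:
# tnsnames.ora entry - note SERVER=POOLED in CONNECT_DATA
ZABBIX_POOLED =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = zabbixdb)(SERVER = POOLED))
  )
# or the EZCONNECT form: dbhost:1521/zabbixdb:POOLED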

Monday, 9 March 2015

Jammed AQ notification

Problem definition

One of the activities I perform frequently in my job is preparing a test environment based on data from production. This is quite a lengthy task, as the main goal is to create a database as close to the production one as possible. There are of course some differences - a cheaper platform is one of them, limited data volume another, yet another a limited number of external systems - and they in turn imply further differences.
Thus the most effective way to achieve this goal is to leverage the Data Pump technology plus a number of home-made scripts for the partially "manual" steps.
From time to time, while creating such an environment, a problem shows up with the automatic AQ notification feature, which is used quite extensively, so we have to get it working properly. One scenario which leads to the AQ notification jam is the following:
  • we try to check the AQ notification health
  • we run a script, which creates a demo queue and put there a message
  • however during load we implicitly get the situation described here
  • the notification job fails
  • we (i.e. AQ) are stuck
After such a sequence of events, further enqueues to the AQ queue which should trigger the automatic callback fail. I am unable to provide the exact reason, as there may be plenty of them - old messages left in the notification queue (SYS.AQ_SRVNTFN_TABLE_1), error messages in the corresponding error queue (SYS.AQ$_AQ_SRVNTFN_TABLE_1_E), an automatic AQ callback job failure due to ORA-28031 and some internal leftovers after that, and so on - but I provide here a sequence of activities to fix it.

Solution

The main goal of the solution is to get rid of the old messages in the queues. First, one may look for a post like this one, which explains what to do with messages that have been moved to the error queue. In general it is necessary to enable the error queue for dequeue and pull all the messages from there.
The next step is to run a very similar piece of code to dequeue "manually" all the messages from the notification queue - after that the AQ notification works like new. Additionally I restarted the database, but I am not sure whether that was important. It is also possible You will hit a locking problem when dealing with the queues - in my case there was a job fired on the notification queue and I was unable e.g. to enable the error queue for dequeue - I dealt with it by killing the offending session.
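For reference, below is a minimal sketch of such a dequeue loop (this is not the exact script I used; in particular the payload type SYS.AQ$_SRVNTFN_MESSAGE is an assumption - check dba_queue_tables.object_type for your version first):
-- check the actual payload type of the notification queue table
select queue_table, object_type from dba_queue_tables
where queue_table like 'AQ_SRVNTFN%';

-- enable the error queue for dequeue
begin
  dbms_aqadm.start_queue(queue_name => 'SYS.AQ$_AQ_SRVNTFN_TABLE_1_E',
                         enqueue    => false,
                         dequeue    => true);
end;
/

-- drain the error queue; afterwards run the same block against the notification
-- queue itself (its name can be found in dba_queues for the same queue table)
declare
  deq_opt  dbms_aq.dequeue_options_t;
  msg_prop dbms_aq.message_properties_t;
  payload  sys.aq$_srvntfn_message;  -- assumed payload type, verify it first
  msgid    raw(16);
begin
  deq_opt.wait       := dbms_aq.no_wait;
  deq_opt.navigation := dbms_aq.first_message;
  loop
    dbms_aq.dequeue(queue_name         => 'SYS.AQ$_AQ_SRVNTFN_TABLE_1_E',
                    dequeue_options    => deq_opt,
                    message_properties => msg_prop,
                    payload            => payload,
                    msgid              => msgid);
    commit;
  end loop;
exception
  when others then
    if sqlcode != -25228 then raise; end if;  -- ORA-25228 = no more messages
    commit;
end;
/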

Friday, 20 February 2015

Diag policies (external link)

Link

Some time ago I googled a great article by Gavin Soorma (http://gavinsoorma.com/2010/09/purging-trace-and-dump-files-with-11g-adrci). It highlights the existence of retention policies for the information kept under the diagnostic_dest location.

Example usage

#!/bin/bash
## The script sets trace/diagnostics retention policies for test environments
SHORT_POLICY="SHORTP_POLICY=308"   #in h (14 days)
LONG_POLICY="LONGP_POLICY=2400"    #in h (100 days)

if [ -z "$ORACLE_BASE" ]; then
  #echo "no ORACLE_BASE defined\n"
  #exit
  oratab='/etc/oratab'
  os=`uname -s`
  if [ "$os" = 'SunOS' ]; then
      oratab='/var/opt/oracle/oratab'
  fi
  ## or rather read only the interesting lines
  readarray -t oratab_entries < "$oratab"
  for entry in "${oratab_entries[@]}";
  do
    if [ "${entry:0:1}" != '#' ]; then
      ## set sid
      export ORACLE_SID=${entry%%:*}
      ## get the diag_dest
      diag_dest=`sqlplus -S / as sysdba <<-'CMD1'
set feed off head off
select value from v$parameter where name='diagnostic_dest';
CMD1`
      if [ -n "$diag_dest" ]; then
        break;
      fi
    fi
  done;
else
  diag_dest="$ORACLE_BASE/diag"
fi
set -x
echo $diag_dest
cd $diag_dest
for h in `adrci exec="show homes" | grep -v "ADR Homes:"`;
do
  adrci exec="set home $h; set control ($SHORT_POLICY, $LONG_POLICY)"
done;
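To verify the outcome one may then simply display the control settings for a given home, e.g. (the home path below is just an example):
adrci exec="set home diag/rdbms/mydb/MYDB; show control"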

Tuesday, 13 January 2015

PSU installation issues - example on AIX

Here is a short report on today's problems with a PSU installation. The whole installation actually consists of 3 commands. Firstly I usually run
opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir ./
to find any problems OPatch is able to figure out without actually applying the patch. The next step, assuming all is OK, is to apply the patch and, quite frequently, to update the database dictionaries:
opatch apply
# then within the SQL*Plus run the patched database and call e.g.
@?/rdbms/admin/catbundle.sql psu apply


OPatch cannot find the patch files

OPatch failed with error code 73 (which is quite generic). It provided some details, presented below:
Failed to load the patch object. Possible causes are:
  The specified path is not an interim Patch shiphome
  Meta-data files are missing from the patch area
Patch location = /oracle/dist/17082367
Details = PatchObject constructor: Input file "/oracle/dist/17082367/etc/config/actions" or "/oracle/dist/17082367/etc/config/inventory" does not exist.
ApplySession failed: PatchObject constructor: Input file "/oracle/dist/17082367/etc/config/actions" or "/oracle/dist/17082367/etc/config/inventory" does not exist.
System intact, OPatch will not attempt to restore the system
The reason for this problem is the following: the patch 17082367 is in fact a PSU, so all the individual patches' files and structures are placed in subdirectories, while OPatch looks for them in the parent directory. The solution in my case was to download a newer OPatch version.

OPatch cannot copy a library file (AIX)

Another problem arose with the message:
Copy failed from '/oracle/dist/17082367/13696224/files/lib/libnnz11.so' to '$ORACLE_HOME/lib/libnnz11.so'...
The important thing here is that this is the AIX operating system, and a frequent thing on AIX is that libraries stay locked while loaded in memory. The solution is to run slibclean as root.
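So the fix is a one-liner (run as root), after which the patch application can simply be retried:
# as root
slibclean
# then, as oracle, re-run: opatch apply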

Not enough storage space

It is quite important to check the available storage space beforehand. Fortunately the version of OPatch I used checked that for me. The solution was to clean up and free something like ~500G of space.

Warnings (AIX)

A common thing on AIX when installing patches is to get a number of warnings, e.g.
OPatch found the word "warning" in the stderr of the make command.
Please look at this stderr. You can re-run this make command.
Stderr output:
ld: 0711-773 WARNING: Object /oracle/app/11.2.0.2/lib//libgeneric11.a[sdbgrfu.o], imported symbol timezone
        Symbol was expected to be local. Extra instructions are being generated to reference the symbol.
We ignore them actually.

Friday, 9 January 2015

How to reuse already restored datafiles when recovering database with datafiles rename

Problem

A few times a year it happens that I have to prepare test environments which are complete copies of production databases. Because in such a situation everything matches the production environment, I do not have to deal with an architecture change, so the best way is simply to make use of the RMAN backups, which are stored on tapes somewhere within a backup system.

The task is then quite easy. I produce something like the following:
CONNECT target /
CONNECT catalog rman/xxx@rman_catalog
SPOOL LOG TO restore_db_20141202.log
RUN {
ALLOCATE CHANNEL t1 TYPE 'SBT_TAPE' FORMAT '<backup file name format>' 
PARMS 'ENV=(<backup system specific options>)';
SET UNTIL TIME "to_date('2014-11-30 00:11:00','yyyy-mm-dd hh24:mi:ss')";
RESTORE CONTROLFILE;
ALTER DATABASE MOUNT;
sql "ALTER DATABASE DISABLE BLOCK CHANGE TRACKING";
SET NEWNAME FOR DATABASE TO NEW;
RESTORE DATABASE;
SWITCH DATAFILE ALL;
SWITCH TEMPFILE ALL;
RECOVER DATABASE DELETE ARCHIVELOG MAXSIZE 90G;
RELEASE CHANNEL t1;
}
As You may see, I rename all the database files to something new. I do not do this every time, but sometimes the new environment has differently named mount points, or the production database has some history and not all files are kept on the same mount point, while on a test environment it is convenient to calculate the whole database size, allocate that amount in one resource and mount it under one location.
In this script that location is not specified explicitly - I prefer to set the db_create_file_dest instance parameter instead - thus any newly created files are also placed on the same path.

Then I run the script, and very often I do not have to do anything more until the process ends successfully. Yet from time to time shit happens :-( and in 9 of 10 cases in our shop it is a temporary problem with the backup system, or at least with the communication with it - for small databases this would not be a big deal, but firstly I have never met it there, and secondly even starting from scratch for some 50G means only a fraction of an hour.
The real problem is with quite big databases (a few TB), and we have one case, where it happens almost regularly, in which the connection to the backup system goes through the LAN (as opposed to the SAN). Such an operation takes at least a few days, and after a few TB restored, while still in the restore phase, we hit a crash.
Some of the datafiles are then fully restored, some partially, some not at all - and yet v$datafile contains only the old file names, so at first sight there is no mapping between the newly restored files and the metadata stored within the controlfile.
Below is a short how-to for dealing with this situation.

Solution

First of all You have to look into the views v$datafile_header and v$datafile_copy. The v$datafile_copy view should hold the names of the newly created files (name is not null and completion_time > <some date before running the restore and recover operation>) - then it is easy to create a script which, instead of switching to a new name, switches to the already restored files.
select
'set newname for datafile '||x.file#||' to '||x.descr||';' rman_cmd
from (
select d.file#, 'new' descr from v$datafile d
minus
select c.file#, 'new' descr 
from v$datafile_copy c where c.completion_time > sysdate-5
union all
select c.file#, ''''||c.name||'''' descr 
from v$datafile_copy c where c.completion_time > sysdate-5
) x
order by x.file#;
Now we take the above output, paste it into the previous script, remove the restore controlfile and mount commands and run it - RMAN will check the status of the restored datafiles, keep the fully restored ones intact and restore only those that are fuzzy or not yet restored. And that's it.
The final script will look like this:
CONNECT target /
CONNECT catalog rman/xxx@rman_catalog
SPOOL LOG TO restore_db_20141202.log
RUN {
ALLOCATE CHANNEL t1 TYPE 'SBT_TAPE' FORMAT '<backup file name format>' 
PARMS 'ENV=(<backup system specific options>)';
SET UNTIL TIME "to_date('2014-11-30 00:11:00','yyyy-mm-dd hh24:mi:ss')";

# here are the changes - while it may seem trivial and unimportant, let's look at the file ids
# with a few thousand files this saves a lot of work
SET NEWNAME FOR DATAFILE    1 TO '/{path_to_data}/o1_mf_system_c99zgdp4_.dbf';
SET NEWNAME FOR DATAFILE    2 TO '/{path_to_data}/o1_mf_undotbs1_c99nq5c3_.dbf';
SET NEWNAME FOR DATAFILE    3 TO '/{path_to_data}/o1_mf_sysaux_c99op8h2_.dbf';
SET NEWNAME FOR DATAFILE    4 TO '/{path_to_data}/o1_mf_users_c99ysv13_.dbf';
SET NEWNAME FOR DATAFILE    5 TO new;
SET NEWNAME FOR DATAFILE    6 TO new;
[..]
SET NEWNAME FOR DATAFILE 1419 TO '/{path_to_data}/o1_mf_tstas1_d__c99tghry_.dbf';
SET NEWNAME FOR DATAFILE 1422 TO '/{path_to_data}/o1_mf_tstidx_d_c99zswmb_.dbf';

RESTORE DATABASE;
SWITCH DATAFILE ALL;
SWITCH TEMPFILE ALL;
RECOVER DATABASE DELETE ARCHIVELOG MAXSIZE 90G;
RELEASE CHANNEL t1;
}
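After such a run it is worth a quick sanity check that nothing was left behind - a trivial sketch of such a check (any FUZZY='YES' or a non-empty ERROR deserves a look):
select status, error, fuzzy, count(*)
from v$datafile_header
group by status, error, fuzzy;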