Monday, 9 March 2015

Jammed AQ notification

Problem definition

One of the frequently performed activities in my job is preparing a test environment based on data from production. This is quite a lengthy task, as the main goal is to create a database as close to the production one as possible. There are of course some differences - a cheaper platform is one of them, another is a limited data volume, yet another a limited number of external systems - and these in turn imply further differences.
Thus the most effective way to achieve this goal is to leverage the Data Pump technology plus a number of home-made scripts for the partially "manual" steps.
From time to time while creating such an environment, a problem shows up with the automatic AQ notification feature, which we use quite extensively, so we have to get it working properly. One of the scenarios in which the AQ notification jam arises is the following:
  • we try to check the AQ notification health
  • we run a script, which creates a demo queue and puts a message there
  • however during the load we inadvertently hit the situation described here
  • the notification job fails
  • we (i.e. AQ) are stuck
After such a sequence of events, further enqueues to the AQ queue, which should fire the automatic callback, fail. I am unable to provide the exact reason, as there may be plenty of candidates - old messages left in the notification queue (SYS.AQ_SRVNTFN_TABLE_1), error messages in the corresponding error queue (SYS.AQ$_AQ_SRVNTFN_TABLE_1_E), an automatic AQ callback job failure due to ORA-28031 and some internal leftovers after it, and so on - but I can provide here a sequence of activities that fixes it.

Solution

The main goal of the solution is to get rid of the old messages in the queue. First, one may look at a post like this one, which tells us what to do with messages that have been moved to the error queue. In general it is necessary to enable the error queue for dequeue and dequeue all the messages from there.
The next step is to run a very similar piece of code to dequeue "manually" all the messages from the notification queue - after that the AQ notification works like new. Additionally I restarted the database, though I am not sure whether that was important. It is also possible You will hit a locking problem when dealing with the queues - in my case there was a job fired on the notification queue and I was unable e.g. to enable the error queue for dequeue - I dealt with it by killing the offending session.
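For illustration, below is a minimal sketch of such a drain loop, run here against the error queue; the payload type and the exact queue names are assumptions - check DBA_QUEUES and DBA_QUEUE_TABLES for the actual values in your release. The same block, pointed at the notification queue itself, handles the second step.
-- enable the error queue for dequeue (name assumed, verify in dba_queues)
begin
  dbms_aqadm.start_queue(
    queue_name => 'SYS.AQ$_AQ_SRVNTFN_TABLE_1_E',
    enqueue    => false,
    dequeue    => true);
end;
/
-- drain all messages; ORA-25228 signals the queue is empty
declare
  opts    dbms_aq.dequeue_options_t;
  props   dbms_aq.message_properties_t;
  payload sys.aq$_srvntfn_message;  -- assumed payload type of the queue table
  msgid   raw(16);
  q_empty exception;
  pragma exception_init(q_empty, -25228);
begin
  opts.wait       := dbms_aq.no_wait;
  opts.navigation := dbms_aq.first_message;
  loop
    dbms_aq.dequeue(
      queue_name         => 'SYS.AQ$_AQ_SRVNTFN_TABLE_1_E',
      dequeue_options    => opts,
      message_properties => props,
      payload            => payload,
      msgid              => msgid);
    commit;
  end loop;
exception
  when q_empty then
    commit; -- queue drained
end;
/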

Friday, 20 February 2015

Diag policies (external link)

Link

Some time ago I found through Google a great article by Gavin Soorma (http://gavinsoorma.com/2010/09/purging-trace-and-dump-files-with-11g-adrci). It emphasizes the existence of retention policies built for the information kept under the diagnostic_dest location.

Example usage

#!/bin/bash
## The script sets trace/diagnostics policies for test environments
SHORT_POLICY="SHORTP_POLICY=336" #in hours (14 days)
LONG_POLICY="LONGP_POLICY=2400" #in hours (100 days)

if [ -z "$ORACLE_BASE" ]; then
  #echo "no ORACLE_BASE defined\n"
  #exit
  oratab='/etc/oratab'
  os=`uname -s`
  if [ "$os" = 'SunOS' ]; then
      oratab='/var/opt/oracle/oratab'
  fi
  ## or rather read only interesting lines
  readarray -t oratab_entries < "$oratab"
  for entry in "${oratab_entries[@]}";
  do
    if [ "${entry:0:1}" != '#' ]; then
      ## set sid
      export ORACLE_SID=${entry%%:*}
      ## get the diag_dest
      diag_dest=$(sqlplus -S '/ as sysdba' <<-'CMD1'
set feed off head off
select value from v$parameter where name='diagnostic_dest';
CMD1
)
      if [ -n "$diag_dest" ]; then
        break;
      fi
    fi
  done;
else
  diag_dest="$ORACLE_BASE/diag"
fi
set -x
echo $diag_dest
cd $diag_dest
for h in `adrci exec="show homes" | grep -v "ADR Homes:"`;
do
  adrci exec="set home $h; set control ($SHORT_POLICY,$LONG_POLICY)"
done;
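To verify the result, one can afterwards query the policies of a given home with something like the following (the home path below is only an example - use one of those printed by "show homes"):
adrci exec="set home diag/rdbms/mydb/MYDB; show control"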

Tuesday, 13 January 2015

PSU installation issues - example on AIX

Here is a short report on today's problems with a PSU installation. The whole thing actually consists of 3 commands. First I usually run
opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir ./
to find any problems OPatch is able to figure out without actually applying the patch. The next step, assuming all is ok, is to run the patch application and, quite frequently, an update of the database dictionaries:
opatch apply
# then within the SQL*Plus run the patched database and call e.g.
@?/rdbms/admin/catbundle.sql psu apply


OPatch cannot find patch files

OPatch failed with an error code 73 (which is quite generic). It provided some details, presented below:
Failed to load the patch object. Possible causes are:
   The specified path is not an interim Patch shiphome
   Meta-data files are missing from the patch area
   Patch location = /oracle/dist/17082367
   Details = PatchObject constructor: Input file "/oracle/dist/17082367/etc/config/actions" or "/oracle/dist/17082367/etc/config/inventory" does not exist.
ApplySession failed: PatchObject constructor: Input file "/oracle/dist/17082367/etc/config/actions" or "/oracle/dist/17082367/etc/config/inventory" does not exist.
System intact, OPatch will not attempt to restore the system
The reason for this problem is the following: the patch 17082367 is in fact a PSU, so all the patch files and structures sit one level deeper, in subdirectories, while this OPatch version looks for them directly in the given directory. The solution in my case was to download a newer OPatch version.

OPatch cannot copy a library file (AIX)

Another problem arose with the message:
Copy failed from '/oracle/dist/17082367/13696224/files/lib/libnnz11.so' to '$ORACLE_HOME/lib/libnnz11.so'...
The important thing here is that the operating system is AIX, and a frequent thing on AIX is that libraries are kept locked while loaded into memory. The solution is to run slibclean as root.
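The fix itself is a one-liner, after which the failed step can simply be retried:
# as root: remove currently unused modules from kernel and library memory
slibclean
# then, as the oracle user, retry the failed step
opatch apply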

Too little storage space

It is quite important to check the available storage space upfront. Fortunately the version of OPatch I used checked that for me. The solution was simply to clean up and free something like ~500G of space.

Warnings (AIX)

A common thing on AIX when installing patches is to get a number of OPatch warnings about linker warnings, e.g.:
OPatch found the word "warning" in the stderr of the make command.
Please look at this stderr. You can re-run this make command.
Stderr output:
ld: 0711-773 WARNING: Object /oracle/app/11.2.0.2/lib//libgeneric11.a[sdbgrfu.o], imported symbol timezone
    Symbol was expected to be local. Extra instructions
    are being generated to reference the symbol.
We ignore them actually.

Friday, 9 January 2015

How to reuse already restored datafiles when recovering a database with datafile renaming

Problem

It happens a few times a year that I have to prepare test environments which are complete copies of production databases. Because in such a situation everything is compliant with the production environment, I do not have to deal with an architecture change, so the best way is simply to make use of the RMAN backups, which are stored on tapes somewhere within a backup system.

The task is then quite easy. I produce something like the following:
CONNECT target /
CONNECT catalog rman/xxx@rman_catalog
SPOOL LOG TO restore_db_20141202.log
RUN {
ALLOCATE CHANNEL t1 TYPE 'SBT_TAPE' FORMAT '<backup file name format>' 
PARMS 'ENV=(<backup system specific options>)';
SET UNTIL TIME "to_date('2014-11-30 00:11:00','yyyy-mm-dd hh24:mi:ss')";
RESTORE CONTROLFILE;
ALTER DATABASE MOUNT;
sql "ALTER DATABASE DISABLE BLOCK CHANGE TRACKING";
SET NEWNAME FOR DATABASE TO NEW;
RESTORE DATABASE;
SWITCH DATAFILE ALL;
SWITCH TEMPFILE ALL;
RECOVER DATABASE DELETE ARCHIVELOG MAXSIZE 90G;
RELEASE CHANNEL t1;
}
As You may see, I rename all the database files to something new. I do not do this every time, but sometimes such a new environment has differently named mount points, or a production database has some history and not all files are kept under the same mount point, while on a test environment it is convenient to calculate the whole database size, allocate that amount as one resource and mount it under one location.
In this script this location is not specified explicitly - I prefer to set the db_create_file_dest instance parameter instead - thus any possibly newly created files are placed on the same path as well.
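Setting it is a single statement run before the restore (the path is just an example):
alter system set db_create_file_dest='/u01/oradata/testdb' scope=memory;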

Then I run the script, and very often I do not have to do anything more until the process ends successfully. Yet from time to time shit happens :-( and in 9 of 10 cases in our shop it is a temporary problem with the backup system, or at least with the communication with it - for small databases this would not be a big deal, but firstly I have never met it there, and secondly even restarting from scratch for 50G costs only a fraction of an hour.
The real problem is with quite big databases (a few TB), and we have one case, where it happens almost regularly, in which the connection to the backup system goes through a LAN (as opposed to a SAN). Such an operation takes at least a few days, and after a few TB restored, while still in the restore phase, we hit a crash.
Some of the datafiles are then fully restored, some partially, some not at all - and yet v$datafile contains only the old file names, so at first sight there is no mapping between the newly restored files and the metadata content stored within the controlfile.
Below is a short description of how to deal with this situation.

Solution

First of all, You have to look into the views v$datafile_header and v$datafile_copy. The v$datafile_copy view keeps the names of the newly created files (name is not null and completion_time > <some date before running the restore and recover operation>) - with that it is easy to create a script which, instead of switching everything to a new name, switches to the already restored files where they exist.
select
'set newname for datafile '||x.file#||' to '||x.descr||';' rman_cmd
from (
select d.file#, 'new' descr from v$datafile d
minus
select c.file#, 'new' descr 
from v$datafile_copy c where c.completion_time > sysdate-5
union all
select c.file#, ''''||c.name||'''' descr 
from v$datafile_copy c where c.completion_time > sysdate-5
) x
order by x.file#;
Now we take the above output, paste it into the previous script in place of SET NEWNAME FOR DATABASE TO NEW, remove the restore controlfile and mount commands, and run it - RMAN will check the status of the restored datafiles, keep the fully restored ones intact and restore only those that are fuzzy or missing. And that's it.
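For illustration, the pasted fragment could look more or less like this (the file numbers and the OMF-style name are made up):
SET NEWNAME FOR DATAFILE 1 TO '/u01/oradata/testdb/o1_mf_system_abc123_.dbf';
SET NEWNAME FOR DATAFILE 2 TO NEW;
RESTORE DATABASE;
SWITCH DATAFILE ALL;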

Thursday, 4 December 2014

Problem with OPatch

Today I hit a problem with the installation of a patch set. Details below:
[oracle@yyy]$ opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir ./19121548
Oracle Interim Patch Installer version 11.2.0.3.5
Copyright (c) 2013, Oracle Corporation.  All rights reserved.

PREREQ session

Oracle Home       : /oracle/product/11.2.0.3
Central Inventory : /oracle/oraInventory
   from           : /oracle/product/11.2.0.3/oraInst.loc
OPatch version    : 11.2.0.3.5
OUI version       : 11.2.0.3.0
Log file location : /oracle/product/11.2.0.3/cfgtoollogs/opatch/opatch2014-08-08_13-19-37PM_1.log

Invoking prereq "checkconflictagainstohwithdetail"
List of Homes on this system:

  Home name= OraDb10g_home1, Location= "/oracle/product/102"
  Home name= OraDb11g_home1, Location= "/oracle/product/11.2.0.2"
  Home name= agent10g1, Location= "/oracle/oagent/product/10.2.0/agent10g"
Prereq "checkConflictAgainstOHWithDetail"not executed 
Unable to create Patch Object.
Exception occured : null

OPatch succeeded.


Colleagues of mine suggested rebuilding the Oracle inventory.
[oracle@yyy bin]$ ./runInstaller -silent -invPtrLoc $ORACLE_HOME/oraInst.loc -attachHome ORACLE_HOME="/oracle/product/11.2.0.3" ORACLE_HOME_NAME="Ora11gR2home"
Starting Oracle Universal Installer...

Checking swap space: must be greater than 500 MB.   Actual 1239 MB    Passed
'AttachHome' failed.
'AttachHome' failed.


The solution is quite simple. First one has to remove any remnants of the current Oracle inventory. Then the attachHome operation starts to work properly, and with the inventory rebuilt, OPatch starts to work properly as well.
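A possible sequence is sketched below - the inventory location should be taken from oraInst.loc, and the paths here are just examples:
# find the central inventory location
cat $ORACLE_HOME/oraInst.loc
# move the broken inventory aside rather than deleting it
mv /oracle/oraInventory /oracle/oraInventory.broken
# recreate the inventory by attaching the home again
cd $ORACLE_HOME/oui/bin
./runInstaller -silent -invPtrLoc $ORACLE_HOME/oraInst.loc \
  -attachHome ORACLE_HOME="/oracle/product/11.2.0.3" ORACLE_HOME_NAME="Ora11gR2home"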

Thursday, 6 November 2014

ORA-4030 and huge PL/SQL block

Lately we got for execution a quite huge anonymous PL/SQL block (~10M). When it went for execution, it failed with ORA-4030. There are limits to the size of a PL/SQL block; they may be raised at the OS level or in the database (look at "ORA-4030 (PLSQL Opt Pool,pdziM01_Create: New Set), ORA-4030 (PLS CGA hp,pdzgM64_New_Link)" (Doc ID 1551115.1) on Metalink).

For example, at the OS level on Linux, the /proc/sys/vm/max_map_count parameter sets the maximum number of memory map areas a process may have.
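It can be inspected and raised like this (as root; the value is only an example):
# check the current limit
cat /proc/sys/vm/max_map_count
# raise it for the running system
sysctl -w vm.max_map_count=262144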
But before You mess with those settings, rethink whether it is really necessary - most probably there is another way to run the equivalent code, which makes better use of the memory at its disposal.
In this very case the 10M block consisted of about a million calls to a procedure with different parameters provided as literals. A much better way would be to load all the parameter values into a table (by sqlldr into a persistent table, or by simple inserts in one session into a global temporary table with rows preserved on commit) and then call the procedure from a much, much smaller script on values from a cursor - or to get rid of the PL/SQL block altogether, turn the procedure into a function and call it in a simple SELECT.
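A minimal sketch of the cursor variant (the table and procedure names are of course made up):
-- assumes the million parameter rows were loaded into param_stage(p1, p2)
begin
  for r in (select p1, p2 from param_stage) loop
    some_proc(r.p1, r.p2);
  end loop;
  commit;
end;
/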