Tuesday, 20 November 2012

Duplicating for standby

At the start I have a target database configured with OMF and a unique db name set. On the standby side I created an instance, started it, and that's it. Oh, yeah - I copied the password file from the primary side to be able to connect to the standby, and configured the listener with GLOBAL_DBNAME, which registers a service usable in tnsnames.ora on both hosts. Of course I added the proper entries to tnsnames.ora as well.
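For reference, a static registration of this kind could look roughly as below. The host names, paths and the ZABBIX/ZABBIX_STBY names are placeholders, not the actual values from this setup:

```
# listener.ora on the standby host - minimal sketch
SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = ZABBIX_STBY)
      (ORACLE_HOME = /oracle/product/11.2.0/db_1)
      (SID_NAME = ZABBIX)
    )
  )

# tnsnames.ora entry (the same on both hosts)
ZABBIX_STBY =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = standby-host)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = ZABBIX_STBY))
  )
```

The static GLOBAL_DBNAME registration matters here because the standby instance is only in NOMOUNT state during the duplication, so it cannot dynamically register its service with the listener.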
Many configurations are possible for this duplication - combinations of a target database, a standby database and a catalog. Of course the standby one is a must, but the rest is optional and may be omitted, although that complicates the matter a little. I chose to use, in addition to the standby database, an active target database. This avoids the trouble of applying all the redo logs since the last backup.
RMAN> DUPLICATE TARGET DATABASE FOR STANDBY FROM ACTIVE DATABASE;
First failure: the command failed with ORA-17628: Oracle error 19505 returned by remote Oracle server on the primary and
ORA-19505: failed to identify file "/oracle/oradata/ZABBIX/controlfile/o1_mf_855ys7j8_.ctl"
ORA-27040: file create error, unable to create file
as I had not created the path for the controlfile (fixed with mkdir -p /oracle/oradata/ZABBIX/controlfile).

Second start and failure:
RMAN-03002: failure of Duplicate Db command at 11/13/2012 10:14:19
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script
ORA-06550: line 1, column 17:
PLS-00201: identifier 'DBMS_RCVCAT.GETDBID' must be declared
ORA-06550: line 1, column 7:
I found at http://blog.csdn.net/tianlesoftware/article/details/6232292 that this is caused by not specifying nocatalog when connecting with rman. And that really helped.
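With that in mind, the connection that works looks roughly like this (the connect strings below are placeholders for the actual tnsnames.ora entries):

```
rman target sys/password@ZABBIX auxiliary sys/password@ZABBIX_STBY nocatalog
```

Without the explicit nocatalog, RMAN apparently assumes a recovery catalog connection and ends up calling DBMS_RCVCAT, hence the PLS-00201 above.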
Third try, and this time I got much further. I was wrong in assuming the online logs would be created according to db_create_file_dest (so I should set at least db_create_online_log_dest_n to overcome this). This time, however, the duplication moved forward - the data files were created according to plan.
Because the copy is done with backup as copy reuse datafile X auxiliary format new, there is no useless staging on storage local to the primary and all bytes go directly to the standby files, which is nice.

To finish the standby I now need to add standby log files (though this is optional) and set log_archive_dest_state_2 to enable. Managed standby recovery is already running, started by the duplicate command. One may want to switch it to recovery using the current logfile.
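The finishing steps could be sketched as follows. The sizes, group numbers and the exact dest_2 setting are illustrative and should be adjusted to the primary's configuration:

```sql
-- on the standby: standby redo logs, one group more than the online log groups
ALTER DATABASE ADD STANDBY LOGFILE GROUP 4 SIZE 50M;
ALTER DATABASE ADD STANDBY LOGFILE GROUP 5 SIZE 50M;
ALTER DATABASE ADD STANDBY LOGFILE GROUP 6 SIZE 50M;
ALTER DATABASE ADD STANDBY LOGFILE GROUP 7 SIZE 50M;

-- on the primary: start shipping redo to the standby destination
ALTER SYSTEM SET log_archive_dest_state_2 = enable;

-- on the standby: restart managed recovery with real-time apply
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
```

Note that USING CURRENT LOGFILE (real-time apply) requires the standby redo logs from the first step to be in place.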

Distributed transactions holding locks on some resources

One of the main problems with XA transactions seems to be the locking of rows by a distributed transaction. This is perfectly normal when the lock lasts a short time, but it becomes a nightmare when it lasts forever. There are a few scenarios which may end up with such a long-lasting lock.
Many comprehensive articles may be found on Metalink:
  1. Solving locking problems in a XA environment [ID 1248848.1] - short, thorough and I like it
  2. Manually Resolving In-Doubt Transactions: Different Scenarios [ID 126069.1] - long and boring
In many scenarios the issue should resolve itself automatically. If it does not, then theoretically the distributed transaction should be recognized by the resource manager as 'in-doubt' and show up in the dba_2pc_pending view. If so, a DBA should settle with the transaction manager administrator whether the 'in-doubt' transaction should be committed or rolled back, and then on the resource manager side call commit or rollback with the force option and the local transaction id as an argument. Sometimes one additionally needs to call DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('[LOCAL_TRAN_ID]');
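Put together, the manual resolution could look roughly like this (the transaction id below is of course a placeholder):

```sql
-- find the in-doubt transactions
SELECT local_tran_id, global_tran_id, state FROM dba_2pc_pending;

-- after agreeing on the outcome with the TM administrator, one of:
COMMIT FORCE '1.21.17';
-- ROLLBACK FORCE '1.21.17';

-- if the entry lingers in dba_2pc_pending afterwards:
EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('1.21.17');
COMMIT;
```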
Such a situation may be resolved even more easily from the XA client side - we may prepare a fake client which binds to the database as a participant of the chosen distributed transaction and commits or rolls it back as needed, while all the cleanup is done by the standard mechanisms (this is the recommended way of dealing with such issues). Example client code may be obtained from the first article mentioned above.
Yet another way of resolving the issue is the database reboot (but this is not always possible).

A distributed transaction goes through a few phases.
In the 1st phase - not yet prepared - the locks are visible in the v$lock view and diagnostic information is easy to obtain.
After the prepare call and response, those locks are no longer seen in v$lock and diagnostics become harder. Yet another possibility here is that the transaction process on the transaction manager side was killed by the system - then we end up with an active transaction still blocking some resources, but with no locks visible (there is however an entry in the v$locked_object view, so at least we may find the objects with locked rows) and no session attached.
In this scenario we may use the excellent view from Tanel Poder, changing the equal join between sys.x$k2gte2 and v$session to a left join. From this we may see that some transactions in x$k2gte2 have no corresponding row in v$session - these are our candidates for a forced finish. In x$k2gte2 we find the global id of the 'in-doubt' transaction, and the local id can be constructed from the join with sys.x$ktcxb (x$k2gte2.k2gtdxcb=x$ktcxb.ktcxbxba) on the columns kxidusn, kxidslt, kxidsqn (it seems no left join is needed here), so we may do something about it.
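The core of such a query could be sketched as below. The x$ structures are undocumented internals and their column names vary between versions, so treat this as an illustration of the join described above rather than a ready-made script:

```sql
-- distributed transactions with no session attached, with the local
-- transaction id reconstructed as usn.slot.sqn
SELECT g.k2gtitid_ext AS global_tran_id,
       t.kxidusn || '.' || t.kxidslt || '.' || t.kxidsqn AS local_tran_id
FROM   sys.x$k2gte2 g
       JOIN sys.x$ktcxb t
         ON g.k2gtdxcb = t.ktcxbxba
       LEFT JOIN v$session s
         ON g.k2gtdses = s.saddr
WHERE  s.saddr IS NULL;
```

The rows returned are the candidates for COMMIT FORCE / ROLLBACK FORCE with the reconstructed local_tran_id.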
Once there was a workaround published on Metalink which involved inserts into pending_trans$ and pending_session$, but I could not find it lately - so I suppose they now rather recommend the fake-client approach. One may still read about it after some googling - voila http://tazlambert.wordpress.com/2011/11/24/ora-01591-lock-held-by-in-doubt-distributed-transaction-string/ (look at the end of the article).
As a kind of curiosity I may add that I once found an x$k2gte2 entry which had no corresponding entry in the sys.x$ktuxe structure or the v$transaction view. There still was an entry in sys.x$ktcxb (that is why I assumed no left join is needed there), but all the parts of the local transaction id were equal to 0, so I suppose this leftover has no impact on database activity and the only way to clean it up is a reboot.

XA on Oracle database

The Oracle RDBMS is XA compliant out of the box - no adjustments needed - http://www.orafaq.com/wiki/XA_FAQ is a good starting point on this subject.

One may create additional views for dealing with troublesome situations. The Oracle views V$XATRANS$ and V$PENDING_XATRANS$ may be created to help with diagnostics (they are created by running @?/rdbms/admin/xaview.sql).
Another very good view is available at http://blog.tanelpoder.com/files/scripts/gtrans.sql. I usually change the join between sys.x$k2gte2 and v$session into a LEFT JOIN, as it is still possible to have active distributed transactions on a database that are not connected to any session.

Update from 6 May 2014

Lately we got a request to fix the cooperation between the Oracle RDBMS and Tuxedo.
The errors were as follows:
080307.tux24int!TMS_ORA.18176.1.0: 05-06-2014: Tuxedo Version 11.1.1.2.0, 32-bit
080307.tux24int!TMS_ORA.18176.1.0: LIBTUX_CAT:262: INFO: Standard main starting
080307.tux24int!TMS_ORA.18176.1.0: CMDTUX_CAT:409: ERROR: xa_recover() returned -3 for group ON1LT
080307.tux24int!TMS_ORA.18176.1.0: LIBTUX_CAT:250: ERROR: tpsvrinit() failed
080307.tux24int!tmboot.18147.1.-2: CMDTUX_CAT:825: ERROR: Process TMS_ORA at tux24int failed with /T tperrno (TPESYSTEM - internal system error)
080307.tux24int!TMS_ORA.18177.1.0: 05-06-2014: Tuxedo Version 11.1.1.2.0, 32-bit
080307.tux24int!TMS_ORA.18177.1.0: LIBTUX_CAT:262: INFO: Standard main starting
080308.tux24int!TMS_ORA.18177.1.0: CMDTUX_CAT:409: ERROR: xa_recover() returned -3 for group ON1LT
080308.tux24int!TMS_ORA.18177.1.0: LIBTUX_CAT:250: ERROR: tpsvrinit() failed
080308.tux24int!tmboot.18147.1.-2: CMDTUX_CAT:825: ERROR: Process TMS_ORA at tux24int failed with /T tperrno (TPESYSTEM - internal system error)
Well, it seems some adjustments are needed after all - though of course it is possible that the devil hides in our configuration. Anyway, besides running the xaview script, I also granted the privileges below to the user which Tuxedo connects as:
grant SELECT on SYS.DBA_PENDING_TRANSACTIONS to {schema};
grant EXECUTE on SYS.DBMS_SYSTEM to {schema};
-- this one is on a view created by xaview.sql
grant SELECT on SYS.V$XATRANS$ to {schema}; 
That seems to have solved the issue.