Monday, 16 January 2012

Cursors with high VERSION_COUNT

Symptoms
Since upgrade to 11gR2 from time to time we experience problems with excessive number of cursor children per some cursors. Symptoms are common - many active sessions with waits on mutex X, mutex S, library cache latch, and few others, high VERSION_COUNT for those cursors.
What happens?
It seems those all waits are due to very slow work of shared pool for those cursors. My understanding of this issue is all the children of such cursor are sharing the same sql_id and hash_value, so eventually they end up in the same bucket of library cache and the more of them, the longer sessions need to hold the latch to library cache or mutexes in order to check all the children to choose the possibly matched for sharing.

Root causes
The root causes for multiplying the cursor children are various. Some reasons are provided by the V$SQL_SHARED_CURSOR, there is a plenty of bugs (search term high VERSION_COUNT on Metalink).
In our case the reasons are few and not sure the concretes, as few explanations are possible. The bug is one of them, however as probable is wrong use of bind variables. From V$SQL_SHARED_CURSOR we get as a reason mainly BIND_MISMATCH, with possible addition of BIND_LENGTH_UPGRADEABLE. Partially this is due to the change of a bind variable length (the buffer for bind value is allocated up to 32 bytes or up to 128 bytes or up to 2000 bytes - at least those were revealed by observing trace 10046, so I would not say how it is beyond 2000 bytes - are there other thresholds or the bind possible maximum is the limit).

Workarounds
Whatever the reasons workarounds are similar and I dare to say "officially" used. What is important this behavior is not seen on 10g (or I would rather say it is hidden). This is due to the obsoleting cursors with more than 1024 children, what is not a case with 11g. So the clue of a workaround is mimic this by:
  1. workaround in the form of purging excessive children with use of dbms_shared_pool.purge
  2. hidden parameter _cursor_obsolete_threshold and the MOS article 10187168.8
True solutions They depend on the root causes. This varies from applying patch to changing the SQL to changing bind variables (size or type) to impossible or better say independent of our doings - great example/article is the entry by Martin Klier. Update 2013.08.08 Few weeks ago we have another attack of mutexes' waits. It was another bug (11930680), where the workaround was to disable optimizer_secure_view_merging. The diagnosis method for such problems may be called generic - when one see a huge bunch of session waiting on "mutex X" or "mutex S" or "cursor pin S wait on X" and all of them try to perform the same query then:
  • count the number of children for the query
  • if high (few hundreds to thousands) look at V$SQL_SHARED_CURSOR in order to figure out why there are so many of them
  • google Google (or any other favourite search engine) and MOS with the reasons found in V$SQL_SHARED_CURSOR
  • as a symptom softening measure one may run the cursor purge as a job in short interval
On the grounds of short experience I may say that usually the very high cursor children count is brought by some bug. With poorly written code we hit ~200 children per cursor, with bugs it was up to 5.000 or more. But as I said the experience is short...
Here one may find that similar contention may arise from different than bug reasons.

No comments: