Friday, 29 June 2018

ORA-01554: transaction concurrency limit reached reason:no undo segment found with available slot params:0, 0 :

Some time ago I found such error on one of our databases. It is a bit mysterious as there are not much about it on Metalink (few bugs) or on the internet (at least when searching by Google). But indeed the database hit a limit of concurrent transactions in some very short period of time. On the performance chart in OEM there is a high peak with an INSERT to some custom table, with waits on TX - index contention, buffer busy, enq: SQ contention and undo segment tx slot.
The database is working with UNDO_MANAGEMENT=AUTO, which is a default setting and since 10g used very widely, so the administrator may simply forget about rollback segments management. According to the docs with this setting some work parameters relevant to the rollback management are simply ignored (TRANSACTIONS, TRANSACTIONS_PER_ROLLBACK_SEGMENT), while the number of transactions is limited by the undo tablespace size.
So it seems we should not experience this error, but rather session suspension and the error about lack of space in the undo tablespace.
Yet here are are. It is clear from the numbers below, that still there is a space in the undo (up to 4598 undo blocks used
What I could check is this:
select BEGIN_TIME, UNDOTSN, UNDOBLKS, TXNCOUNT, MAXQUERYLEN, MAXQUERYID, MAXCONCURRENCY, 
ACTIVEBLKS, UNEXPIREDBLKS, EXPIREDBLKS, TUNED_UNDORETENTION 
from v$undostat 
where begin_time between to_date('2018-06-25 14', 'yyyy-mm-dd hh24') and to_date('2018-06-25 16', 'yyyy-mm-dd hh24') 
order by begin_time;

BEGIN_TIMEUNDOBLKSTXNCOUNTMAXQUERYLENMAXCONCURRENCYACTIVEBLKSUNEXPIREDBLKSEXPIREDBLKSTUNED_UNDORETENTION
2018-06-25 14:3907816520160071682372
2018-06-25 14:4914119116532160071682373
2018-06-25 14:5945981042251653131160473658882373
2018-06-25 15:092169103321933311160665643521653
2018-06-25 15:19228217341160806428162454
Since the midnight the number of undo blocks consumed was 63475 with 524416 available blocks (so ~12%) and maximum query length was below 1h. Thus definitely we were far below the undo space limit. From this figures it is clear that still it is possible to hit an error typical for MANUAL undo management scenario, while on AUTO - as here we are.

And here few words of wisdom from Jonathan Lewis:
Taking a quick guess:

An (automatic) undo segment has 34 slots in the transaction table (the thing in the undo segment header).
If you and up with a small number N of undo segments then you can't have more than 34 * N concurrent transactions active because each one needs a transaction table slot and the next attempt to start a transaction would see ORA-01554

It seems a little unlikely that this would happan in typical system since Oracle would usually end up stealing an extent from an existing undo segment to use it as the first extent of a new undo segment. But it wouldn't be hard to set up a mechanism that started with a small number of undo segments in a relatively small undo tablespace - started some transactions (then leave them idle forever after the first change - e.g. they might have come in from a remote database) that made it impossible to free any extents, then did a lot of work to allocate all the extents (but still left a little space in allocated extents for continuting transactions), then started lots of transactions until you hit the error.

And Julian Dyke's presentation:
Undo segments are allocated at instance startup.
Undo segments can be added dynamically.

Each undo segment header contains
  • Pool of free undo extents
  • Set of undo slots

One undo slot is allocated to each transaction.
Undo slot contains list of undo extents.
Extents can migrate from one undo segment to another.
Undo slots are used cyclically:
  • remain in header as long as possible
  • reduces probability of ORA-01555: Snapshot too old

With all this in mind I would say that 1000 times higher transaction rate is in this particular case the main reason for the error. Apparently the db is not able to extend its concurrent transactions rate capability fast enough to respond to extended demand. Possible action here could be initial setting of _rollback_segment_count to a higher value than 12, which are set here.