Migration from 2.0.2 to 2.2.3
Lately we have upgraded the Zabbix installation from version 2.0.2 to 2.2.3. The installation run quite smoothly (some minor issues were all bound with our own optimizations) - we did it in around half an hour. So far so good, but when we launch it, we face the high IO load and growing at the steady pace until the whole system was actually out of service.Diagnostics
As we have already upgraded the database, the return to the previous version would take us several hours, during which we would be quite blind in regard to production operations. So we started to look for the reasons.- in the Oracle db there is a concept of services. A service is a name registered on listener and used for identification of a particular database instance. For version 2.0.2 we run the system on 2 service names: zbx_gui and zbx_server. This is easy to do for Zabbix as it actually consists of few components among which there is gui, server and others.
We spotted that all the IO intensive operations came from the gui part of Zabbix. So we disabled the gui to confirm this observation. The whole system immediately return to proper work. I may add that it does even better than with previous version and assume this is due to use of bind variables - it takes less CPU (though enabling CURSOR_SHARING to FORCE) - hopefully the problem has been located in the Zabbix part, which is done in PHP, so the fix is not so troublesome as in case of the part in C (no recompilation, etc). From the monitoring we knew that the most troublesome query was the one, which we later found in the api part. It is run for history tables, which are the biggest entities in the Zabbix database.
class CHistoryManager { /** * Returns the last $limit history objects for * the given items. * * @param array $items an array of items with * the 'itemid' and 'value_type' properties * @param int $limit * * @return array an array with items IDs as keys * and arrays of history objects as values */ public function getLast(array $items, $limit = 1) { $rs = array(); foreach ($items as $item) { $table = self::getTableName($item['value_type']); $query = DBselect( 'SELECT *'. ' FROM '.$table.' h'. ' WHERE h.itemid='.zbx_dbstr($item['itemid']). ' ORDER BY h.clock DESC', $limit ); while ($history = DBfetch($query)) { $rs[$history['itemid']][] = $history; } } return $rs; }
My colleagues have found also that similar bag was registered under the ZBX-7373 codename in the Zabbix bug repository. - we started to look into details of this query in order to find how to improve it for better performance and here the details of the analysis:
- we checked the following query
select * from (select * from ZABBIX.HISTORY H where H.ITEMID=:Z order by H.CLOCK desc) where rownum<2;
-
here is the plan for the query
Plan hash value: 168015457 OPERATION OBJ NAME COST CR BUFFER GETS PARTITION START PARTITION STOP SELECT STATEMENT 64694 COUNT 10049 Filter Predicates ROWNUM<:SYS_B_0 PARTITION RANGE 64694 10049 1048575 1 VIEW 64694 10049 SORT 64694 10049 Filter Predicates ROWNUM<:SYS_B_0 TABLE ACCESS HISTORY 64277 10049 1048575 1 INDEX P_HISTORY_1 433 70 1048575 1 Access Predicates H.ITEMID=TO_NUMBER(:Z)
Predicate Information (identified by operation id):
1 - filter(ROWNUM<2)
4 - filter(ROWNUM<2)
6 - access("H"."ITEMID"=TO_NUMBER(:Z))
- we checked the following query
Solution
- we assumed the query simply takes to much blocks, even when optimized with the index range scan. The solution is to limit clock values from the bottom. From start we tried the fixed value of 2 days, but this was both inflexible and wrong in certain cases. Finally we go with the limit which is based on the delay value from the items table.
Here goes the query in final shape:select * from ( select * from ZABBIX.HISTORY H where H.ITEMID=:Z and CLOCK>(cast((systimestamp at time zone 'UTC') as date) -TO_DATE('1970-01-01', 'yyyy-mm-dd'))*86400 -1*(select delay from ZABBIX.ITEMS where ITEMID=:Z) order by H.CLOCK desc ) where rownum<2; -- update 2014-05-30 -- the query above does not take into consideration items with delay equal to 0 -- a possible solution to this issue below -- Please keep in mind, the query is not thought to service every scenario - e.g. -- the asynchronous select * from ( select * from ZABBIX.HISTORY H where H.ITEMID=:Z and CLOCK>(cast((systimestamp at time zone 'UTC') as date) -TO_DATE('1970-01-01', 'yyyy-mm-dd'))*86400 -(select case -- complex schedule when type=0 and delay=0 then 1*50 -- no schedule, async when type=2 and delay=0 then 3*86400 else 1*delay end delay from ZABBIX.ITEMS where ITEMID=:Z) order by H.CLOCK desc ) where rownum<2;
- here goes the plan for the final query:
Plan hash value: 1183558203OPERATION OBJ NAME COST CR BUFFER GETS PARTITION START PARTITION STOP SELECT STATEMENT 609 COUNT 8 Filter Predicates ROWNUM<:SYS_B_5 PARTITION RANGE 609 8 1048575 KEY VIEW 609 5 SORT 609 5 Filter Predicates ROWNUM<:SYS_B_5 TABLE ACCESS HISTORY 606 5 1048575 KEY INDEX P_HISTORY_1 31 4 1048575 KEY Access Predicates AND H.ITEMID=TO_NUMBER(:Z) CLOCK>(CAST(SYSTIMESTAMP(6) AT TIME ZONE 'UTC' AS date)-TO_DATE(:SYS_B_1,:SYS_B_2))*:SYS_B_3-:SYS_B_4* TABLE ACCESS ITEMS 2 3 INDEX SYS_C0018758 1 2 Access Predicates ITEMID=TO_NUMBER(:Z)
Predicate Information (identified by operation id):
1 - filter(ROWNUM<2)
4 - filter(ROWNUM<2)
6 - access("H"."ITEMID"=TO_NUMBER(:Z)
AND "CLOCK">(CAST(SYSTIMESTAMP(6) AT TIME ZONE 'UTC' AS date)
-TO_DATE(' 1970-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))*86400
-1* (SELECT "DELAY" FROM "ZABBIX"."ITEMS" "ITEMS" WHERE "ITEMID"=TO_NUMBER(:Z)))
8 - access("ITEMID"=TO_NUMBER(:Z))
- as the last argument let me provide a short example: for a randomly chosen item the unoptimized query takes 10049 consistent gets and next executions do not change anything, while the optimized query takes for the first time 158 consistent gets and in next executions only 11 per execution.
- The solution presented here is universal assuming You use the Oracle RDBMS. To get a solution, which would be portable between databases one have to calculate the current unix timestamp in PHP and embed it into the query instead of the part "(cast((systimestamp at time zone 'UTC') as date) - TO_DATE('1970-01-01', 'yyyy-mm-dd'))*86400". The database have to support subqueries (if not the solution would be to call first for the delay value). At the time of upgrade there was no solution to the ZBX-7373, but now it was included into versions 2.3.1 and 2.2.1rc1. Anyway I belive our solution with subquery is more elegant to the one included in this bug.
Update after migration to the 2.4.4 version
Lately we upgraded zabbix to the newest 2 version (2.4.4) and here few remarks.There were no major issues during the upgrade procedure. AFAIK we lost some filter settings (so there emerged hundreds of not important alerts) and as usually scheduler tasks were disabled. No performance issues at this point.
Fixing of the filters was quite simple - it required repetition of filter creation process.
Then after enabling scheduler tasks (based on the API) we spotted quite heavy I/O load. Replacing CHistoryManager with custom version (according to rules provided above, though we left the new period argument usage, made things work smooth again. Below new code for the getLast method - the signature is compliant with the 2.4.4 version
public function getLast(array $items, $limit = 1, $period = null) { $rs = array(); $que1 = '(cast((systimestamp at time zone \'UTC\') as date)'. ' -TO_DATE(\'1970-01-01\', \'yyyy-mm-dd\'))*86400'; $que2 = 'select case '. 'when type=0 and delay=0 then '.$limit.'*60 '. 'when type=2 and delay=0 then 3*86400 '. 'else '.$limit.'*delay '. 'end delay1 from ZABBIX.ITEMS where ITEMID='; foreach ($items as $item) { $values = DBfetchArray(DBselect( 'SELECT *'. ' FROM '.self::getTableName($item['value_type']).' h'. ' WHERE h.itemid='.zbx_dbstr($item['itemid']). ($period ? ' AND h.clock>'.(time() - $period) : ''). ' and h.clock >= '.$que1.' -('.$que2.zbx_dbstr($item['itemid']).')'. ' ORDER BY h.clock DESC', $limit )); if ($values) { $rs[$item['itemid']] = $values; } } return $rs; }