Thursday 29 May 2014

Zabbix 2.2.3 and ZBX-7373


Migration from 2.0.2 to 2.2.3

Lately we have upgraded the Zabbix installation from version 2.0.2 to 2.2.3. The installation run quite smoothly (some minor issues were all bound with our own optimizations) - we did it in around half an hour. So far so good, but when we launch it, we face the high IO load and growing at the steady pace until the whole system was actually out of service.

Diagnostics

As we have already upgraded the database, the return to the previous version would take us several hours, during which we would be quite blind in regard to production operations. So we started to look for the reasons.
  • in the Oracle db there is a concept of services. A service is a name registered on listener and used for identification of a particular database instance. For version 2.0.2 we run the system on 2 service names: zbx_gui and zbx_server. This is easy to do for Zabbix as it actually consists of few components among which there is gui, server and others.
    We spotted that all the IO intensive operations came from the gui part of Zabbix. So we disabled the gui to confirm this observation. The whole system immediately return to proper work. I may add that it does even better than with previous version and assume this is due to use of bind variables - it takes less CPU (though enabling CURSOR_SHARING to FORCE)
  • hopefully the problem has been located in the Zabbix part, which is done in PHP, so the fix is not so troublesome as in case of the part in C (no recompilation, etc). From the monitoring we knew that the most troublesome query was the one, which we later found in the api part. It is run for history tables, which are the biggest entities in the Zabbix database.
    class CHistoryManager {
    
     /**
      * Returns the last $limit history objects for 
      * the given items.
      *
      * @param array $items  an array of items with 
      * the 'itemid' and 'value_type' properties
      * @param int $limit
      *
      * @return array    an array with items IDs as keys 
      * and arrays of history objects as values
      */
     public function getLast(array $items, $limit = 1) {
      $rs = array();
      foreach ($items as $item) {
       $table = self::getTableName($item['value_type']);
       $query = DBselect(
        'SELECT *'.
        ' FROM '.$table.' h'.
        ' WHERE h.itemid='.zbx_dbstr($item['itemid']).
        ' ORDER BY h.clock DESC',
        $limit
       );
       while ($history = DBfetch($query)) {
        $rs[$history['itemid']][] = $history;
       }
      }
    
      return $rs;
     }
    

    My colleagues have found also that similar bag was registered under the ZBX-7373 codename in the Zabbix bug repository.
  • we started to look into details of this query in order to find how to improve it for better performance and here the details of the analysis:
    • we checked the following query
      select * from 
        (select * from ZABBIX.HISTORY H where H.ITEMID=:Z 
           order by H.CLOCK desc) 
      where rownum<2;
      
    • here is the plan for the query
      Plan hash value: 168015457
       OPERATION   OBJ NAME   COST   CR BUFFER GETS   PARTITION START   PARTITION STOP 
      SELECT STATEMENT
         64694       
           COUNT
           10049     
                Filter Predicates 
                   ROWNUM<:SYS_B_0 
                PARTITION RANGE 
         64694   10049   1048575   1 
                    VIEW 
         64694   10049     
                        SORT 
         64694   10049     
                            Filter Predicates 
                                ROWNUM<:SYS_B_0 
                            TABLE ACCESS 
       HISTORY   64277   10049   1048575   1 
                                INDEX 
       P_HISTORY_1   433   70   1048575   1 
                                    Access Predicates 
                                        H.ITEMID=TO_NUMBER(:Z) 

      Predicate Information (identified by operation id):

      1 - filter(ROWNUM<2)
      4 - filter(ROWNUM<2)
      6 - access("H"."ITEMID"=TO_NUMBER(:Z))

Solution

  • we assumed the query simply takes to much blocks, even when optimized with the index range scan. The solution is to limit clock values from the bottom. From start we tried the fixed value of 2 days, but this was both inflexible and wrong in certain cases. Finally we go with the limit which is based on the delay value from the items table.
    Here goes the query in final shape:
    select * from (
    select * from ZABBIX.HISTORY H where H.ITEMID=:Z 
    and CLOCK>(cast((systimestamp at time zone 'UTC') as date)
      -TO_DATE('1970-01-01', 'yyyy-mm-dd'))*86400
        -1*(select delay from ZABBIX.ITEMS where ITEMID=:Z)
    order by H.CLOCK desc
    ) where rownum<2;
    
    -- update 2014-05-30
    -- the query above does not take into consideration items with delay equal to 0
    -- a possible solution to this issue below
    -- Please keep in mind, the query is not thought to service every scenario - e.g. 
    -- the asynchronous
    select * from (
    select * from ZABBIX.HISTORY H where H.ITEMID=:Z 
    and CLOCK>(cast((systimestamp at time zone 'UTC') as date)
      -TO_DATE('1970-01-01', 'yyyy-mm-dd'))*86400
        -(select 
          case 
            -- complex schedule
            when type=0 and delay=0 then 1*50
            -- no schedule, async 
            when type=2 and delay=0 then 3*86400
            else 1*delay
          end delay from ZABBIX.ITEMS where ITEMID=:Z)
    order by H.CLOCK desc
    ) where rownum<2;
    
  • here goes the plan for the final query:

    Plan hash value: 1183558203
     OPERATION   OBJ NAME   COST   CR BUFFER GETS   PARTITION START   PARTITION STOP 
     SELECT STATEMENT 
       609       
          COUNT 
         8     
              Filter Predicates 
                  ROWNUM<:SYS_B_5 
              PARTITION RANGE 
       609   8   1048575   KEY 
                  VIEW 
       609   5     
                      SORT 
       609   5     
                          Filter Predicates 
                              ROWNUM<:SYS_B_5 
                          TABLE ACCESS 
     HISTORY   606   5   1048575   KEY 
                              INDEX 
     P_HISTORY_1   31   4   1048575   KEY 
                                  Access Predicates 
                                      AND 
                                          H.ITEMID=TO_NUMBER(:Z) 
                                          CLOCK>(CAST(SYSTIMESTAMP(6) AT TIME ZONE 'UTC' AS date)-TO_DATE(:SYS_B_1,:SYS_B_2))*:SYS_B_3-:SYS_B_4* 
                                  TABLE ACCESS 
     ITEMS   2   3     
                                      INDEX 
     SYS_C0018758   1   2     
                                          Access Predicates 
                                              ITEMID=TO_NUMBER(:Z) 

    Predicate Information (identified by operation id):

    1 - filter(ROWNUM<2)
    4 - filter(ROWNUM<2)
    6 - access("H"."ITEMID"=TO_NUMBER(:Z)
      AND "CLOCK">(CAST(SYSTIMESTAMP(6) AT TIME ZONE 'UTC' AS date)
      -TO_DATE(' 1970-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))*86400
      -1* (SELECT "DELAY" FROM "ZABBIX"."ITEMS" "ITEMS" WHERE "ITEMID"=TO_NUMBER(:Z)))
    8 - access("ITEMID"=TO_NUMBER(:Z))

  • as the last argument let me provide a short example: for a randomly chosen item the unoptimized query takes 10049 consistent gets and next executions do not change anything, while the optimized query takes for the first time 158 consistent gets and in next executions only 11 per execution.
  • The solution presented here is universal assuming You use the Oracle RDBMS. To get a solution, which would be portable between databases one have to calculate the current unix timestamp in PHP and embed it into the query instead of the part "(cast((systimestamp at time zone 'UTC') as date) - TO_DATE('1970-01-01', 'yyyy-mm-dd'))*86400". The database have to support subqueries (if not the solution would be to call first for the delay value). At the time of upgrade there was no solution to the ZBX-7373, but now it was included into versions 2.3.1 and 2.2.1rc1. Anyway I belive our solution with subquery is more elegant to the one included in this bug.

Update after migration to the 2.4.4 version

Lately we upgraded zabbix to the newest 2 version (2.4.4) and here few remarks.
There were no major issues during the upgrade procedure. AFAIK we lost some filter settings (so there emerged hundreds of not important alerts) and as usually scheduler tasks were disabled. No performance issues at this point.
Fixing of the filters was quite simple - it required repetition of filter creation process.
Then after enabling scheduler tasks (based on the API) we spotted quite heavy I/O load. Replacing CHistoryManager with custom version (according to rules provided above, though we left the new period argument usage, made things work smooth again. Below new code for the getLast method - the signature is compliant with the 2.4.4 version
public function getLast(array $items, $limit = 1, $period = null) {
    $rs = array();
    $que1 = '(cast((systimestamp at time zone \'UTC\') as date)'.
        ' -TO_DATE(\'1970-01-01\', \'yyyy-mm-dd\'))*86400';
    $que2 = 'select case '.
        'when type=0 and delay=0 then '.$limit.'*60 '.
        'when type=2 and delay=0 then 3*86400 '.
        'else '.$limit.'*delay '.
        'end delay1 from ZABBIX.ITEMS where ITEMID=';
    foreach ($items as $item) {
        $values = DBfetchArray(DBselect(
            'SELECT *'.
            ' FROM '.self::getTableName($item['value_type']).' h'.
            ' WHERE h.itemid='.zbx_dbstr($item['itemid']).
            ($period ? ' AND h.clock>'.(time() - $period) : '').
            ' and h.clock >= '.$que1.' -('.$que2.zbx_dbstr($item['itemid']).')'.
            ' ORDER BY h.clock DESC',
            $limit
        ));

        if ($values) {
            $rs[$item['itemid']] = $values;
        }
    }
    return $rs;
}

No comments: