Monday, February 13, 2017

How to check table distribution.


Question:
            
How can you tell if a table is well distributed?

Answer:

We hope (by making good PI choices and hashing on the PI columns) to get data distributed across the amps as evenly possible. Ideally, each amp would have the same number of rows. Rarely do we achieve perfection, but we often get close.

The following SQL will count the number of rows on each amp and report the (positive or negative) deviation from the average.

Select  dt1.a (title'AMP') ,dt1.b (title'Rows') ,((dt1.b/dt2.x (float)) - 1.0) * 100 (format'+++9%',title'Deviation') from (select hashamp(hashbucket(hashrow(<primary index>))) ,count(*) from <databasename>.<tablename> group by 1 )dt1 (a,b)  ,(sel (count(*) / (hashamp()+1)(float)) FROM <databasename>.<tablename> )dt2(x) order by 2 desc,1;

Ex :

Select  dt1.a (title'AMP') ,dt1.b (title'Rows') ,((dt1.b/dt2.x (float)) - 1.0) * 100 (format'+++9%',title'Deviation') from (select hashamp(hashbucket(hashrow(RetailstoreID))) ,count(*) from teraqa.copy_src  group by 1 )dt1 (a,b)  ,(sel (count(*) / (hashamp()+1)(float)) FROM  teraqa.copy_src  )dt2(x) order by 2 desc,1;

Running this will show:
      
 AMP         Rows  Deviation
-----------  -----------  ---------
         9           63       +15%
         2           62       +13%
         4           61       +11%
         5           60        +9%
        11           59        +7%
        12           59        +7%
        13           58        +5%
        15           57        +4%
        14           55        +0%
         0           53        -4%
         8           52        -5%
        10           52        -5%
         3           51        -7%
         6           51        -7%
         7           51        -7%
         1           49       -11%

You need not worry until the max or min vary from the average by at least 15%.

Hash synonyms (two different values hashing to the same rowhash) are sometimes an issue with distribution questions. Here is a query to check:

.retlimit 10
Select  h (title'Hash') ,c (title'Frequency') from (select  hashrow(<primary index>) ,count(*) from <database>.<tablename> group by 1 )dt (h,c) qualify rank(c desc) <= 10 order by 2 desc, 1;

Ex :

Select  h (title'Hash') ,c (title'Frequency') from (select  hashrow(RetailstoreID) ,count(*) from teraqa.copy_src  group by 1 )dt (h,c) qualify rank(c desc) <= 10 order by 2 desc, 1;

For example:

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
show table temp.table1;

show table temp.table1;

 *** Text of DDL statement returned.
*** Total elapsed time was 1 second.

---------------------------------------------------------------------------
CREATE SET TABLE temp.table1 ,NO FALLBACK ,
    NO BEFORE JOURNAL,
    NO AFTER JOURNAL
    (
     c1 CHAR(44) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL DEFAULT '
                                    ',
     c2 INTEGER NOT NULL,
     c3 SMALLINT NOT NULL,
     c4 CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL DEFAULT '
                      ')
PRIMARY INDEX I1 ( c1 )
UNIQUE INDEX I2 ( c1 ,c2 )
INDEX I3 ( c3 );
References to Procedures:
           
Additional Info/Comments:

            A lump is a bunch of rows where some column has the same value. Lumpiness is the opposite of uniqueness. The ultimate lump would be all rows in a table having the same value in some column.
Current Attachments:
           


No comments:

Post a Comment