Question:
How can you tell if a table is well
distributed?
Answer:
We hope (by making
good PI choices and hashing on the PI columns) to get data distributed across
the amps as evenly possible. Ideally, each amp would have the same number of
rows. Rarely do we achieve perfection, but we often get close.
The following SQL will count the number of rows on each
amp and report the (positive or negative) deviation from the average.
Select dt1.a (title'AMP') ,dt1.b (title'Rows')
,((dt1.b/dt2.x (float)) - 1.0) * 100 (format'+++9%',title'Deviation') from
(select hashamp(hashbucket(hashrow(<primary index>))) ,count(*) from
<databasename>.<tablename> group by 1 )dt1 (a,b) ,(sel (count(*) / (hashamp()+1)(float)) FROM
<databasename>.<tablename> )dt2(x) order by 2 desc,1;
Ex :
Select dt1.a (title'AMP') ,dt1.b (title'Rows')
,((dt1.b/dt2.x (float)) - 1.0) * 100 (format'+++9%',title'Deviation') from
(select hashamp(hashbucket(hashrow(RetailstoreID))) ,count(*) from
teraqa.copy_src group by 1 )dt1
(a,b) ,(sel (count(*) /
(hashamp()+1)(float)) FROM
teraqa.copy_src )dt2(x) order by
2 desc,1;
Running this will show:
AMP Rows
Deviation
-----------
----------- ---------
9 63 +15%
2 62 +13%
4 61 +11%
5 60 +9%
11 59 +7%
12 59 +7%
13 58 +5%
15 57 +4%
14 55 +0%
0 53 -4%
8 52 -5%
10 52 -5%
3 51 -7%
6 51 -7%
7 51 -7%
1 49 -11%
You need not worry until the max or min vary from the
average by at least 15%.
Hash synonyms (two different values hashing to the same
rowhash) are sometimes an issue with distribution questions. Here is a query to
check:
.retlimit 10
Select h
(title'Hash') ,c (title'Frequency') from (select hashrow(<primary index>) ,count(*) from
<database>.<tablename> group by 1 )dt (h,c) qualify rank(c desc)
<= 10 order by 2 desc, 1;
Ex :
Select h
(title'Hash') ,c (title'Frequency') from (select hashrow(RetailstoreID) ,count(*) from teraqa.copy_src group by 1 )dt (h,c) qualify rank(c desc)
<= 10 order by 2 desc, 1;
For example:
BTEQ -- Enter your
DBC/SQL request or BTEQ command:
show table temp.table1;
show table temp.table1;
*** Text of DDL
statement returned.
*** Total elapsed time was 1 second.
---------------------------------------------------------------------------
CREATE SET TABLE temp.table1 ,NO FALLBACK ,
NO BEFORE
JOURNAL,
NO AFTER
JOURNAL
(
c1 CHAR(44)
CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL DEFAULT '
',
c2 INTEGER NOT
NULL,
c3 SMALLINT
NOT NULL,
c4 CHAR(30)
CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL DEFAULT '
')
PRIMARY INDEX I1 ( c1 )
UNIQUE INDEX I2 ( c1 ,c2 )
INDEX I3 ( c3 );
References to Procedures:
Additional Info/Comments:
A lump
is a bunch of rows where some column has the same value. Lumpiness is the
opposite of uniqueness. The ultimate lump would be all rows in a table having
the same value in some column.
Current Attachments:
No comments:
Post a Comment