Thursday, February 9, 2017

Notes on Statistics in Teradata


=============================================================
When and how should statistics be collected?
=============================================================

Procedure:

Statistics are data demographics "hints" used by the Teradata Optimizer. There are many ways to generate a query plan for a given SQL statement, and collecting statistics ensures that the Optimizer has the most accurate information available to create the best access and join plans. Without collected statistics, the Optimizer assumes that any column or non-unique index is highly non-unique and creates join plans accordingly. Collect statistics on any columns or non-unique indexes that are actually fairly or highly unique, so that the Optimizer can make proper use of them in join plans and thus enhance performance.
  
===============================================================
Statistics should be collected for the following tables / columns
===============================================================

For small tables where the number of rows is less than 5 times the number of AMPs, collect statistics on the primary index.

For large tables, collect statistics on all NUPIs (non-unique primary indexes). For all tables, collect statistics on the columns in your join conditions (i.e., the columns that appear in any DML WHERE clause).

Once collected, statistics should be maintained to accurately reflect the data demographics of the underlying table. It is better to have no statistics than stale or otherwise incorrect statistics. A rule of thumb is to recollect statistics once the data has changed by 10% (that is, 10% more rows inserted, 10% of the rows deleted, 10% of the rows changed, or some combination).
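Applied literally, these guidelines translate into statements like the following (a sketch only; the database, table, and column names are hypothetical):

COLLECT STATISTICS ON sandbox.small_dim INDEX (dim_id);    /* small table: primary index */
COLLECT STATISTICS ON sandbox.big_fact INDEX (cust_id);    /* large table: NUPI */
COLLECT STATISTICS ON sandbox.big_fact COLUMN order_date;  /* join / WHERE-clause column */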

=============================================================
TTIP 1062 - Why table statistics should be kept fresh with COLLECT STATISTICS
=============================================================

Tech Note:
                                           STATISTICS AND PLANS:

The optimizing phase of the DBC/SQL language processor makes decisions on how to access table data. These decisions can be very important when table joins (especially those involving multiple joins) are required by a query. By default, the Optimizer uses approximations of the number of rows in each table (known as the cardinality of the table) and of the number of unique values in indexes in making its decisions.

The Optimizer gets its approximation of the cardinality of a table by picking a random Access Module Processor (AMP) and asking that AMP how many rows there are in the table. The chosen AMP does not actually count all of the rows it has for the table, but generates an estimate based on the average row size and the number of sectors occupied by the table on that AMP; the Optimizer then multiplies that estimate by the number of AMPs in the system (making an allowance for uneven hash bucket distribution) to estimate the table cardinality. The number of unique index values is similarly estimated. Given that most of the values involved in these estimates, other than the number of AMPs in the system, are approximations, it is possible (although unusual) for the estimate to be significantly off. This can lead to poor choices of join plans, and associated increases in the response times of the queries involved.
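To illustrate with made-up numbers: if the sampled AMP's sector count and average row size suggest it holds 2,000 rows of the table, and the system has 50 AMPs, the Optimizer estimates roughly 2,000 x 50 = 100,000 rows (before the hash bucket adjustment). Any error in the single-AMP estimate is therefore multiplied across the whole system.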

One way to help the Optimizer make better decisions is to give it more accurate information as to the content of the table. This can be done using the COLLECT STATISTICS statement (see the Teradata DBS Reference Manual for details). When the Optimizer finds that there are statistics available for a referenced table, it will use those statistics instead of using estimated table cardinality or estimated unique index value counts (that is, it will trust the statistics).


======================================================
STALE STATISTICS
======================================================

Under normal circumstances, this is the right thing to do; statistics provide more detailed information, and include an exact row count as of the time that the statistics were gathered. However, if the statistics are "stale" -- that is, the table's characteristics (distribution of data values for a column or index for which statistics have been collected, number of rows in the table, etc.) have changed significantly since the statistics were last gathered -- the Optimizer can be misled into making poor, or even horrible, join plans, with associated poor performance of the queries which use the stale statistics.

To take an extreme case:

Table A - statistics gathered when the table had 1,000 rows, but the
          table now has 1,000,000 rows (perhaps statistics were
          gathered during the prototyping phase)

Table B - no statistics; table has 75,000 rows

If a product join between Table A and Table B is necessary for a given query, and one of the tables must be duplicated on all AMPs, then the Optimizer will choose Table A to be duplicated, since 1,000 rows (from the stale statistics) is much less than 75,000 rows. Since in reality Table A now has 1,000,000 rows, the Optimizer will be making a very bad decision (duplicating 1,000,000 rows instead of 75,000), and the query will run much longer than necessary.
  
Therefore, it is important that statistics be kept fresh. The Reference Manual recommends that statistics be recollected if there is as little as a 10% change -- rows added or deleted -- in the table. There are two general circumstances under which statistics can be considered to be stale:

1. The number of rows in the table has changed significantly.

The number of unique values for each statistic on a table, as well as the
date and time the statistics were last gathered, can be obtained by:

HELP STATISTICS tablename;

For statistics on unique indexes, the unique values count can be
cross-checked against the row count returned by:

SELECT COUNT(*) FROM tablename;

For statistics on non-unique columns, the HELP STATISTICS result can be
cross-checked against the count returned by the query below (a combined
form of both cross-checks is sketched after this list):

SELECT COUNT(DISTINCT columnname) FROM tablename;

2. The range of values for an index or column of a table for which statistics have been collected has changed significantly. Sometimes one can infer this from the date and time the statistics were last collected, or from the very nature of the column (for instance, if the column in question holds a transaction date, and statistics on that column were last gathered a year ago, it is almost certain that the statistics for that column are stale).
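As promised above, the two cross-checks can be combined into a single query (a sketch; tablename and columnname are the same placeholders used above):

SELECT COUNT(*) AS Row_Count
     , COUNT(DISTINCT columnname) AS Unique_Values
FROM tablename;

If either value differs markedly from what HELP STATISTICS reports, the corresponding statistics are stale.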

=========================================================
REFRESHING STALE STATISTICS
=========================================================
If the statistics for a table are suspected to be stale, they can easily be recollected:

COLLECT STATISTICS ON tablename;

will recollect statistics on all indexes and columns for which previous COLLECT STATISTICS statements were done (and for which DROP STATISTICS statements have not been done). Because collecting statistics involves a full-table scan, it may take a significant amount of time, and should therefore be done off-hours for large tables. It may be revealing to execute:

HELP STATISTICS tablename;

before and after recollecting statistics, to see what, if any, difference the recollect makes. Also, for frequently executed queries, requesting an EXPLAIN before and after recollecting statistics may show differences in join plans and/or spool row count/processing time estimates.
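For example, a before/after comparison for a frequently executed join query might look like this (a sketch; the table names and join column are hypothetical):

EXPLAIN
SELECT a.colx, b.coly
FROM tablename a, othertable b
WHERE a.keycol = b.keycol;

Differences between the two EXPLAIN outputs in join order, join method, and the estimated spool row counts and processing times show exactly what the recollect changed.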

In addition to keeping statistics current, the Reference Manual also recommends that statistics be collected for the primary index of small tables involved in joins.

STALE STATISTICS EXAMPLE:
Here is an example of stale statistics. The table definition is:

CREATE TABLE bpaefer.taemg22, FALLBACK
       ( m22_m_fer_ff DECIMAL(10,2) NOT NULL
       , m22_d_fin_validite DATE FORMAT 'YYYY-MM-DD'
       , m22_m_fer_dev DECIMAL(9,2) NOT NULL
       , m08_c_uic_m08 CHAR(2) NOT NULL
       , m22_d_deb_v_id DATE FORMAT 'YYYY-MM-DD' NOT NULL
       , p66_c_code_id CHAR(8) NOT NULL
       )
       UNIQUE PRIMARY INDEX ( m22_d_deb_v_id, p66_c_code_id ) ;

Statistics were collected on the unique primary index and individually on each of the two columns comprising that index:

COLLECT STATISTICS ON bpaefer.taemg22
INDEX ( m22_d_deb_v_id, p66_c_code_id ) ;
COLLECT STATISTICS ON bpaefer.taemg22
COLUMN m22_d_deb_v_id ;
COLLECT STATISTICS ON bpaefer.taemg22
COLUMN p66_c_code_id ;

A recent check of this table and its statistics revealed the following:

SELECT COUNT(*) FROM bpaefer.taemg22;

resulted in:

Count(*)
-----------
9203

and

HELP STATISTICS bpaefer.taemg22;

resulted in:

Date     Time            Unique Values Column Names
-------- -------- -------------------- ----------------------------
93/11/03 12:39:55                   17 M22_D_DEB_V_ID
93/11/03 12:40:09                2,095 P66_C_CODE_ID
93/11/03 13:57:17                2,275 M22_D_DEB_V_ID,P66_C_CODE_ID

Note that statistics were most recently collected almost three years ago (as of the date of this TTIP). Also note that the last line from the HELP STATISTICS is for the UNIQUE primary index, and therefore reflects the number of rows in the table when the statistics were last gathered. At that time, there were 2275 rows in the table; there are currently 9203 rows in the table, a factor of more than four difference. Finally, note that the M22_D_DEB_V_ID column is a DATE, with relatively few unique values. With the four-fold increase in the number of rows, it is almost certain that additional values for this column are present.

It is clearly time to recollect this table's statistics.
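Because the three COLLECT STATISTICS statements shown above are still in effect, a single statement refreshes all of them at once:

COLLECT STATISTICS ON bpaefer.taemg22;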

================================================================================
Script to generate COLLECT STATISTICS statements for all non-unique indexes that have not had statistics collected
================================================================================

Tech Note:

The SQL below will generate COLLECT STATISTICS statements for all non-unique indexes that have not had statistics collected, excluding the crashdumps and DBC databases. NOTE: You must log on as DBC to generate the batch job and execute it. After the script runs, a file named 'collect' contains the generated COLLECT STATISTICS statements.

/*=================================================================*/
/* Create a working database. */

create database stats as perm = 10000000;

/* Grant a right needed for macro. */

grant select on DBC to stats with grant option;

/* Create the work table. */

ct stats.idx_info
(tvmid byte(6)
,indexnumber smallint
,NameText varchar(512))
unique primary index (tvmid,indexnumber);

/* Create the macro. */

replace macro stats.build_idx_info
(pass smallint)
as
(
 update stats.idx_info
 set NameText = trim(NameText) ||
                ','            ||
                trim(DBC.TVFields.fieldname)
 where idx_info.tvmid = DBC.Indexes.tableid
  and  idx_info.indexnumber = Indexes.indexnumber
  and  TVFields.tableid = Indexes.tableid
  and  TVFields.fieldid = Indexes.fieldid
  and  DBC.TVM.tvmid = Indexes.tableid
  and  DBC.Dbase.databaseid = TVM.databaseid
  and  Indexes.fieldposition = :pass;
);

/* Populate the work table. */

ins stats.idx_info
sel i.tableid
   ,i.indexnumber
   ,trim(f.fieldname)
from DBC.indexes  i
    ,DBC.tvfields f
    ,DBC.dbase    d
    ,DBC.tvm      t
where f.tableid = i.tableid
 and  f.fieldid = i.fieldid
 and  t.tvmid = i.tableid
 and  d.databaseid = t.databaseid
 and  i.UniqueFlag <> 'Y'
 and  i.fieldposition = 1
 and  i.indexstatistics is null
 and  d.databasenamei NOT IN ('dbc','crashdumps');

/* Determine N. */

sel max(i.fieldposition)
from DBC.indexes  i
    ,DBC.dbase    d
    ,DBC.tvm      t
where t.tvmid = i.tableid
 and  d.databaseid = t.databaseid
 and  d.databasenamei NOT IN ('dbc','crashdumps')
 and  i.UniqueFlag <> 'Y';

/* Run macro to add fields. */
/* Run multiple times.      */
/* Use arguments 2 to N.    */

exec stats.build_idx_info(2);

/* ADD ADDITIONAL MACRO EXECUTIONS HERE!!! */
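
/* For example, if the "Determine N" query above returned 4, */
/* the additional executions (hypothetical) would be:        */
/*   exec stats.build_idx_info(3);                           */
/*   exec stats.build_idx_info(4);                           */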

/* Generate grant statements. */

.width 254
.export report file=grant

sel 'grant drop table on '||
    trim(d.databasename)  ||
    ' to DBC;' (title'')
from stats.idx_info  idx
    ,DBC.tvm       t
    ,DBC.dbase     d
where idx.tvmid = t.tvmid
 and  t.databaseid = d.databaseid
group by 1
order by 1;

.export reset

/* Generate collect statistics statements. */

.export report file=collect

sel 'collect statistics '||
    trim(d.databasename) ||
    '.'                  ||
    trim(t.tvmname)      ||
    ' index('            ||
    trim(idx.NameText)   ||
    ');' (title'')
from stats.idx_info  idx
    ,DBC.tvm       t
    ,DBC.dbase     d
where idx.tvmid = t.tvmid
 and  t.databaseid = d.databaseid
order by d.databasename
        ,t.tvmname
        ,idx.indexnumber;

.export reset

/* Your work is done except for .running the files. */
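
For example (a sketch, assuming the exported files are in BTEQ's working directory and you are still logged on as DBC), the generated files can be run with:

.run file=grant
.run file=collect

Run the grant file first; the generated GRANT DROP TABLE statements supply the privilege that collecting statistics requires on these releases.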
  
============================================================
GSC Canary Scripts for Teradata Performance Update
============================================================

Condition/Symptom:
           
The Canary script has been replaced, starting with V2R4.1, by a built-in tool.

Here is some important information regarding canary scripts:

We were asked why a canary trigger at one customer site didn't fire even though the system was constantly busy or having problems.

There can be multiple reasons, but one thing you need to check is that the canary script is logging on under the right performance group.

When you (or the customer) install the canary script, you want the user to log on under the same performance group where the problem is occurring. The default is to use SYSTEMFE as the userid; the systemfe user usually runs under $M in the default partition.

If the customer has a lot of user partitions, the perflook will most likely not be triggered, since the default partition usually has the highest weights.

For this site, we found that the canary was logging on as $H. You can find this in canary.sh: go to the bottom of the script, or search for "logon", to find the userid. In this case the userid was CanaryUser.

# Part of canary.sh
#---------------------------------------------------------

bteq >$MONITOR_DIR/bteq.o 2> $MONITOR_DIR/bteq.e <<- [end]
.logon sysc/CanaryUser,hm53a2
.export file=/dev/null
SELECT * FROM DBC.DBCINFO;
.quit
[end]

Once you find the userid, you need to find the associated account string. Log on to the database and run the following SELECT statement (replace canaryuser with the userid you found in your script):

select accountname from DBC.DATABASES where databasename = 'canaryuser';
  
BTEQ -- Enter your DBC/SQL request or BTEQ command:
select accountname from dbc.databases where databasename = 'canaryuser';

*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.

AccountName
------------------------------
$H$|&I

For this userid, canaryuser, the user is logging on as $H. The user is therefore getting a very high priority and probably will not trigger the canary unless the system really is hanging.
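One possible fix (a sketch; the choice of $M, matching the default-partition priority mentioned above, is an assumption -- pick the performance group where the problem workload actually runs) is to change the canary user's default account string so that new logons no longer run at $H:

MODIFY USER canaryuser AS ACCOUNT = ('$M$|&I');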

As for where this account string information came from: it is in the view dbc.databases. If you run HELP VIEW dbc.databases, you will see quite a bit of information about what the view returns for each database or user.

Following is some information from the help.

help view dbc.databases;

*** Help information returned. 16 rows.
*** Total elapsed time was 1 second.

Column Name                    Type Comment
------------------------------ ---- ---------------------------------------
DatabaseName                   ?    The Databases.DatabaseName field identi
CreatorName                    ?    The Databases.CreatorName field identif
OwnerName                      ?    The Databases.OwnerName field identifie
AccountName                    ?    The Databases.AccountName field identif
ProtectionType                 ?    The Databases.ProtectionType field spec
JournalFlag                    ?    The Databases.JournalFlag field specifi
PermSpace                      ?    The Databases.PermSpace field specifies
SpoolSpace                     ?    The Databases.SpoolSpace field specifie
TempSpace                      ?    The Databases.TempSpace field specifies
CommentString                  ?    The Databases.CommentString field conta
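
To scan for other users or databases that would log on at high priority, the same view can be queried directly (a sketch using the account string pattern seen above):

select databasename, accountname
from dbc.databases
where accountname like '$H%';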


==============================================================
Teradata Query to Summarize Access Logging Data
==============================================================

Tech Note:

The following query provides a summary report on the Access Logging data stored in Teradata:

Select DT.ld (format 'mm/dd/yyyy', title 'Date')
     , DT.st (title '//Statement//Type')
     , DT.db (format 'X(13)', title 'DatabaseName')
     , DT.cnt (title 'Number//of//Statements')
FROM
  (Select LogDate
        , StatementType
        , DatabaseName
        , count(*)
   FROM dbc.accesslog
   WHERE StatementType not in ('select', 'Begin/End Logging', 'Show', 'Exec')
   GROUP BY 1,2,3) DT(ld, st, db, cnt)
GROUP BY 1,2,3,4
WITH sum (DT.cnt)(title 'Total') BY DT.ld
ORDER BY 1,3,2;

Here is a sample of the output:

                                                   Number
           Statement                                   of
Date       Type                 DatabaseName   Statements
---------- -------------------- ------------- -----------
10/29/2001 Create User          DBC                     1
                                              -----------
                                        Total           1

10/30/2001 Grant                dbc                     1
10/30/2001 Replace Macro        DBC                     6
10/30/2001 Replace View         DBC                     2
10/30/2001 Collect Statistics   dss_tables             97
10/30/2001 Create Index         dss_tables             41
10/30/2001 Create Table         dss_tables             12
10/30/2001 Delete               dss_tables             26
10/30/2001 Drop Index           dss_tables             15
10/30/2001 Drop Statistics      dss_tables              3
10/30/2001 Drop Table           dss_tables             12
10/30/2001 Help                 dss_tables             72
10/30/2001 Insert               dss_tables            114
10/30/2001 Rename Table         dss_tables              6
10/30/2001 Update               dss_tables            542
                                              -----------
                                        Total         949

