Timestamp Expression Statistics and Compatibility

planner also does not realize that the value of the date truncated to the month is fully determined by the value of the date truncated to the day. Then expression and ndistinct statistics are built on those two expressions: <programlisting> CREATE TABLE t3 ( a timestamp ); INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp, '2020-12-31'::timestamp, '1 minute'::interval) s(i); ANALYZE t3; -- the number of matching rows will be drastically underestimated: EXPLAIN ANALYZE SELECT * FROM t3 WHERE date_trunc('month', a) = '2020-01-01'::timestamp; EXPLAIN ANALYZE SELECT * FROM t3 WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp AND '2020-06-30'::timestamp; EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2; -- build ndistinct statistics on the pair of expressions (per-expression -- statistics are built automatically) CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3; ANALYZE t3; -- now the row count estimates are more accurate: EXPLAIN ANALYZE SELECT * FROM t3 WHERE date_trunc('month', a) = '2020-01-01'::timestamp; EXPLAIN ANALYZE SELECT * FROM t3 WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp AND '2020-06-30'::timestamp; EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2; </programlisting> Without expression and ndistinct statistics, the planner has no information about the number of distinct values for the expressions, and has to rely on default estimates. The equality and range conditions are assumed to have 0.5% selectivity, and the number of distinct values in the expression is assumed to be the same as for the column (i.e. unique). This results in a significant underestimate of the row count in the first two queries. Moreover, the planner has no information about the relationship between the expressions, so it assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal> conditions are independent, and multiplies their selectivities together to arrive at a severe overestimate of the group count in the aggregate query. This is further exacerbated by the lack of accurate statistics for the expressions, forcing the planner to use a default ndistinct estimate for the expression derived from ndistinct for the column. With such statistics, the planner recognizes that the conditions are correlated, and arrives at much more accurate estimates. </para> </refsect1> <refsect1> <title>Compatibility</title> <para> There is no <command>CREATE STATISTICS</command> command in the SQL standard. </para> </refsect1> <refsect1> <title>See Also</title> <simplelist type="inline"> <member><xref linkend="sql-alterstatistics"/></member> <member><xref linkend="sql-dropstatistics"/></member> </simplelist> </refsect1> </refentry>

This section elaborates on using expression and ndistinct statistics on timestamp columns. Without these statistics, the query planner underestimates the row count and makes incorrect assumptions about the relationship between date truncation expressions (month and day). Creating these statistics allows the planner to generate more accurate estimates. It also notes the absence of a CREATE STATISTICS command in the SQL standard and provides references to related commands (ALTER STATISTICS and DROP STATISTICS).