Home Explore Blog CI



postgresql

5th chunk of `doc/src/sgml/ref/create_statistics.sgml`
0e506e8ae210ef7ac58df6560186b64b602f16d19484b7ea0000000100000c15
 planner also does not
   realize that the value of the date truncated to the month is fully
   determined by the value of the date truncated to the day. Then expression
   and ndistinct statistics are built on those two expressions:

<programlisting>
CREATE TABLE t3 (
    a   timestamp
);

INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
                                             '2020-12-31'::timestamp,
                                             '1 minute'::interval) s(i);

ANALYZE t3;

-- the number of matching rows will be drastically underestimated:
EXPLAIN ANALYZE SELECT * FROM t3
  WHERE date_trunc('month', a) = '2020-01-01'::timestamp;

EXPLAIN ANALYZE SELECT * FROM t3
  WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
                                 AND '2020-06-30'::timestamp;

EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
   FROM t3 GROUP BY 1, 2;

-- build ndistinct statistics on the pair of expressions (per-expression
-- statistics are built automatically)
CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3;

ANALYZE t3;

-- now the row count estimates are more accurate:
EXPLAIN ANALYZE SELECT * FROM t3
  WHERE date_trunc('month', a) = '2020-01-01'::timestamp;

EXPLAIN ANALYZE SELECT * FROM t3
  WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
                                 AND '2020-06-30'::timestamp;

EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
   FROM t3 GROUP BY 1, 2;
</programlisting>

   Without expression and ndistinct statistics, the planner has no information
   about the number of distinct values for the expressions, and has to rely
   on default estimates. The equality and range conditions are assumed to have
   0.5% selectivity, and the number of distinct values in the expression is
   assumed to be the same as for the column (i.e. unique). This results in a
   significant underestimate of the row count in the first two queries. Moreover,
   the planner has no information about the relationship between the expressions,
   so it assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
   conditions are independent, and multiplies their selectivities together to
   arrive at a severe overestimate of the group count in the aggregate query.
   This is further exacerbated by the lack of accurate statistics for the
   expressions, forcing the planner to use a default ndistinct estimate for the
   expression derived from ndistinct for the column. With such statistics, the
   planner recognizes that the conditions are correlated, and arrives at much
   more accurate estimates.
  </para>

 </refsect1>

 <refsect1>
  <title>Compatibility</title>

  <para>
   There is no <command>CREATE STATISTICS</command> command in the SQL standard.
  </para>
 </refsect1>

 <refsect1>
  <title>See Also</title>

  <simplelist type="inline">
   <member><xref linkend="sql-alterstatistics"/></member>
   <member><xref linkend="sql-dropstatistics"/></member>
  </simplelist>
 </refsect1>
</refentry>

Title: Timestamp Expression Statistics and Compatibility
Summary
This section elaborates on using expression and ndistinct statistics on timestamp columns. Without these statistics, the query planner underestimates the row count and makes incorrect assumptions about the relationship between date truncation expressions (month and day). Creating these statistics allows the planner to generate more accurate estimates. It also notes the absence of a CREATE STATISTICS command in the SQL standard and provides references to related commands (ALTER STATISTICS and DROP STATISTICS).