planner also does not
realize that the value of the date truncated to the month is fully
determined by the value of the date truncated to the day. Then expression
and ndistinct statistics are built on those two expressions:
<programlisting>
CREATE TABLE t3 (
a timestamp
);
INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
'2020-12-31'::timestamp,
'1 minute'::interval) s(i);
ANALYZE t3;
-- the number of matching rows will be drastically underestimated:
EXPLAIN ANALYZE SELECT * FROM t3
WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
EXPLAIN ANALYZE SELECT * FROM t3
WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
AND '2020-06-30'::timestamp;
EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
FROM t3 GROUP BY 1, 2;
-- build ndistinct statistics on the pair of expressions (per-expression
-- statistics are built automatically)
CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3;
ANALYZE t3;
-- now the row count estimates are more accurate:
EXPLAIN ANALYZE SELECT * FROM t3
WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
EXPLAIN ANALYZE SELECT * FROM t3
WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
AND '2020-06-30'::timestamp;
EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
FROM t3 GROUP BY 1, 2;
</programlisting>
Without expression and ndistinct statistics, the planner has no information
about the number of distinct values for the expressions, and has to rely
on default estimates. The equality and range conditions are assumed to have
0.5% selectivity, and the number of distinct values in the expression is
assumed to be the same as for the column (i.e. unique). This results in a
significant underestimate of the row count in the first two queries. Moreover,
the planner has no information about the relationship between the expressions,
so it assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
conditions are independent, and multiplies their selectivities together to
arrive at a severe overestimate of the group count in the aggregate query.
This is further exacerbated by the lack of accurate statistics for the
expressions, forcing the planner to use a default ndistinct estimate for the
expression derived from ndistinct for the column. With such statistics, the
planner recognizes that the conditions are correlated, and arrives at much
more accurate estimates.
</para>
</refsect1>
<refsect1>
<title>Compatibility</title>
<para>
There is no <command>CREATE STATISTICS</command> command in the SQL standard.
</para>
</refsect1>
<refsect1>
<title>See Also</title>
<simplelist type="inline">
<member><xref linkend="sql-alterstatistics"/></member>
<member><xref linkend="sql-dropstatistics"/></member>
</simplelist>
</refsect1>
</refentry>