correlated columns
(containing identical data), and an MCV list on those columns:
<programlisting>
CREATE TABLE t2 (
a int,
b int
);
INSERT INTO t2 SELECT mod(i,100), mod(i,100)
FROM generate_series(1,1000000) s(i);
CREATE STATISTICS s2 (mcv) ON a, b FROM t2;
ANALYZE t2;
-- valid combination (found in MCV)
EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
-- invalid combination (not found in MCV)
EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
</programlisting>
The MCV list gives the planner more detailed information about the
specific values that commonly appear in the table, as well as an upper
bound on the selectivities of combinations of values that do not appear
in the table, allowing it to generate better estimates in both cases.
</para>
<para>
Create table <structname>t3</structname> with a single timestamp column,
and run queries using expressions on that column. Without extended
statistics, the planner has no information about the data distribution for
the expressions, and uses default estimates. The planner also does not
realize that the value of the date truncated to the month is fully
determined by the value of the date truncated to the day. Then expression
and ndistinct statistics are built on those two expressions:
<programlisting>
CREATE TABLE t3 (
a timestamp
);
INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
'2020-12-31'::timestamp,
'1 minute'::interval) s(i);
ANALYZE t3;
-- the number of matching rows will be drastically underestimated:
EXPLAIN ANALYZE SELECT * FROM t3
WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
EXPLAIN ANALYZE SELECT * FROM t3
WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
AND '2020-06-30'::timestamp;
EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
FROM t3 GROUP BY 1, 2;
-- build ndistinct statistics on the pair of expressions (per-expression