conditions are independent, and so it multiplies their selectivities,
producing a final selectivity estimate of just 0.01%.
This is a significant underestimate, as the actual number of rows
matching the conditions (100) is two orders of magnitude higher.
</para>
<para>
This problem can be fixed by creating a statistics object that
directs <command>ANALYZE</command> to calculate functional-dependency
multivariate statistics on the two columns:
<programlisting>
CREATE STATISTICS stts (dependencies) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------
Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
</programlisting>
</para>
</sect2>
<sect2 id="multivariate-ndistinct-counts">
<title>Multivariate N-Distinct Counts</title>
<para>
A similar problem occurs with estimation of the cardinality of sets of
multiple columns, such as the number of groups that would be generated by
a <command>GROUP BY</command> clause. When <command>GROUP BY</command>
lists a single column, the n-distinct estimate (which is visible as the
estimated number of rows returned by the HashAggregate node) is very
accurate:
<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a;
QUERY PLAN
-------------------------------------------------------------------&zwsp;----------------------
HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100.00 loops=1)
Group Key: a
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000.00 loops=1)
</programlisting>
But without multivariate statistics, the estimate for the number of
groups in a query with two columns in <command>GROUP BY</command>, as
in the following example, is off by an order of magnitude:
<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------------------
HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100.00 loops=1)
Group Key: a, b
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
</programlisting>
By redefining the statistics object to include n-distinct counts for the
two columns, the estimate is much improved:
<programlisting>
DROP STATISTICS stts;
CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------------------
HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100.00 loops=1)
Group Key: a, b
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
</programlisting>
</para>
</sect2>
<sect2 id="mcv-lists">
<title>MCV Lists</title>
<para>
As explained in <xref linkend="functional-dependencies"/>, functional
dependencies are very cheap and efficient type of statistics, but their
main limitation is their global nature (only tracking dependencies at
the column level, not between individual column values).
</para>
<para>
This section introduces multivariate variant of <acronym>MCV</acronym>
(most-common values) lists, a straightforward extension of the per-column
statistics described in <xref linkend="row-estimation-examples"/>. These
statistics address the limitation by storing individual values, but it is
naturally more expensive,