Multivariate Statistics

conditions are independent, and so it multiplies their selectivities, producing a final selectivity estimate of just 0.01%. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher. </para> <para> This problem can be fixed by creating a statistics object that directs <command>ANALYZE</command> to calculate functional-dependency multivariate statistics on the two columns: <programlisting> CREATE STATISTICS stts (dependencies) ON a, b FROM t; ANALYZE t; EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1; QUERY PLAN -------------------------------------------------------------------&zwsp;------------ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1) Filter: ((a = 1) AND (b = 1)) Rows Removed by Filter: 9900 </programlisting> </para> </sect2> <sect2 id="multivariate-ndistinct-counts"> <title>Multivariate N-Distinct Counts</title> <para> A similar problem occurs with estimation of the cardinality of sets of multiple columns, such as the number of groups that would be generated by a <command>GROUP BY</command> clause. When <command>GROUP BY</command> lists a single column, the n-distinct estimate (which is visible as the estimated number of rows returned by the HashAggregate node) is very accurate: <programlisting> EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a; QUERY PLAN -------------------------------------------------------------------&zwsp;---------------------- HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100.00 loops=1) Group Key: a -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000.00 loops=1) </programlisting> But without multivariate statistics, the estimate for the number of groups in a query with two columns in <command>GROUP BY</command>, as in the following example, is off by an order of magnitude: <programlisting> EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b; QUERY PLAN -------------------------------------------------------------------&zwsp;------------------------- HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100.00 loops=1) Group Key: a, b -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1) </programlisting> By redefining the statistics object to include n-distinct counts for the two columns, the estimate is much improved: <programlisting> DROP STATISTICS stts; CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t; ANALYZE t; EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b; QUERY PLAN -------------------------------------------------------------------&zwsp;------------------------- HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100.00 loops=1) Group Key: a, b -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1) </programlisting> </para> </sect2> <sect2 id="mcv-lists"> <title>MCV Lists</title> <para> As explained in <xref linkend="functional-dependencies"/>, functional dependencies are very cheap and efficient type of statistics, but their main limitation is their global nature (only tracking dependencies at the column level, not between individual column values). </para> <para> This section introduces multivariate variant of <acronym>MCV</acronym> (most-common values) lists, a straightforward extension of the per-column statistics described in <xref linkend="row-estimation-examples"/>. These statistics address the limitation by storing individual values, but it is naturally more expensive,

This section discusses the importance of multivariate statistics in improving the accuracy of query planning, particularly in cases where multiple columns are involved, such as in GROUP BY clauses or WHERE conditions with multiple columns. It highlights the limitations of traditional statistics and introduces new types of statistics, including functional dependencies, n-distinct counts, and MCV lists, which can provide more accurate estimates and better query planning.