Multivariate Statistics Examples

done in <filename>src/backend/optimizer/util/plancat.c</filename>. The generic logic for clause selectivities is in <filename>src/backend/optimizer/path/clausesel.c</filename>. The operator-specific selectivity functions are mostly found in <filename>src/backend/utils/adt/selfuncs.c</filename>. </para> </sect1> <sect1 id="multivariate-statistics-examples"> <title>Multivariate Statistics Examples</title> <indexterm> <primary>row estimation</primary> <secondary>multivariate</secondary> </indexterm> <sect2 id="functional-dependencies"> <title>Functional Dependencies</title> <para> Multivariate correlation can be demonstrated with a very simple data set — a table with two columns, both containing the same values: <programlisting> CREATE TABLE t (a INT, b INT); INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i); ANALYZE t; </programlisting> As explained in <xref linkend="planner-stats"/>, the planner can determine cardinality of <structname>t</structname> using the number of pages and rows obtained from <structname>pg_class</structname>: <programlisting> SELECT relpages, reltuples FROM pg_class WHERE relname = 't'; relpages | reltuples ----------+----------- 45 | 10000 </programlisting> The data distribution is very simple; there are only 100 distinct values in each column, uniformly distributed. </para> <para> The following example shows the result of estimating a <literal>WHERE</literal> condition on the <structfield>a</structfield> column: <programlisting> EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1; QUERY PLAN -------------------------------------------------------------------&zwsp;------------ Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100.00 loops=1) Filter: (a = 1) Rows Removed by Filter: 9900 </programlisting> The planner examines the condition and determines the selectivity of this clause to be 1%. By comparing this estimate and the actual number of rows, we see that the estimate is very accurate (in fact exact, as the table is very small). Changing the <literal>WHERE</literal> condition to use the <structfield>b</structfield> column, an identical plan is generated. But observe what happens if we apply the same condition on both columns, combining them with <literal>AND</literal>: <programlisting> EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1; QUERY PLAN -------------------------------------------------------------------&zwsp;---------- Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100.00 loops=1) Filter: ((a = 1) AND (b = 1)) Rows Removed by Filter: 9900 </programlisting> The planner estimates the selectivity for each condition individually, arriving at the same 1% estimates as above. Then it assumes that the conditions are independent, and so it multiplies their selectivities, producing a final selectivity estimate of just 0.01%. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher. </para> <para> This problem can be fixed by creating a statistics object that directs <command>ANALYZE</command> to calculate functional-dependency multivariate statistics on the two columns: <programlisting> CREATE STATISTICS stts (dependencies) ON a, b FROM t; ANALYZE t; EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1; QUERY PLAN -------------------------------------------------------------------&zwsp;------------ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1) Filter: ((a = 1) AND (b = 1)) Rows Removed by Filter: 9900 </programlisting> </para> </sect2> <sect2 id="multivariate-ndistinct-counts">

This section provides examples of multivariate statistics in PostgreSQL, including a demonstration of functional dependencies between two columns, and how creating a statistics object with functional-dependency multivariate statistics can improve the accuracy of selectivity estimates for conditions involving multiple columns, especially when the conditions are not independent.