MCV Lists for Multivariate Statistics

-------------------------------------------------------------------&zwsp;------------------------- HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100.00 loops=1) Group Key: a, b -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1) </programlisting> </para> </sect2> <sect2 id="mcv-lists"> <title>MCV Lists</title> <para> As explained in <xref linkend="functional-dependencies"/>, functional dependencies are very cheap and efficient type of statistics, but their main limitation is their global nature (only tracking dependencies at the column level, not between individual column values). </para> <para> This section introduces multivariate variant of <acronym>MCV</acronym> (most-common values) lists, a straightforward extension of the per-column statistics described in <xref linkend="row-estimation-examples"/>. These statistics address the limitation by storing individual values, but it is naturally more expensive, both in terms of building the statistics in <command>ANALYZE</command>, storage and planning time. </para> <para> Let's look at the query from <xref linkend="functional-dependencies"/> again, but this time with a <acronym>MCV</acronym> list created on the same set of columns (be sure to drop the functional dependencies, to make sure the planner uses the newly created statistics). <programlisting> DROP STATISTICS stts; CREATE STATISTICS stts2 (mcv) ON a, b FROM t; ANALYZE t; EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1; QUERY PLAN -------------------------------------------------------------------&zwsp;------------ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1) Filter: ((a = 1) AND (b = 1)) Rows Removed by Filter: 9900 </programlisting> The estimate is as accurate as with the functional dependencies, mostly thanks to the table being fairly small and having a simple distribution with a low number of distinct values. Before looking at the second query, which was not handled by functional dependencies particularly well, let's inspect the <acronym>MCV</acronym> list a bit. </para> <para> Inspecting the <acronym>MCV</acronym> list is possible using <function>pg_mcv_list_items</function> set-returning function. <programlisting> SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid), pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2'; index | values | nulls | frequency | base_frequency -------+----------+-------+-----------+---------------- 0 | {0, 0} | {f,f} | 0.01 | 0.0001 1 | {1, 1} | {f,f} | 0.01 | 0.0001 ... 49 | {49, 49} | {f,f} | 0.01 | 0.0001 50 | {50, 50} | {f,f} | 0.01 | 0.0001 ... 97 | {97, 97} | {f,f} | 0.01 | 0.0001 98 | {98, 98} | {f,f} | 0.01 | 0.0001 99 | {99, 99} | {f,f} | 0.01 | 0.0001 (100 rows) </programlisting> This confirms there are 100 distinct combinations in the two columns, and all of them are about equally likely (1% frequency for each one). The base frequency is the frequency computed from per-column statistics, as if there were no multi-column statistics. Had there been any null values in either of the columns, this would be identified in the <structfield>nulls</structfield> column. </para> <para> When estimating the selectivity, the planner applies all the conditions on items in the <acronym>MCV</acronym> list, and then sums the frequencies of the matching ones. See <function>mcv_clauselist_selectivity</function> in <filename>src/backend/statistics/mcv.c</filename> for details. </para> <para> Compared to functional dependencies, <acronym>MCV</acronym> lists have two

This section discusses the use of Most-Common Values (MCV) lists for multivariate statistics, which address the limitation of functional dependencies by storing individual values and providing more accurate estimates. MCV lists are created using the CREATE STATISTICS command and can be inspected using the pg_mcv_list_items function. The planner uses MCV lists to estimate selectivity by applying conditions to items in the list and summing the frequencies of matching ones, providing more accurate results than functional dependencies in certain cases.