Home Explore Blog CI



postgresql

9th chunk of `doc/src/sgml/planstats.sgml`
65ae0a875982bd6ffaae5e3606caed2073b42dfe4305ca130000000100000fa0

-------------------------------------------------------------------&zwsp;-------------------------
 HashAggregate  (cost=220.00..221.00 rows=100 width=16) (actual rows=100.00 loops=1)
   Group Key: a, b
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
</programlisting>
   </para>

  </sect2>

  <sect2 id="mcv-lists">
   <title>MCV Lists</title>

   <para>
    As explained in <xref linkend="functional-dependencies"/>, functional
    dependencies are very cheap and efficient type of statistics, but their
    main limitation is their global nature (only tracking dependencies at
    the column level, not between individual column values).
   </para>

   <para>
    This section introduces multivariate variant of <acronym>MCV</acronym>
    (most-common values) lists, a straightforward extension of the per-column
    statistics described in <xref linkend="row-estimation-examples"/>.  These
    statistics address the limitation by storing individual values, but it is
    naturally more expensive, both in terms of building the statistics in
    <command>ANALYZE</command>, storage and planning time.
   </para>

   <para>
    Let's look at the query from <xref linkend="functional-dependencies"/>
    again, but this time with a <acronym>MCV</acronym> list created on the
    same set of columns (be sure to drop the functional dependencies, to
    make sure the planner uses the newly created statistics).

<programlisting>
DROP STATISTICS stts;
CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                   QUERY PLAN
-------------------------------------------------------------------&zwsp;------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
</programlisting>

    The estimate is as accurate as with the functional dependencies, mostly
    thanks to the table being fairly small and having a simple distribution
    with a low number of distinct values. Before looking at the second query,
    which was not handled by functional dependencies particularly well,
    let's inspect the <acronym>MCV</acronym> list a bit.
   </para>

   <para>
    Inspecting the <acronym>MCV</acronym> list is possible using
    <function>pg_mcv_list_items</function> set-returning function.

<programlisting>
SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
                pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2';
 index |  values  | nulls | frequency | base_frequency
-------+----------+-------+-----------+----------------
     0 | {0, 0}   | {f,f} |      0.01 |         0.0001
     1 | {1, 1}   | {f,f} |      0.01 |         0.0001
   ...
    49 | {49, 49} | {f,f} |      0.01 |         0.0001
    50 | {50, 50} | {f,f} |      0.01 |         0.0001
   ...
    97 | {97, 97} | {f,f} |      0.01 |         0.0001
    98 | {98, 98} | {f,f} |      0.01 |         0.0001
    99 | {99, 99} | {f,f} |      0.01 |         0.0001
(100 rows)
</programlisting>

    This confirms there are 100 distinct combinations in the two columns, and
    all of them are about equally likely (1% frequency for each one).  The
    base frequency is the frequency computed from per-column statistics, as if
    there were no multi-column statistics. Had there been any null values in
    either of the columns, this would be identified in the
    <structfield>nulls</structfield> column.
   </para>

   <para>
    When estimating the selectivity, the planner applies all the conditions
    on items in the <acronym>MCV</acronym> list, and then sums the frequencies
    of the matching ones. See <function>mcv_clauselist_selectivity</function>
    in <filename>src/backend/statistics/mcv.c</filename> for details.
   </para>

   <para>
    Compared to functional dependencies, <acronym>MCV</acronym> lists have two
   

Title: MCV Lists for Multivariate Statistics
Summary
This section discusses the use of Most-Common Values (MCV) lists for multivariate statistics, which address the limitation of functional dependencies by storing individual values and providing more accurate estimates. MCV lists are created using the CREATE STATISTICS command and can be inspected using the pg_mcv_list_items function. The planner uses MCV lists to estimate selectivity by applying conditions to items in the list and summing the frequencies of matching ones, providing more accurate results than functional dependencies in certain cases.