Home Explore Blog CI



postgresql

25th chunk of `doc/src/sgml/perform.sgml`
59990654c03099b86f64b3a69f0c3ce19c5752314320b7b10000000100000fa0
 following:
<programlisting>
CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;

ANALYZE zipcodes;

SELECT stxkeys AS k, stxdndistinct AS nd
  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  WHERE stxname = 'stts2';
-[ RECORD 1 ]------------------------------------------------------&zwsp;--
k  | 1 2 5
nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
(1 row)
</programlisting>
     This indicates that there are three combinations of columns that
     have 33178 distinct values: ZIP code and state; ZIP code and city;
     and ZIP code, city and state (the fact that they are all equal is
     expected given that ZIP code alone is unique in this table).  On the
     other hand, the combination of city and state has only 27435 distinct
     values.
    </para>

    <para>
     It's advisable to create <literal>ndistinct</literal> statistics objects only
     on combinations of columns that are actually used for grouping, and
     for which misestimation of the number of groups is resulting in bad
     plans.  Otherwise, the <command>ANALYZE</command> cycles are just wasted.
    </para>
   </sect3>

   <sect3 id="planner-stats-extended-mcv-lists">
    <title>Multivariate MCV Lists</title>

    <para>
     Another type of statistic stored for each column are most-common value
     lists.  This allows very accurate estimates for individual columns, but
     may result in significant misestimates for queries with conditions on
     multiple columns.
    </para>

    <para>
     To improve such estimates, <command>ANALYZE</command> can collect MCV
     lists on combinations of columns.  Similarly to functional dependencies
     and n-distinct coefficients, it's impractical to do this for every
     possible column grouping.  Even more so in this case, as the MCV list
     (unlike functional dependencies and n-distinct coefficients) does store
     the common column values.  So data is collected only for those groups
     of columns appearing together in a statistics object defined with the
     <literal>mcv</literal> option.
    </para>

    <para>
     Continuing the previous example, the MCV list for a table of ZIP codes
     might look like the following (unlike for simpler types of statistics,
     a function is required for inspection of MCV contents):

<programlisting>
CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;

ANALYZE zipcodes;

SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
                pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';

 index |         values         | nulls | frequency | base_frequency
-------+------------------------+-------+-----------+----------------
     0 | {Washington, DC}       | {f,f} |  0.003467 |        2.7e-05
     1 | {Apo, AE}              | {f,f} |  0.003067 |        1.9e-05
     2 | {Houston, TX}          | {f,f} |  0.002167 |       0.000133
     3 | {El Paso, TX}          | {f,f} |     0.002 |       0.000113
     4 | {New York, NY}         | {f,f} |  0.001967 |       0.000114
     5 | {Atlanta, GA}          | {f,f} |  0.001633 |        3.3e-05
     6 | {Sacramento, CA}       | {f,f} |  0.001433 |        7.8e-05
     7 | {Miami, FL}            | {f,f} |    0.0014 |          6e-05
     8 | {Dallas, TX}           | {f,f} |  0.001367 |        8.8e-05
     9 | {Chicago, IL}          | {f,f} |  0.001333 |        5.1e-05
   ...
(99 rows)
</programlisting>
     This indicates that the most common combination of city and state is
     Washington in DC, with actual frequency (in the sample) about 0.35%.
     The base frequency of the combination (as computed from the simple
     per-column frequencies) is only 0.0027%, resulting in two orders of
     magnitude under-estimates.
    </para>

    <para>
     It's advisable to create <acronym>MCV</acronym> statistics objects only
     on combinations of columns that are actually used in conditions together,
     and for which misestimation of

Title: Multivariate Statistics for Improved Query Planning
Summary
PostgreSQL supports collecting multivariate statistics, such as n-distinct counts and most-common value (MCV) lists, to improve query planning estimates for conditions involving multiple columns, and it's recommended to create these statistics objects only for relevant column combinations to avoid unnecessary overhead and ensure accurate estimates.