Home Explore Blog CI



postgresql

3rd chunk of `doc/src/sgml/planstats.sgml`
b60345b92f3871277af6fd5bd5062dcb332964191a7536320000000100000fa2
 null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';

null_frac         | 0
n_distinct        | 676
most_common_vals  | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,&zwsp;JOAAAA,MCAAAA,NAAAAA,WGAAAA}
most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,&zwsp;0.003,0.003,0.003,0.003}

</programlisting>

   Since <literal>CRAAAA</literal> appears in the list of MCVs, the selectivity is
   merely the corresponding entry in the list of most common frequencies
   (<acronym>MCF</acronym>s):

<programlisting>
selectivity = mcf[3]
            = 0.003
</programlisting>

   As before, the estimated number of rows is just the product of this with the
   cardinality of <structname>tenk1</structname>:

<programlisting>
rows = 10000 * 0.003
     = 30
</programlisting>
  </para>

  <para>
   Now consider the same query, but with a constant that is not in the
   <acronym>MCV</acronym> list:

<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';

                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=15 width=244)
   Filter: (stringu1 = 'xxx'::name)
</programlisting>

   This is quite a different problem: how to estimate the selectivity when the
   value is <emphasis>not</emphasis> in the <acronym>MCV</acronym> list.
   The approach is to use the fact that the value is not in the list,
   combined with the knowledge of the frequencies for all of the
   <acronym>MCV</acronym>s:

<programlisting>
selectivity = (1 - sum(mcv_freqs))/(num_distinct - num_mcv)
            = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
                    0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
            = 0.0014559
</programlisting>

   That is, add up all the frequencies for the <acronym>MCV</acronym>s and
   subtract them from one, then
   divide by the number of <emphasis>other</emphasis> distinct values.
   This amounts to assuming that the fraction of the column that is not any
   of the MCVs is evenly distributed among all the other distinct values.
   Notice that there are no null values so we don't have to worry about those
   (otherwise we'd subtract the null fraction from the numerator as well).
   The estimated number of rows is then calculated as usual:

<programlisting>
rows = 10000 * 0.0014559
     = 15  (rounding off)
</programlisting>
  </para>

  <para>
   The previous example with <literal>unique1 &lt; 1000</literal> was an
   oversimplification of what <function>scalarltsel</function> really does;
   now that we have seen an example of the use of MCVs, we can fill in some
   more detail.  The example was correct as far as it went, because since
   <structfield>unique1</structfield> is a unique column it has no MCVs (obviously, no
   value is any more common than any other value).  For a non-unique
   column, there will normally be both a histogram and an MCV list, and
   <emphasis>the histogram does not include the portion of the column
   population represented by the MCVs</emphasis>.  We do things this way because
   it allows more precise estimation.  In this situation
   <function>scalarltsel</function> directly applies the condition (e.g.,
   <quote>&lt; 1000</quote>) to each value of the MCV list, and adds up the
   frequencies of the MCVs for which the condition is true.  This gives
   an exact estimate of the selectivity within the portion of the table
   that is MCVs.  The histogram is then used in the same way as above
   to estimate the selectivity in the portion of the table that is not
   MCVs, and then the two numbers are combined to estimate the overall
   selectivity.  For example, consider

<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 &lt; 'IAAAAA';

                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=3077 width=244)
   Filter: (stringu1 &lt;

Title: Selectivity Estimation with MCVs and Histograms
Summary
This section describes how PostgreSQL estimates selectivity for queries with range conditions and equality conditions, using most common values (MCVs) and histograms, including cases where the value is in the MCV list, not in the MCV list, and how the estimator combines MCVs and histograms to estimate selectivity for non-unique columns.