Home Explore Blog CI



postgresql

4th chunk of `doc/src/sgml/planstats.sgml`
e56ee361681ff207e30c001a0d8e3a640e1241fdfc53c3450000000100000fa0
 <emphasis>the histogram does not include the portion of the column
   population represented by the MCVs</emphasis>.  We do things this way because
   it allows more precise estimation.  In this situation
   <function>scalarltsel</function> directly applies the condition (e.g.,
   <quote>&lt; 1000</quote>) to each value of the MCV list, and adds up the
   frequencies of the MCVs for which the condition is true.  This gives
   an exact estimate of the selectivity within the portion of the table
   that is MCVs.  The histogram is then used in the same way as above
   to estimate the selectivity in the portion of the table that is not
   MCVs, and then the two numbers are combined to estimate the overall
   selectivity.  For example, consider

<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 &lt; 'IAAAAA';

                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=3077 width=244)
   Filter: (stringu1 &lt; 'IAAAAA'::name)
</programlisting>

   We already saw the MCV information for <structfield>stringu1</structfield>,
   and here is its histogram:

<programlisting>
SELECT histogram_bounds FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';

                                histogram_bounds
-------------------------------------------------------------------&zwsp;-------------
 {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,&zwsp;XLAAAA,ZZAAAA}
</programlisting>

   Checking the MCV list, we find that the condition <literal>stringu1 &lt;
   'IAAAAA'</literal> is satisfied by the first six entries and not the last four,
   so the selectivity within the MCV part of the population is

<programlisting>
selectivity = sum(relevant mvfs)
            = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
            = 0.01833333
</programlisting>

   Summing all the MCFs also tells us that the total fraction of the
   population represented by MCVs is 0.03033333, and therefore the
   fraction represented by the histogram is 0.96966667 (again, there
   are no nulls, else we'd have to exclude them here).  We can see
   that the value <literal>IAAAAA</literal> falls nearly at the end of the
   third histogram bucket.  Using some rather cheesy assumptions
   about the frequency of different characters, the planner arrives
   at the estimate 0.298387 for the portion of the histogram population
   that is less than <literal>IAAAAA</literal>.  We then combine the estimates
   for the MCV and non-MCV populations:

<programlisting>
selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
            = 0.01833333 + 0.298387 * 0.96966667
            = 0.307669

rows        = 10000 * 0.307669
            = 3077  (rounding off)
</programlisting>

   In this particular example, the correction from the MCV list is fairly
   small, because the column distribution is actually quite flat (the
   statistics showing these particular values as being more common than
   others are mostly due to sampling error).  In a more typical case where
   some values are significantly more common than others, this complicated
   process gives a useful improvement in accuracy because the selectivity
   for the most common values is found exactly.
  </para>

  <para>
   Now let's consider a case with more than one
   condition in the <literal>WHERE</literal> clause:

<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000 AND stringu1 = 'xxx';

                                   QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------
 Bitmap Heap Scan on tenk1  (cost=23.80..396.91 rows=1 width=244)
   Recheck Cond: (unique1 &lt; 1000)
   Filter: (stringu1 = 'xxx'::name)
   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
         Index Cond: (unique1 &lt; 1000)
</programlisting>

   The planner assumes that the two conditions are independent, so that
   the

Title: Estimating Selectivity with MCVs, Histograms, and Multiple Conditions
Summary
This section explains how PostgreSQL estimates selectivity for queries with range conditions and multiple conditions in the WHERE clause, using most common values (MCVs), histograms, and assumptions of independence between conditions, including examples of how the planner combines estimates for MCV and non-MCV populations and handles conditions with low selectivity.