Estimating Selectivity with MCVs, Histograms, and Multiple Conditions

<emphasis>the histogram does not include the portion of the column population represented by the MCVs</emphasis>. We do things this way because it allows more precise estimation. In this situation <function>scalarltsel</function> directly applies the condition (e.g., <quote>< 1000</quote>) to each value of the MCV list, and adds up the frequencies of the MCVs for which the condition is true. This gives an exact estimate of the selectivity within the portion of the table that is MCVs. The histogram is then used in the same way as above to estimate the selectivity in the portion of the table that is not MCVs, and then the two numbers are combined to estimate the overall selectivity. For example, consider <programlisting> EXPLAIN SELECT * FROM tenk1 WHERE stringu1 < 'IAAAAA'; QUERY PLAN ------------------------------------------------------------ Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244) Filter: (stringu1 < 'IAAAAA'::name) </programlisting> We already saw the MCV information for <structfield>stringu1</structfield>, and here is its histogram: <programlisting> SELECT histogram_bounds FROM pg_stats WHERE tablename='tenk1' AND attname='stringu1'; histogram_bounds -------------------------------------------------------------------&zwsp;------------- {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,&zwsp;XLAAAA,ZZAAAA} </programlisting> Checking the MCV list, we find that the condition <literal>stringu1 < 'IAAAAA'</literal> is satisfied by the first six entries and not the last four, so the selectivity within the MCV part of the population is <programlisting> selectivity = sum(relevant mvfs) = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 = 0.01833333 </programlisting> Summing all the MCFs also tells us that the total fraction of the population represented by MCVs is 0.03033333, and therefore the fraction represented by the histogram is 0.96966667 (again, there are no nulls, else we'd have to exclude them here). We can see that the value <literal>IAAAAA</literal> falls nearly at the end of the third histogram bucket. Using some rather cheesy assumptions about the frequency of different characters, the planner arrives at the estimate 0.298387 for the portion of the histogram population that is less than <literal>IAAAAA</literal>. We then combine the estimates for the MCV and non-MCV populations: <programlisting> selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction = 0.01833333 + 0.298387 * 0.96966667 = 0.307669 rows = 10000 * 0.307669 = 3077 (rounding off) </programlisting> In this particular example, the correction from the MCV list is fairly small, because the column distribution is actually quite flat (the statistics showing these particular values as being more common than others are mostly due to sampling error). In a more typical case where some values are significantly more common than others, this complicated process gives a useful improvement in accuracy because the selectivity for the most common values is found exactly. </para> <para> Now let's consider a case with more than one condition in the <literal>WHERE</literal> clause: <programlisting> EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx'; QUERY PLAN -------------------------------------------------------------------&zwsp;------------- Bitmap Heap Scan on tenk1 (cost=23.80..396.91 rows=1 width=244) Recheck Cond: (unique1 < 1000) Filter: (stringu1 = 'xxx'::name) -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0) Index Cond: (unique1 < 1000) </programlisting> The planner assumes that the two conditions are independent, so that the

This section explains how PostgreSQL estimates selectivity for queries with range conditions and multiple conditions in the WHERE clause, using most common values (MCVs), histograms, and assumptions of independence between conditions, including examples of how the planner combines estimates for MCV and non-MCV populations and handles conditions with low selectivity.