Selectivity Estimation with MCVs and Histograms

null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats WHERE tablename='tenk1' AND attname='stringu1'; null_frac | 0 n_distinct | 676 most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,&zwsp;JOAAAA,MCAAAA,NAAAAA,WGAAAA} most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,&zwsp;0.003,0.003,0.003,0.003} </programlisting> Since <literal>CRAAAA</literal> appears in the list of MCVs, the selectivity is merely the corresponding entry in the list of most common frequencies (<acronym>MCF</acronym>s): <programlisting> selectivity = mcf[3] = 0.003 </programlisting> As before, the estimated number of rows is just the product of this with the cardinality of <structname>tenk1</structname>: <programlisting> rows = 10000 * 0.003 = 30 </programlisting> </para> <para> Now consider the same query, but with a constant that is not in the <acronym>MCV</acronym> list: <programlisting> EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx'; QUERY PLAN ---------------------------------------------------------- Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244) Filter: (stringu1 = 'xxx'::name) </programlisting> This is quite a different problem: how to estimate the selectivity when the value is <emphasis>not</emphasis> in the <acronym>MCV</acronym> list. The approach is to use the fact that the value is not in the list, combined with the knowledge of the frequencies for all of the <acronym>MCV</acronym>s: <programlisting> selectivity = (1 - sum(mcv_freqs))/(num_distinct - num_mcv) = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003))/(676 - 10) = 0.0014559 </programlisting> That is, add up all the frequencies for the <acronym>MCV</acronym>s and subtract them from one, then divide by the number of <emphasis>other</emphasis> distinct values. This amounts to assuming that the fraction of the column that is not any of the MCVs is evenly distributed among all the other distinct values. Notice that there are no null values so we don't have to worry about those (otherwise we'd subtract the null fraction from the numerator as well). The estimated number of rows is then calculated as usual: <programlisting> rows = 10000 * 0.0014559 = 15 (rounding off) </programlisting> </para> <para> The previous example with <literal>unique1 < 1000</literal> was an oversimplification of what <function>scalarltsel</function> really does; now that we have seen an example of the use of MCVs, we can fill in some more detail. The example was correct as far as it went, because since <structfield>unique1</structfield> is a unique column it has no MCVs (obviously, no value is any more common than any other value). For a non-unique column, there will normally be both a histogram and an MCV list, and <emphasis>the histogram does not include the portion of the column population represented by the MCVs</emphasis>. We do things this way because it allows more precise estimation. In this situation <function>scalarltsel</function> directly applies the condition (e.g., <quote>< 1000</quote>) to each value of the MCV list, and adds up the frequencies of the MCVs for which the condition is true. This gives an exact estimate of the selectivity within the portion of the table that is MCVs. The histogram is then used in the same way as above to estimate the selectivity in the portion of the table that is not MCVs, and then the two numbers are combined to estimate the overall selectivity. For example, consider <programlisting> EXPLAIN SELECT * FROM tenk1 WHERE stringu1 < 'IAAAAA'; QUERY PLAN ------------------------------------------------------------ Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244) Filter: (stringu1 <

This section describes how PostgreSQL estimates selectivity for queries with range conditions and equality conditions, using most common values (MCVs) and histograms, including cases where the value is in the MCV list, not in the MCV list, and how the estimator combines MCVs and histograms to estimate selectivity for non-unique columns.