null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';
null_frac | 0
n_distinct | 676
most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,&zwsp;JOAAAA,MCAAAA,NAAAAA,WGAAAA}
most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,&zwsp;0.003,0.003,0.003,0.003}
</programlisting>
Since <literal>CRAAAA</literal> appears in the list of MCVs, the selectivity is
merely the corresponding entry in the list of most common frequencies
(<acronym>MCF</acronym>s):
<programlisting>
selectivity = mcf[3]
= 0.003
</programlisting>
As before, the estimated number of rows is just the product of this with the
cardinality of <structname>tenk1</structname>:
<programlisting>
rows = 10000 * 0.003
= 30
</programlisting>
</para>
<para>
Now consider the same query, but with a constant that is not in the
<acronym>MCV</acronym> list:
<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
QUERY PLAN
----------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244)
Filter: (stringu1 = 'xxx'::name)
</programlisting>
This is quite a different problem: how to estimate the selectivity when the
value is <emphasis>not</emphasis> in the <acronym>MCV</acronym> list.
The approach is to use the fact that the value is not in the list,
combined with the knowledge of the frequencies for all of the
<acronym>MCV</acronym>s:
<programlisting>
selectivity = (1 - sum(mcv_freqs))/(num_distinct - num_mcv)
= (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
= 0.0014559
</programlisting>
That is, add up all the frequencies for the <acronym>MCV</acronym>s and
subtract them from one, then
divide by the number of <emphasis>other</emphasis> distinct values.
This amounts to assuming that the fraction of the column that is not any
of the MCVs is evenly distributed among all the other distinct values.
Notice that there are no null values so we don't have to worry about those
(otherwise we'd subtract the null fraction from the numerator as well).
The estimated number of rows is then calculated as usual:
<programlisting>
rows = 10000 * 0.0014559
= 15 (rounding off)
</programlisting>
</para>
<para>
The previous example with <literal>unique1 < 1000</literal> was an
oversimplification of what <function>scalarltsel</function> really does;
now that we have seen an example of the use of MCVs, we can fill in some
more detail. The example was correct as far as it went, because since
<structfield>unique1</structfield> is a unique column it has no MCVs (obviously, no
value is any more common than any other value). For a non-unique
column, there will normally be both a histogram and an MCV list, and
<emphasis>the histogram does not include the portion of the column
population represented by the MCVs</emphasis>. We do things this way because
it allows more precise estimation. In this situation
<function>scalarltsel</function> directly applies the condition (e.g.,
<quote>< 1000</quote>) to each value of the MCV list, and adds up the
frequencies of the MCVs for which the condition is true. This gives
an exact estimate of the selectivity within the portion of the table
that is MCVs. The histogram is then used in the same way as above
to estimate the selectivity in the portion of the table that is not
MCVs, and then the two numbers are combined to estimate the overall
selectivity. For example, consider
<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 < 'IAAAAA';
QUERY PLAN
------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244)
Filter: (stringu1 <