familiar <literal>=</literal>, however the selectivity function is
obtained from the <structfield>oprjoin</structfield> column of
<structname>pg_operator</structname>, and is <function>eqjoinsel</function>.
<function>eqjoinsel</function> looks up the statistical information for both
<structname>tenk2</structname> and <structname>tenk1</structname>:
<programlisting>
SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
tablename | null_frac | n_distinct | most_common_vals
-----------+-----------+------------+------------------
tenk1 | 0 | -1 |
tenk2 | 0 | -1 |
</programlisting>
In this case there is no <acronym>MCV</acronym> information for
<structname>unique2</structname> and all the values appear to be
unique (n_distinct = -1), so we use an algorithm that relies on the row
count estimates for both relations (num_rows, not shown, but "tenk")
together with the column null fractions (zero for both):
<programlisting>
selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2)
= (1 - 0) * (1 - 0) / max(10000, 10000)
= 0.0001
</programlisting>
This is, subtract the null fraction from one for each of the relations,
and divide by the row count of the larger relation (this value does get
scaled in the non-unique case).
The number of rows
that the join is likely to emit is calculated as the cardinality of the
Cartesian product of the two inputs, multiplied by the
selectivity:
<programlisting>
rows = (outer_cardinality * inner_cardinality) * selectivity
= (50 * 10000) * 0.0001
= 50
</programlisting>
</para>
<para>
Had there been MCV lists for the two columns,
<function>eqjoinsel</function> would have used direct comparison of the MCV
lists to determine the join selectivity within the part of the column
populations represented by the MCVs. The estimate for the remainder of the
populations follows the same approach shown here.
</para>
<para>
Notice that we showed <literal>inner_cardinality</literal> as 10000, that is,
the unmodified size of <structname>tenk2</structname>. It might appear from
inspection of the <command>EXPLAIN</command> output that the estimate of
join rows comes from 50 * 1, that is, the number of outer rows times
the estimated number of rows obtained by each inner index scan on
<structname>tenk2</structname>. But this is not the case: the join relation size
is estimated before any particular join plan has been considered. If
everything is working well then the two ways of estimating the join
size will produce about the same answer, but due to round-off error and
other factors they sometimes diverge significantly.
</para>
<para>
For those interested in further details, estimation of the size of
a table (before any <literal>WHERE</literal> clauses) is done in
<filename>src/backend/optimizer/util/plancat.c</filename>. The generic
logic for clause selectivities is in
<filename>src/backend/optimizer/path/clausesel.c</filename>. The
operator-specific selectivity functions are mostly found
in <filename>src/backend/utils/adt/selfuncs.c</filename>.
</para>
</sect1>
<sect1 id="multivariate-statistics-examples">
<title>Multivariate Statistics Examples</title>
<indexterm>
<primary>row estimation</primary>
<secondary>multivariate</secondary>
</indexterm>
<sect2 id="functional-dependencies">
<title>Functional Dependencies</title>
<para>
Multivariate correlation can be demonstrated with a very simple data set
— a table with two columns, both containing the same values:
<programlisting>
CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;
</programlisting>
As explained in <xref linkend="planner-stats"/>, the planner can