Home Explore Blog CI



postgresql

6th chunk of `doc/src/sgml/planstats.sgml`
235567cb6a7e3e32e5d1fb5ec339bdfd2da0003d6ad746600000000100000fa2
 familiar <literal>=</literal>, however the selectivity function is
   obtained from the <structfield>oprjoin</structfield> column of
   <structname>pg_operator</structname>, and is <function>eqjoinsel</function>.
   <function>eqjoinsel</function> looks up the statistical information for both
   <structname>tenk2</structname> and <structname>tenk1</structname>:

<programlisting>
SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';

tablename  | null_frac | n_distinct | most_common_vals
-----------+-----------+------------+------------------
 tenk1     |         0 |         -1 |
 tenk2     |         0 |         -1 |
</programlisting>

   In this case there is no <acronym>MCV</acronym> information for
   <structname>unique2</structname> and all the values appear to be
   unique (n_distinct = -1), so we use an algorithm that relies on the row
   count estimates for both relations (num_rows, not shown, but "tenk")
   together with the column null fractions (zero for both):

<programlisting>
selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2)
            = (1 - 0) * (1 - 0) / max(10000, 10000)
            = 0.0001
</programlisting>

   This is, subtract the null fraction from one for each of the relations,
   and divide by the row count of the larger relation (this value does get
   scaled in the non-unique case).
   The number of rows
   that the join is likely to emit is calculated as the cardinality of the
   Cartesian product of the two inputs, multiplied by the
   selectivity:

<programlisting>
rows = (outer_cardinality * inner_cardinality) * selectivity
     = (50 * 10000) * 0.0001
     = 50
</programlisting>
  </para>

  <para>
   Had there been MCV lists for the two columns,
   <function>eqjoinsel</function> would have used direct comparison of the MCV
   lists to determine the join selectivity within the part of the column
   populations represented by the MCVs.  The estimate for the remainder of the
   populations follows the same approach shown here.
  </para>

  <para>
   Notice that we showed <literal>inner_cardinality</literal> as 10000, that is,
   the unmodified size of <structname>tenk2</structname>.  It might appear from
   inspection of the <command>EXPLAIN</command> output that the estimate of
   join rows comes from 50 * 1, that is, the number of outer rows times
   the estimated number of rows obtained by each inner index scan on
   <structname>tenk2</structname>.  But this is not the case: the join relation size
   is estimated before any particular join plan has been considered.  If
   everything is working well then the two ways of estimating the join
   size will produce about the same answer, but due to round-off error and
   other factors they sometimes diverge significantly.
  </para>

  <para>
   For those interested in further details, estimation of the size of
   a table (before any <literal>WHERE</literal> clauses) is done in
   <filename>src/backend/optimizer/util/plancat.c</filename>. The generic
   logic for clause selectivities is in
   <filename>src/backend/optimizer/path/clausesel.c</filename>.  The
   operator-specific selectivity functions are mostly found
   in <filename>src/backend/utils/adt/selfuncs.c</filename>.
  </para>
 </sect1>

 <sect1 id="multivariate-statistics-examples">
  <title>Multivariate Statistics Examples</title>

  <indexterm>
   <primary>row estimation</primary>
   <secondary>multivariate</secondary>
  </indexterm>

  <sect2 id="functional-dependencies">
   <title>Functional Dependencies</title>

   <para>
    Multivariate correlation can be demonstrated with a very simple data set
    &mdash; a table with two columns, both containing the same values:

<programlisting>
CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;
</programlisting>

    As explained in <xref linkend="planner-stats"/>, the planner can

Title: Estimating Join Selectivity and Cardinality
Summary
This section discusses how PostgreSQL estimates the selectivity and cardinality of joins, including the use of statistical information from system catalogs, calculation of selectivity based on null fractions and row counts, and estimation of join rows using the cardinality of the Cartesian product of the two inputs multiplied by the selectivity, with examples and notes on the implementation details and potential discrepancies between different estimation methods.