Estimating Join Selectivity and Cardinality

familiar <literal>=</literal>, however the selectivity function is obtained from the <structfield>oprjoin</structfield> column of <structname>pg_operator</structname>, and is <function>eqjoinsel</function>. <function>eqjoinsel</function> looks up the statistical information for both <structname>tenk2</structname> and <structname>tenk1</structname>: <programlisting> SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2'; tablename | null_frac | n_distinct | most_common_vals -----------+-----------+------------+------------------ tenk1 | 0 | -1 | tenk2 | 0 | -1 | </programlisting> In this case there is no <acronym>MCV</acronym> information for <structname>unique2</structname> and all the values appear to be unique (n_distinct = -1), so we use an algorithm that relies on the row count estimates for both relations (num_rows, not shown, but "tenk") together with the column null fractions (zero for both): <programlisting> selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2) = (1 - 0) * (1 - 0) / max(10000, 10000) = 0.0001 </programlisting> This is, subtract the null fraction from one for each of the relations, and divide by the row count of the larger relation (this value does get scaled in the non-unique case). The number of rows that the join is likely to emit is calculated as the cardinality of the Cartesian product of the two inputs, multiplied by the selectivity: <programlisting> rows = (outer_cardinality * inner_cardinality) * selectivity = (50 * 10000) * 0.0001 = 50 </programlisting> </para> <para> Had there been MCV lists for the two columns, <function>eqjoinsel</function> would have used direct comparison of the MCV lists to determine the join selectivity within the part of the column populations represented by the MCVs. The estimate for the remainder of the populations follows the same approach shown here. </para> <para> Notice that we showed <literal>inner_cardinality</literal> as 10000, that is, the unmodified size of <structname>tenk2</structname>. It might appear from inspection of the <command>EXPLAIN</command> output that the estimate of join rows comes from 50 * 1, that is, the number of outer rows times the estimated number of rows obtained by each inner index scan on <structname>tenk2</structname>. But this is not the case: the join relation size is estimated before any particular join plan has been considered. If everything is working well then the two ways of estimating the join size will produce about the same answer, but due to round-off error and other factors they sometimes diverge significantly. </para> <para> For those interested in further details, estimation of the size of a table (before any <literal>WHERE</literal> clauses) is done in <filename>src/backend/optimizer/util/plancat.c</filename>. The generic logic for clause selectivities is in <filename>src/backend/optimizer/path/clausesel.c</filename>. The operator-specific selectivity functions are mostly found in <filename>src/backend/utils/adt/selfuncs.c</filename>. </para> </sect1> <sect1 id="multivariate-statistics-examples"> <title>Multivariate Statistics Examples</title> <indexterm> <primary>row estimation</primary> <secondary>multivariate</secondary> </indexterm> <sect2 id="functional-dependencies"> <title>Functional Dependencies</title> <para> Multivariate correlation can be demonstrated with a very simple data set — a table with two columns, both containing the same values: <programlisting> CREATE TABLE t (a INT, b INT); INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i); ANALYZE t; </programlisting> As explained in <xref linkend="planner-stats"/>, the planner can

This section discusses how PostgreSQL estimates the selectivity and cardinality of joins, including the use of statistical information from system catalogs, calculation of selectivity based on null fractions and row counts, and estimation of join rows using the cardinality of the Cartesian product of the two inputs multiplied by the selectivity, with examples and notes on the implementation details and potential discrepancies between different estimation methods.