Estimating Selectivity with Multiple Conditions and Joins

being more common than others are mostly due to sampling error). In a more typical case where some values are significantly more common than others, this complicated process gives a useful improvement in accuracy because the selectivity for the most common values is found exactly. </para> <para> Now let's consider a case with more than one condition in the <literal>WHERE</literal> clause: <programlisting> EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx'; QUERY PLAN -------------------------------------------------------------------&zwsp;------------- Bitmap Heap Scan on tenk1 (cost=23.80..396.91 rows=1 width=244) Recheck Cond: (unique1 < 1000) Filter: (stringu1 = 'xxx'::name) -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0) Index Cond: (unique1 < 1000) </programlisting> The planner assumes that the two conditions are independent, so that the individual selectivities of the clauses can be multiplied together: <programlisting> selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx') = 0.100697 * 0.0014559 = 0.0001466 rows = 10000 * 0.0001466 = 1 (rounding off) </programlisting> Notice that the number of rows estimated to be returned from the bitmap index scan reflects only the condition used with the index; this is important since it affects the cost estimate for the subsequent heap fetches. </para> <para> Finally we will examine a query that involves a join: <programlisting> EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2; QUERY PLAN -------------------------------------------------------------------&zwsp;------------------- Nested Loop (cost=4.64..456.23 rows=50 width=488) -> Bitmap Heap Scan on tenk1 t1 (cost=4.64..142.17 rows=50 width=244) Recheck Cond: (unique1 < 50) -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.63 rows=50 width=0) Index Cond: (unique1 < 50) -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..6.27 rows=1 width=244) Index Cond: (unique2 = t1.unique2) </programlisting> The restriction on <structname>tenk1</structname>, <literal>unique1 < 50</literal>, is evaluated before the nested-loop join. This is handled analogously to the previous range example. This time the value 50 falls into the first bucket of the <structfield>unique1</structfield> histogram: <programlisting> selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets = (0 + (50 - 0)/(993 - 0))/10 = 0.005035 rows = 10000 * 0.005035 = 50 (rounding off) </programlisting> The restriction for the join is <literal>t2.unique2 = t1.unique2</literal>. The operator is just our familiar <literal>=</literal>, however the selectivity function is obtained from the <structfield>oprjoin</structfield> column of <structname>pg_operator</structname>, and is <function>eqjoinsel</function>. <function>eqjoinsel</function> looks up the statistical information for both <structname>tenk2</structname> and <structname>tenk1</structname>: <programlisting> SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2'; tablename | null_frac | n_distinct | most_common_vals -----------+-----------+------------+------------------ tenk1 | 0 | -1 | tenk2 | 0 | -1 | </programlisting> In this case there is no <acronym>MCV</acronym> information for <structname>unique2</structname> and all the values appear to be unique (n_distinct = -1), so we use an algorithm that relies on the row count estimates for both relations (num_rows, not shown, but "tenk")

This section explains how PostgreSQL estimates selectivity for queries with multiple conditions in the WHERE clause and joins, including how the planner assumes independence between conditions, multiplies individual selectivities, and uses statistical information from system catalogs like pg_stats to estimate join selectivity, with examples of query plans and selectivity calculations for range conditions and equality joins.