Home Explore Blog CI



postgresql

5th chunk of `doc/src/sgml/planstats.sgml`
6ec00ce5e6deeb9efed51f63ea875cff72518206848977240000000100000fa1
 being more common than
   others are mostly due to sampling error).  In a more typical case where
   some values are significantly more common than others, this complicated
   process gives a useful improvement in accuracy because the selectivity
   for the most common values is found exactly.
  </para>

  <para>
   Now let's consider a case with more than one
   condition in the <literal>WHERE</literal> clause:

<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000 AND stringu1 = 'xxx';

                                   QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------
 Bitmap Heap Scan on tenk1  (cost=23.80..396.91 rows=1 width=244)
   Recheck Cond: (unique1 &lt; 1000)
   Filter: (stringu1 = 'xxx'::name)
   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
         Index Cond: (unique1 &lt; 1000)
</programlisting>

   The planner assumes that the two conditions are independent, so that
   the individual selectivities of the clauses can be multiplied together:

<programlisting>
selectivity = selectivity(unique1 &lt; 1000) * selectivity(stringu1 = 'xxx')
            = 0.100697 * 0.0014559
            = 0.0001466

rows        = 10000 * 0.0001466
            = 1  (rounding off)
</programlisting>

   Notice that the number of rows estimated to be returned from the bitmap
   index scan reflects only the condition used with the index; this is
   important since it affects the cost estimate for the subsequent heap
   fetches.
  </para>

  <para>
   Finally we will examine a query that involves a join:

<programlisting>
EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 &lt; 50 AND t1.unique2 = t2.unique2;

                                      QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------------
 Nested Loop  (cost=4.64..456.23 rows=50 width=488)
   -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.64..142.17 rows=50 width=244)
         Recheck Cond: (unique1 &lt; 50)
         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.63 rows=50 width=0)
               Index Cond: (unique1 &lt; 50)
   -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..6.27 rows=1 width=244)
         Index Cond: (unique2 = t1.unique2)
</programlisting>

   The restriction on <structname>tenk1</structname>,
   <literal>unique1 &lt; 50</literal>,
   is evaluated before the nested-loop join.
   This is handled analogously to the previous range example.  This time the
   value 50 falls into the first bucket of the
   <structfield>unique1</structfield> histogram:

<programlisting>
selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
            = (0 + (50 - 0)/(993 - 0))/10
            = 0.005035

rows        = 10000 * 0.005035
            = 50  (rounding off)
</programlisting>

   The restriction for the join is <literal>t2.unique2 = t1.unique2</literal>.
   The operator is just
   our familiar <literal>=</literal>, however the selectivity function is
   obtained from the <structfield>oprjoin</structfield> column of
   <structname>pg_operator</structname>, and is <function>eqjoinsel</function>.
   <function>eqjoinsel</function> looks up the statistical information for both
   <structname>tenk2</structname> and <structname>tenk1</structname>:

<programlisting>
SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';

tablename  | null_frac | n_distinct | most_common_vals
-----------+-----------+------------+------------------
 tenk1     |         0 |         -1 |
 tenk2     |         0 |         -1 |
</programlisting>

   In this case there is no <acronym>MCV</acronym> information for
   <structname>unique2</structname> and all the values appear to be
   unique (n_distinct = -1), so we use an algorithm that relies on the row
   count estimates for both relations (num_rows, not shown, but "tenk")

Title: Estimating Selectivity with Multiple Conditions and Joins
Summary
This section explains how PostgreSQL estimates selectivity for queries with multiple conditions in the WHERE clause and joins, including how the planner assumes independence between conditions, multiplies individual selectivities, and uses statistical information from system catalogs like pg_stats to estimate join selectivity, with examples of query plans and selectivity calculations for range conditions and equality joins.