Home Explore Blog CI



postgresql

7th chunk of `doc/src/sgml/planstats.sgml`
1067a7ed350e21c59c4f5bf9e09ca660ec115bc9e6d0a8900000000100000fbf
 done in
   <filename>src/backend/optimizer/util/plancat.c</filename>. The generic
   logic for clause selectivities is in
   <filename>src/backend/optimizer/path/clausesel.c</filename>.  The
   operator-specific selectivity functions are mostly found
   in <filename>src/backend/utils/adt/selfuncs.c</filename>.
  </para>
 </sect1>

 <sect1 id="multivariate-statistics-examples">
  <title>Multivariate Statistics Examples</title>

  <indexterm>
   <primary>row estimation</primary>
   <secondary>multivariate</secondary>
  </indexterm>

  <sect2 id="functional-dependencies">
   <title>Functional Dependencies</title>

   <para>
    Multivariate correlation can be demonstrated with a very simple data set
    &mdash; a table with two columns, both containing the same values:

<programlisting>
CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;
</programlisting>

    As explained in <xref linkend="planner-stats"/>, the planner can determine
    cardinality of <structname>t</structname> using the number of pages and
    rows obtained from <structname>pg_class</structname>:

<programlisting>
SELECT relpages, reltuples FROM pg_class WHERE relname = 't';

 relpages | reltuples
----------+-----------
       45 |     10000
</programlisting>

    The data distribution is very simple; there are only 100 distinct values
    in each column, uniformly distributed.
   </para>

   <para>
    The following example shows the result of estimating a <literal>WHERE</literal>
    condition on the <structfield>a</structfield> column:

<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1;
                                 QUERY PLAN
-------------------------------------------------------------------&zwsp;------------
 Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual rows=100.00 loops=1)
   Filter: (a = 1)
   Rows Removed by Filter: 9900
</programlisting>

    The planner examines the condition and determines the selectivity
    of this clause to be 1%.  By comparing this estimate and the actual
    number of rows, we see that the estimate is very accurate
    (in fact exact, as the table is very small).  Changing the
    <literal>WHERE</literal> condition to use the <structfield>b</structfield> column, an
    identical plan is generated.  But observe what happens if we apply the same
    condition on both columns, combining them with <literal>AND</literal>:

<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                 QUERY PLAN
-------------------------------------------------------------------&zwsp;----------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=100.00 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
</programlisting>

    The planner estimates the selectivity for each condition individually,
    arriving at the same 1% estimates as above.  Then it assumes that the
    conditions are independent, and so it multiplies their selectivities,
    producing a final selectivity estimate of just 0.01%.
    This is a significant underestimate, as the actual number of rows
    matching the conditions (100) is two orders of magnitude higher.
   </para>

   <para>
    This problem can be fixed by creating a statistics object that
    directs <command>ANALYZE</command> to calculate functional-dependency
    multivariate statistics on the two columns:

<programlisting>
CREATE STATISTICS stts (dependencies) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                  QUERY PLAN
-------------------------------------------------------------------&zwsp;------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
</programlisting>
   </para>
  </sect2>

  <sect2 id="multivariate-ndistinct-counts">

Title: Multivariate Statistics Examples
Summary
This section provides examples of multivariate statistics in PostgreSQL, including a demonstration of functional dependencies between two columns, and how creating a statistics object with functional-dependency multivariate statistics can improve the accuracy of selectivity estimates for conditions involving multiple columns, especially when the conditions are not independent.