Home Explore Blog CI



postgresql

8th chunk of `doc/src/sgml/planstats.sgml`
390d9ab098ccbb58b966f7920f86dde8ee8aaf22da4e91840000000100000fa9
 conditions are independent, and so it multiplies their selectivities,
    producing a final selectivity estimate of just 0.01%.
    This is a significant underestimate, as the actual number of rows
    matching the conditions (100) is two orders of magnitude higher.
   </para>

   <para>
    This problem can be fixed by creating a statistics object that
    directs <command>ANALYZE</command> to calculate functional-dependency
    multivariate statistics on the two columns:

<programlisting>
CREATE STATISTICS stts (dependencies) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                  QUERY PLAN
-------------------------------------------------------------------&zwsp;------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
</programlisting>
   </para>
  </sect2>

  <sect2 id="multivariate-ndistinct-counts">
   <title>Multivariate N-Distinct Counts</title>

   <para>
    A similar problem occurs with estimation of the cardinality of sets of
    multiple columns, such as the number of groups that would be generated by
    a <command>GROUP BY</command> clause.  When <command>GROUP BY</command>
    lists a single column, the n-distinct estimate (which is visible as the
    estimated number of rows returned by the HashAggregate node) is very
    accurate:
<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a;
                                       QUERY PLAN
-------------------------------------------------------------------&zwsp;----------------------
 HashAggregate  (cost=195.00..196.00 rows=100 width=12) (actual rows=100.00 loops=1)
   Group Key: a
   -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000.00 loops=1)
</programlisting>
    But without multivariate statistics, the estimate for the number of
    groups in a query with two columns in <command>GROUP BY</command>, as
    in the following example, is off by an order of magnitude:
<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
                                       QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------------------
 HashAggregate  (cost=220.00..230.00 rows=1000 width=16) (actual rows=100.00 loops=1)
   Group Key: a, b
   -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
</programlisting>
    By redefining the statistics object to include n-distinct counts for the
    two columns, the estimate is much improved:
<programlisting>
DROP STATISTICS stts;
CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
                                       QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------------------
 HashAggregate  (cost=220.00..221.00 rows=100 width=16) (actual rows=100.00 loops=1)
   Group Key: a, b
   -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
</programlisting>
   </para>

  </sect2>

  <sect2 id="mcv-lists">
   <title>MCV Lists</title>

   <para>
    As explained in <xref linkend="functional-dependencies"/>, functional
    dependencies are very cheap and efficient type of statistics, but their
    main limitation is their global nature (only tracking dependencies at
    the column level, not between individual column values).
   </para>

   <para>
    This section introduces multivariate variant of <acronym>MCV</acronym>
    (most-common values) lists, a straightforward extension of the per-column
    statistics described in <xref linkend="row-estimation-examples"/>.  These
    statistics address the limitation by storing individual values, but it is
    naturally more expensive,

Title: Multivariate Statistics
Summary
This section discusses the importance of multivariate statistics in improving the accuracy of query planning, particularly in cases where multiple columns are involved, such as in GROUP BY clauses or WHERE conditions with multiple columns. It highlights the limitations of traditional statistics and introduces new types of statistics, including functional dependencies, n-distinct counts, and MCV lists, which can provide more accurate estimates and better query planning.