Home Explore Blog CI



postgresql

24th chunk of `doc/src/sgml/perform.sgml`
1e1e9a149d96fec1fa3fb5e2f50a0b1b91ad802ba18344620000000100000fa3
 produce
     an underestimate.
    </para>

    <sect4 id="planner-stats-extended-functional-deps-limits">
     <title>Limitations of Functional Dependencies</title>

     <para>
      Functional dependencies are currently only applied when considering
      simple equality conditions that compare columns to constant values,
      and <literal>IN</literal> clauses with constant values.
      They are not used to improve estimates for equality conditions
      comparing two columns or comparing a column to an expression, nor for
      range clauses, <literal>LIKE</literal> or any other type of condition.
     </para>

     <para>
      When estimating with functional dependencies, the planner assumes that
      conditions on the involved columns are compatible and hence redundant.
      If they are incompatible, the correct estimate would be zero rows, but
      that possibility is not considered.  For example, given a query like
<programlisting>
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
</programlisting>
      the planner will disregard the <structfield>city</structfield> clause as not
      changing the selectivity, which is correct.  However, it will make
      the same assumption about
<programlisting>
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
</programlisting>
      even though there will really be zero rows satisfying this query.
      Functional dependency statistics do not provide enough information
      to conclude that, however.
     </para>

     <para>
      In many practical situations, this assumption is usually satisfied;
      for example, there might be a GUI in the application that only allows
      selecting compatible city and ZIP code values to use in a query.
      But if that's not the case, functional dependencies may not be a viable
      option.
     </para>
    </sect4>
   </sect3>

   <sect3 id="planner-stats-extended-n-distinct-counts">
    <title>Multivariate N-Distinct Counts</title>

    <para>
     Single-column statistics store the number of distinct values in each
     column.  Estimates of the number of distinct values when combining more
     than one column (for example, for <literal>GROUP BY a, b</literal>) are
     frequently wrong when the planner only has single-column statistical
     data, causing it to select bad plans.
    </para>

    <para>
     To improve such estimates, <command>ANALYZE</command> can collect n-distinct
     statistics for groups of columns.  As before, it's impractical to do
     this for every possible column grouping, so data is collected only for
     those groups of columns appearing together in a statistics object
     defined with the <literal>ndistinct</literal> option.  Data will be collected
     for each possible combination of two or more columns from the set of
     listed columns.
    </para>

    <para>
     Continuing the previous example, the n-distinct counts in a
     table of ZIP codes might look like the following:
<programlisting>
CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;

ANALYZE zipcodes;

SELECT stxkeys AS k, stxdndistinct AS nd
  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  WHERE stxname = 'stts2';
-[ RECORD 1 ]------------------------------------------------------&zwsp;--
k  | 1 2 5
nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
(1 row)
</programlisting>
     This indicates that there are three combinations of columns that
     have 33178 distinct values: ZIP code and state; ZIP code and city;
     and ZIP code, city and state (the fact that they are all equal is
     expected given that ZIP code alone is unique in this table).  On the
     other hand, the combination of city and state has only 27435 distinct
     values.
    </para>

    <para>
     It's advisable to create <literal>ndistinct</literal> statistics objects only
     on combinations of columns that are actually used for grouping, and
     for

Title: Limitations of Functional Dependencies and N-Distinct Counts
Summary
Functional dependencies in PostgreSQL have limitations, such as only applying to simple equality conditions and not accounting for incompatible conditions, which can lead to incorrect estimates, and to address this, multivariate n-distinct counts can be collected using the ANALYZE command to improve estimates for grouping operations, but it's essential to create statistics objects only for relevant column combinations to avoid unnecessary overhead.