produce
an underestimate.
</para>
<sect4 id="planner-stats-extended-functional-deps-limits">
<title>Limitations of Functional Dependencies</title>
<para>
Functional dependencies are currently only applied when considering
simple equality conditions that compare columns to constant values,
and <literal>IN</literal> clauses with constant values.
They are not used to improve estimates for equality conditions
comparing two columns or comparing a column to an expression, nor for
range clauses, <literal>LIKE</literal> or any other type of condition.
</para>
<para>
When estimating with functional dependencies, the planner assumes that
conditions on the involved columns are compatible and hence redundant.
If they are incompatible, the correct estimate would be zero rows, but
that possibility is not considered. For example, given a query like
<programlisting>
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
</programlisting>
the planner will disregard the <structfield>city</structfield> clause as not
changing the selectivity, which is correct. However, it will make
the same assumption about
<programlisting>
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
</programlisting>
even though there will really be zero rows satisfying this query.
Functional dependency statistics do not provide enough information
to conclude that, however.
</para>
<para>
In many practical situations, this assumption is usually satisfied;
for example, there might be a GUI in the application that only allows
selecting compatible city and ZIP code values to use in a query.
But if that's not the case, functional dependencies may not be a viable
option.
</para>
</sect4>
</sect3>
<sect3 id="planner-stats-extended-n-distinct-counts">
<title>Multivariate N-Distinct Counts</title>
<para>
Single-column statistics store the number of distinct values in each
column. Estimates of the number of distinct values when combining more
than one column (for example, for <literal>GROUP BY a, b</literal>) are
frequently wrong when the planner only has single-column statistical
data, causing it to select bad plans.
</para>
<para>
To improve such estimates, <command>ANALYZE</command> can collect n-distinct
statistics for groups of columns. As before, it's impractical to do
this for every possible column grouping, so data is collected only for
those groups of columns appearing together in a statistics object
defined with the <literal>ndistinct</literal> option. Data will be collected
for each possible combination of two or more columns from the set of
listed columns.
</para>
<para>
Continuing the previous example, the n-distinct counts in a
table of ZIP codes might look like the following:
<programlisting>
CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
ANALYZE zipcodes;
SELECT stxkeys AS k, stxdndistinct AS nd
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stts2';
-[ RECORD 1 ]------------------------------------------------------&zwsp;--
k | 1 2 5
nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
(1 row)
</programlisting>
This indicates that there are three combinations of columns that
have 33178 distinct values: ZIP code and state; ZIP code and city;
and ZIP code, city and state (the fact that they are all equal is
expected given that ZIP code alone is unique in this table). On the
other hand, the combination of city and state has only 27435 distinct
values.
</para>
<para>
It's advisable to create <literal>ndistinct</literal> statistics objects only
on combinations of columns that are actually used for grouping, and
for