functionally dependent on
column <structfield>a</structfield> if knowledge of the value of
<structfield>a</structfield> is sufficient to determine the value
of <structfield>b</structfield>, that is there are no two rows having the same value
of <structfield>a</structfield> but different values of <structfield>b</structfield>.
In a fully normalized database, functional dependencies should exist
only on primary keys and superkeys. However, in practice many data sets
are not fully normalized for various reasons; intentional
denormalization for performance reasons is a common example.
Even in a fully normalized database, there may be partial correlation
between some columns, which can be expressed as partial functional
dependency.
</para>
<para>
The existence of functional dependencies directly affects the accuracy
of estimates in certain queries. If a query contains conditions on
both the independent and the dependent column(s), the
conditions on the dependent columns do not further reduce the result
size; but without knowledge of the functional dependency, the query
planner will assume that the conditions are independent, resulting
in underestimating the result size.
</para>
<para>
To inform the planner about functional dependencies, <command>ANALYZE</command>
can collect measurements of cross-column dependency. Assessing the
degree of dependency between all sets of columns would be prohibitively
expensive, so data collection is limited to those groups of columns
appearing together in a statistics object defined with
the <literal>dependencies</literal> option. It is advisable to create
<literal>dependencies</literal> statistics only for column groups that are
strongly correlated, to avoid unnecessary overhead in both
<command>ANALYZE</command> and later query planning.
</para>
<para>
Here is an example of collecting functional-dependency statistics:
<programlisting>
CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
ANALYZE zipcodes;
SELECT stxname, stxkeys, stxddependencies
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stts';
stxname | stxkeys | stxddependencies
---------+---------+------------------------------------------
stts | 1 5 | {"1 => 5": 1.000000, "5 => 1": 0.423130}
(1 row)
</programlisting>
Here it can be seen that column 1 (zip code) fully determines column
5 (city) so the coefficient is 1.0, while city only determines zip code
about 42% of the time, meaning that there are many cities (58%) that are
represented by more than a single ZIP code.
</para>
<para>
When computing the selectivity for a query involving functionally
dependent columns, the planner adjusts the per-condition selectivity
estimates using the dependency coefficients so as not to produce
an underestimate.
</para>
<sect4 id="planner-stats-extended-functional-deps-limits">
<title>Limitations of Functional Dependencies</title>
<para>
Functional dependencies are currently only applied when considering
simple equality conditions that compare columns to constant values,
and <literal>IN</literal> clauses with constant values.
They are not used to improve estimates for equality conditions
comparing two columns or comparing a column to an expression, nor for
range clauses, <literal>LIKE</literal> or any other type of condition.
</para>
<para>
When estimating with functional dependencies, the planner assumes that
conditions on the involved columns are compatible and hence redundant.
If they are incompatible, the correct estimate would be zero rows, but
that possibility is not considered. For example, given a query like
<programlisting>
SELECT * FROM zipcodes WHERE city = 'San