Functional Dependencies in PostgreSQL

functionally dependent on column <structfield>a</structfield> if knowledge of the value of <structfield>a</structfield> is sufficient to determine the value of <structfield>b</structfield>, that is there are no two rows having the same value of <structfield>a</structfield> but different values of <structfield>b</structfield>. In a fully normalized database, functional dependencies should exist only on primary keys and superkeys. However, in practice many data sets are not fully normalized for various reasons; intentional denormalization for performance reasons is a common example. Even in a fully normalized database, there may be partial correlation between some columns, which can be expressed as partial functional dependency. </para> <para> The existence of functional dependencies directly affects the accuracy of estimates in certain queries. If a query contains conditions on both the independent and the dependent column(s), the conditions on the dependent columns do not further reduce the result size; but without knowledge of the functional dependency, the query planner will assume that the conditions are independent, resulting in underestimating the result size. </para> <para> To inform the planner about functional dependencies, <command>ANALYZE</command> can collect measurements of cross-column dependency. Assessing the degree of dependency between all sets of columns would be prohibitively expensive, so data collection is limited to those groups of columns appearing together in a statistics object defined with the <literal>dependencies</literal> option. It is advisable to create <literal>dependencies</literal> statistics only for column groups that are strongly correlated, to avoid unnecessary overhead in both <command>ANALYZE</command> and later query planning. </para> <para> Here is an example of collecting functional-dependency statistics: <programlisting> CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes; ANALYZE zipcodes; SELECT stxname, stxkeys, stxddependencies FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) WHERE stxname = 'stts'; stxname | stxkeys | stxddependencies ---------+---------+------------------------------------------ stts | 1 5 | {"1 => 5": 1.000000, "5 => 1": 0.423130} (1 row) </programlisting> Here it can be seen that column 1 (zip code) fully determines column 5 (city) so the coefficient is 1.0, while city only determines zip code about 42% of the time, meaning that there are many cities (58%) that are represented by more than a single ZIP code. </para> <para> When computing the selectivity for a query involving functionally dependent columns, the planner adjusts the per-condition selectivity estimates using the dependency coefficients so as not to produce an underestimate. </para> <sect4 id="planner-stats-extended-functional-deps-limits"> <title>Limitations of Functional Dependencies</title> <para> Functional dependencies are currently only applied when considering simple equality conditions that compare columns to constant values, and <literal>IN</literal> clauses with constant values. They are not used to improve estimates for equality conditions comparing two columns or comparing a column to an expression, nor for range clauses, <literal>LIKE</literal> or any other type of condition. </para> <para> When estimating with functional dependencies, the planner assumes that conditions on the involved columns are compatible and hence redundant. If they are incompatible, the correct estimate would be zero rows, but that possibility is not considered. For example, given a query like <programlisting> SELECT * FROM zipcodes WHERE city = 'San

PostgreSQL supports functional dependency statistics to improve query estimates by accounting for dependencies between columns, which can be collected using the ANALYZE command and stored in statistics objects defined with the dependencies option, allowing the planner to adjust selectivity estimates and avoid underestimation.