Limitations of Functional Dependencies and N-Distinct Counts

produce an underestimate. </para> <sect4 id="planner-stats-extended-functional-deps-limits"> <title>Limitations of Functional Dependencies</title> <para> Functional dependencies are currently only applied when considering simple equality conditions that compare columns to constant values, and <literal>IN</literal> clauses with constant values. They are not used to improve estimates for equality conditions comparing two columns or comparing a column to an expression, nor for range clauses, <literal>LIKE</literal> or any other type of condition. </para> <para> When estimating with functional dependencies, the planner assumes that conditions on the involved columns are compatible and hence redundant. If they are incompatible, the correct estimate would be zero rows, but that possibility is not considered. For example, given a query like <programlisting> SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105'; </programlisting> the planner will disregard the <structfield>city</structfield> clause as not changing the selectivity, which is correct. However, it will make the same assumption about <programlisting> SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210'; </programlisting> even though there will really be zero rows satisfying this query. Functional dependency statistics do not provide enough information to conclude that, however. </para> <para> In many practical situations, this assumption is usually satisfied; for example, there might be a GUI in the application that only allows selecting compatible city and ZIP code values to use in a query. But if that's not the case, functional dependencies may not be a viable option. </para> </sect4> </sect3> <sect3 id="planner-stats-extended-n-distinct-counts"> <title>Multivariate N-Distinct Counts</title> <para> Single-column statistics store the number of distinct values in each column. Estimates of the number of distinct values when combining more than one column (for example, for <literal>GROUP BY a, b</literal>) are frequently wrong when the planner only has single-column statistical data, causing it to select bad plans. </para> <para> To improve such estimates, <command>ANALYZE</command> can collect n-distinct statistics for groups of columns. As before, it's impractical to do this for every possible column grouping, so data is collected only for those groups of columns appearing together in a statistics object defined with the <literal>ndistinct</literal> option. Data will be collected for each possible combination of two or more columns from the set of listed columns. </para> <para> Continuing the previous example, the n-distinct counts in a table of ZIP codes might look like the following: <programlisting> CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes; ANALYZE zipcodes; SELECT stxkeys AS k, stxdndistinct AS nd FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) WHERE stxname = 'stts2'; -[ RECORD 1 ]------------------------------------------------------&zwsp;-- k | 1 2 5 nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178} (1 row) </programlisting> This indicates that there are three combinations of columns that have 33178 distinct values: ZIP code and state; ZIP code and city; and ZIP code, city and state (the fact that they are all equal is expected given that ZIP code alone is unique in this table). On the other hand, the combination of city and state has only 27435 distinct values. </para> <para> It's advisable to create <literal>ndistinct</literal> statistics objects only on combinations of columns that are actually used for grouping, and for

Functional dependencies in PostgreSQL have limitations, such as only applying to simple equality conditions and not accounting for incompatible conditions, which can lead to incorrect estimates, and to address this, multivariate n-distinct counts can be collected using the ANALYZE command to improve estimates for grouping operations, but it's essential to create statistics objects only for relevant column combinations to avoid unnecessary overhead.