Extended Statistics in PostgreSQL

id="planner-stats-extended"> <title>Extended Statistics</title> <indexterm zone="planner-stats-extended"> <primary>statistics</primary> <secondary>of the planner</secondary> </indexterm> <indexterm> <primary>correlation</primary> <secondary>in the query planner</secondary> </indexterm> <indexterm> <primary>pg_statistic_ext</primary> </indexterm> <indexterm> <primary>pg_statistic_ext_data</primary> </indexterm> <para> It is common to see slow queries running bad execution plans because multiple columns used in the query clauses are correlated. The planner normally assumes that multiple conditions are independent of each other, an assumption that does not hold when column values are correlated. Regular statistics, because of their per-individual-column nature, cannot capture any knowledge about cross-column correlation. However, <productname>PostgreSQL</productname> has the ability to compute <firstterm>multivariate statistics</firstterm>, which can capture such information. </para> <para> Because the number of possible column combinations is very large, it's impractical to compute multivariate statistics automatically. Instead, <firstterm>extended statistics objects</firstterm>, more often called just <firstterm>statistics objects</firstterm>, can be created to instruct the server to obtain statistics across interesting sets of columns. </para> <para> Statistics objects are created using the <link linkend="sql-createstatistics"><command>CREATE STATISTICS</command></link> command. Creation of such an object merely creates a catalog entry expressing interest in the statistics. Actual data collection is performed by <command>ANALYZE</command> (either a manual command, or background auto-analyze). The collected values can be examined in the <link linkend="catalog-pg-statistic-ext-data"><structname>pg_statistic_ext_data</structname></link> catalog. </para> <para> <command>ANALYZE</command> computes extended statistics based on the same sample of table rows that it takes for computing regular single-column statistics. Since the sample size is increased by increasing the statistics target for the table or any of its columns (as described in the previous section), a larger statistics target will normally result in more accurate extended statistics, as well as more time spent calculating them. </para> <para> The following subsections describe the kinds of extended statistics that are currently supported. </para> <sect3 id="planner-stats-extended-functional-deps"> <title>Functional Dependencies</title> <para> The simplest kind of extended statistics tracks <firstterm>functional dependencies</firstterm>, a concept used in definitions of database normal forms. We say that column <structfield>b</structfield> is functionally dependent on column <structfield>a</structfield> if knowledge of the value of <structfield>a</structfield> is sufficient to determine the value of <structfield>b</structfield>, that is there are no two rows having the same value of <structfield>a</structfield> but different values of <structfield>b</structfield>. In a fully normalized database, functional dependencies should exist only on primary keys and superkeys. However, in practice many data sets are not fully normalized for various reasons; intentional denormalization for performance reasons is a common example. Even in a fully normalized database, there may be partial correlation between some columns, which can be expressed as partial functional dependency. </para> <para> The existence of functional dependencies directly affects the accuracy of estimates in certain queries. If a query contains conditions on both the independent and the dependent

PostgreSQL supports extended statistics to capture cross-column correlation, which can improve query execution plans by accounting for dependencies between columns, and can be created using the CREATE STATISTICS command, with data collection performed by ANALYZE and stored in the pg_statistic_ext_data catalog.