id="planner-stats-extended">
<title>Extended Statistics</title>
<indexterm zone="planner-stats-extended">
<primary>statistics</primary>
<secondary>of the planner</secondary>
</indexterm>
<indexterm>
<primary>correlation</primary>
<secondary>in the query planner</secondary>
</indexterm>
<indexterm>
<primary>pg_statistic_ext</primary>
</indexterm>
<indexterm>
<primary>pg_statistic_ext_data</primary>
</indexterm>
<para>
It is common to see slow queries running bad execution plans because
multiple columns used in the query clauses are correlated.
The planner normally assumes that multiple conditions
are independent of each other,
an assumption that does not hold when column values are correlated.
Regular statistics, because of their per-individual-column nature,
cannot capture any knowledge about cross-column correlation.
However, <productname>PostgreSQL</productname> has the ability to compute
<firstterm>multivariate statistics</firstterm>, which can capture
such information.
</para>
<para>
Because the number of possible column combinations is very large,
it's impractical to compute multivariate statistics automatically.
Instead, <firstterm>extended statistics objects</firstterm>, more often
called just <firstterm>statistics objects</firstterm>, can be created to instruct
the server to obtain statistics across interesting sets of columns.
</para>
<para>
Statistics objects are created using the
<link linkend="sql-createstatistics"><command>CREATE STATISTICS</command></link> command.
Creation of such an object merely creates a catalog entry expressing
interest in the statistics. Actual data collection is performed
by <command>ANALYZE</command> (either a manual command, or background
auto-analyze). The collected values can be examined in the
<link linkend="catalog-pg-statistic-ext-data"><structname>pg_statistic_ext_data</structname></link>
catalog.
</para>
<para>
<command>ANALYZE</command> computes extended statistics based on the same
sample of table rows that it takes for computing regular single-column
statistics. Since the sample size is increased by increasing the
statistics target for the table or any of its columns (as described in
the previous section), a larger statistics target will normally result in
more accurate extended statistics, as well as more time spent calculating
them.
</para>
<para>
The following subsections describe the kinds of extended statistics
that are currently supported.
</para>
<sect3 id="planner-stats-extended-functional-deps">
<title>Functional Dependencies</title>
<para>
The simplest kind of extended statistics tracks <firstterm>functional
dependencies</firstterm>, a concept used in definitions of database normal forms.
We say that column <structfield>b</structfield> is functionally dependent on
column <structfield>a</structfield> if knowledge of the value of
<structfield>a</structfield> is sufficient to determine the value
of <structfield>b</structfield>, that is there are no two rows having the same value
of <structfield>a</structfield> but different values of <structfield>b</structfield>.
In a fully normalized database, functional dependencies should exist
only on primary keys and superkeys. However, in practice many data sets
are not fully normalized for various reasons; intentional
denormalization for performance reasons is a common example.
Even in a fully normalized database, there may be partial correlation
between some columns, which can be expressed as partial functional
dependency.
</para>
<para>
The existence of functional dependencies directly affects the accuracy
of estimates in certain queries. If a query contains conditions on
both the independent and the dependent