Home Explore Blog CI



postgresql

22th chunk of `doc/src/sgml/perform.sgml`
294bc3aa5507e566b8e2e26ea947b1b0ff3dbaec9adb9ef20000000100000fa5
 id="planner-stats-extended">
   <title>Extended Statistics</title>

   <indexterm zone="planner-stats-extended">
    <primary>statistics</primary>
    <secondary>of the planner</secondary>
   </indexterm>

   <indexterm>
    <primary>correlation</primary>
    <secondary>in the query planner</secondary>
   </indexterm>

   <indexterm>
    <primary>pg_statistic_ext</primary>
   </indexterm>

   <indexterm>
    <primary>pg_statistic_ext_data</primary>
   </indexterm>

   <para>
    It is common to see slow queries running bad execution plans because
    multiple columns used in the query clauses are correlated.
    The planner normally assumes that multiple conditions
    are independent of each other,
    an assumption that does not hold when column values are correlated.
    Regular statistics, because of their per-individual-column nature,
    cannot capture any knowledge about cross-column correlation.
    However, <productname>PostgreSQL</productname> has the ability to compute
    <firstterm>multivariate statistics</firstterm>, which can capture
    such information.
   </para>

   <para>
    Because the number of possible column combinations is very large,
    it's impractical to compute multivariate statistics automatically.
    Instead, <firstterm>extended statistics objects</firstterm>, more often
    called just <firstterm>statistics objects</firstterm>, can be created to instruct
    the server to obtain statistics across interesting sets of columns.
   </para>

   <para>
    Statistics objects are created using the
    <link linkend="sql-createstatistics"><command>CREATE STATISTICS</command></link> command.
    Creation of such an object merely creates a catalog entry expressing
    interest in the statistics.  Actual data collection is performed
    by <command>ANALYZE</command> (either a manual command, or background
    auto-analyze).  The collected values can be examined in the
    <link linkend="catalog-pg-statistic-ext-data"><structname>pg_statistic_ext_data</structname></link>
    catalog.
   </para>

   <para>
    <command>ANALYZE</command> computes extended statistics based on the same
    sample of table rows that it takes for computing regular single-column
    statistics.  Since the sample size is increased by increasing the
    statistics target for the table or any of its columns (as described in
    the previous section), a larger statistics target will normally result in
    more accurate extended statistics, as well as more time spent calculating
    them.
   </para>

   <para>
    The following subsections describe the kinds of extended statistics
    that are currently supported.
   </para>

   <sect3 id="planner-stats-extended-functional-deps">
    <title>Functional Dependencies</title>

    <para>
     The simplest kind of extended statistics tracks <firstterm>functional
     dependencies</firstterm>, a concept used in definitions of database normal forms.
     We say that column <structfield>b</structfield> is functionally dependent on
     column <structfield>a</structfield> if knowledge of the value of
     <structfield>a</structfield> is sufficient to determine the value
     of <structfield>b</structfield>, that is there are no two rows having the same value
     of <structfield>a</structfield> but different values of <structfield>b</structfield>.
     In a fully normalized database, functional dependencies should exist
     only on primary keys and superkeys. However, in practice many data sets
     are not fully normalized for various reasons; intentional
     denormalization for performance reasons is a common example.
     Even in a fully normalized database, there may be partial correlation
     between some columns, which can be expressed as partial functional
     dependency.
    </para>

    <para>
     The existence of functional dependencies directly affects the accuracy
     of estimates in certain queries.  If a query contains conditions on
     both the independent and the dependent

Title: Extended Statistics in PostgreSQL
Summary
PostgreSQL supports extended statistics to capture cross-column correlation, which can improve query execution plans by accounting for dependencies between columns, and can be created using the CREATE STATISTICS command, with data collection performed by ANALYZE and stored in the pg_statistic_ext_data catalog.