How the Planner Uses Statistics

<chapter id="planner-stats-details"> <title>How the Planner Uses Statistics</title> <para> This chapter builds on the material covered in <xref linkend="using-explain"/> and <xref linkend="planner-stats"/> to show some additional details about how the planner uses the system statistics to estimate the number of rows each part of a query might return. This is a significant part of the planning process, providing much of the raw material for cost calculation. </para> <para> The intent of this chapter is not to document the code in detail, but to present an overview of how it works. This will perhaps ease the learning curve for someone who subsequently wishes to read the code. </para> <sect1 id="row-estimation-examples"> <title>Row Estimation Examples</title> <indexterm zone="row-estimation-examples"> <primary>row estimation</primary> <secondary>planner</secondary> </indexterm> <para> The examples shown below use tables in the <productname>PostgreSQL</productname> regression test database. Note also that since <command>ANALYZE</command> uses random sampling while producing statistics, the results will change slightly after any new <command>ANALYZE</command>. </para> <para> Let's start with a very simple query: <programlisting> EXPLAIN SELECT * FROM tenk1; QUERY PLAN ------------------------------------------------------------- Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244) </programlisting> How the planner determines the cardinality of <structname>tenk1</structname> is covered in <xref linkend="planner-stats"/>, but is repeated here for completeness. The number of pages and rows is looked up in <structname>pg_class</structname>: <programlisting> SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1'; relpages | reltuples ----------+----------- 358 | 10000 </programlisting> These numbers are current as of the last <command>VACUUM</command> or <command>ANALYZE</command> on the table. The planner then fetches the actual current number of pages in the table (this is a cheap operation, not requiring a table scan). If that is different from <structfield>relpages</structfield> then <structfield>reltuples</structfield> is scaled accordingly to arrive at a current number-of-rows estimate. In the example above, the value of <structfield>relpages</structfield> is up-to-date so the rows estimate is the same as <structfield>reltuples</structfield>. </para> <para> Let's move on to an example with a range condition in its <literal>WHERE</literal> clause: <programlisting> EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000; QUERY PLAN -------------------------------------------------------------------&zwsp;------------- Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244) Recheck Cond: (unique1 < 1000) -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0) Index Cond: (unique1 < 1000) </programlisting> The planner examines the <literal>WHERE</literal> clause condition and looks up the selectivity function for the operator <literal><</literal> in <structname>pg_operator</structname>. This is held in the column <structfield>oprrest</structfield>, and the entry in this case is <function>scalarltsel</function>. The <function>scalarltsel</function> function retrieves the histogram for <structfield>unique1</structfield> from <structname>pg_statistic</structname>. For manual queries it is more convenient to look in the simpler <structname>pg_stats</structname> view: <programlisting> SELECT histogram_bounds FROM pg_stats WHERE tablename='tenk1' AND attname='unique1'; histogram_bounds ------------------------------------------------------ {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}

This chapter explains how the planner in PostgreSQL uses system statistics to estimate the number of rows returned by a query, including examples of row estimation and how the planner determines cardinality for simple and complex queries.