Controlling the Query Planner with Explicit JOIN Clauses

0.000113
     4 | {New York, NY}   | {f,f} |  0.001967 |       0.000114
     5 | {Atlanta, GA}    | {f,f} |  0.001633 |        3.3e-05
     6 | {Sacramento, CA} | {f,f} |  0.001433 |        7.8e-05
     7 | {Miami, FL}      | {f,f} |    0.0014 |          6e-05
     8 | {Dallas, TX}     | {f,f} |  0.001367 |        8.8e-05
     9 | {Chicago, IL}    | {f,f} |  0.001333 |        5.1e-05
   ...
(99 rows)
</programlisting>
     This indicates that the most common combination of city and state is
     Washington in DC, with actual frequency (in the sample) about 0.35%.
     The base frequency of the combination (as computed from the simple
     per-column frequencies) is only 0.0027%, so an estimate built from the
     per-column frequencies is too low by about two orders of magnitude.
    </para>

    <para>
     It's advisable to create <acronym>MCV</acronym> statistics objects only
     on combinations of columns that are actually used in conditions together,
     and for which misestimation of the number of groups is resulting in bad
     plans.  Otherwise, the <command>ANALYZE</command> and planning cycles
     are just wasted.
    </para>
   </sect3>
  </sect2>
 </sect1>

 <sect1 id="explicit-joins">
  <title>Controlling the Planner with Explicit <literal>JOIN</literal> Clauses</title>

  <indexterm zone="explicit-joins">
   <primary>join</primary>
   <secondary>controlling the order</secondary>
  </indexterm>

  <para>
   It is possible to control the query planner to some extent by using the
   explicit <literal>JOIN</literal> syntax.  To see why this matters, we
   first need some background.
  </para>

  <para>
   In a simple join query, such as:
<programlisting>
SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
</programlisting>
   the planner is free to join the given tables in any order.  For example,
   it could generate a query plan that joins A to B, using the
   <literal>WHERE</literal> condition <literal>a.id = b.id</literal>, and
   then joins C to this joined table, using the other
   <literal>WHERE</literal> condition.  Or it could join B to C and then
   join A to that result.
   Or it could join A to C and then join them with B, but that would be
   inefficient, since the full Cartesian product of A and C would have to
   be formed, there being no applicable condition in the
   <literal>WHERE</literal> clause to allow optimization of the join.  (All
   joins in the <productname>PostgreSQL</productname> executor happen
   between two input tables, so it's necessary to build up the result in
   one or another of these fashions.)  The important point is that these
   different join possibilities give semantically equivalent results but
   might have hugely different execution costs.  Therefore, the planner
   will explore all of them to try to find the most efficient query plan.
  </para>

  <para>
   When a query only involves two or three tables, there aren't many join
   orders to worry about.  But the number of possible join orders grows
   exponentially as the number of tables expands.  Beyond ten or so input
   tables it's no longer practical to do an exhaustive search of all the
   possibilities, and even for six or seven tables planning might take an
   annoyingly long time.  When there are too many input tables, the
   <productname>PostgreSQL</productname> planner will switch from exhaustive
   search to a <firstterm>genetic</firstterm> probabilistic search through a
   limited number of possibilities.  (The switch-over threshold is set by the
   <xref linkend="guc-geqo-threshold"/> run-time parameter.)  The genetic
   search takes less time, but it won't necessarily find the best possible
   plan.
  </para>

  <para>
   When the query involves outer joins, the planner has less freedom than it
   does for plain (inner) joins.  For example, consider:
<programlisting>
SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
</programlisting>
   Although this query's restrictions are

The PostgreSQL query planner can be controlled to some extent with explicit JOIN syntax. Writing the joins explicitly lets the query author specify the join order, which can improve performance by reducing the number of join orders the planner has to consider; this matters most for queries over many tables, where an exhaustive search becomes expensive. For outer joins, the planner already has less freedom to reorder than it has for plain inner joins.
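
  <para>
   As a minimal sketch of how the join order might be pinned in practice,
   the following session rewrites the three-table example with explicit
   <literal>JOIN</literal> syntax and asks for its plan.  It assumes that
   tables <literal>a</literal>, <literal>b</literal> and <literal>c</literal>
   exist with the columns used in the example.  It also relies on the
   <literal>join_collapse_limit</literal> run-time parameter, which is not
   discussed in the text above and is brought in here as an assumption
   about how the session is configured: with a value of 1, the planner is
   expected to keep explicit joins in the order written.
<programlisting>
-- Assumption: tables a, b and c exist as in the example query.

-- Show the current switch-over point for the genetic search.
SHOW geqo_threshold;

BEGIN;

-- Assumption: a limit of 1 stops the planner from reordering explicit JOINs.
-- SET LOCAL keeps the change confined to this transaction.
SET LOCAL join_collapse_limit = 1;

-- B is joined to C first, and A is then joined to that result.
EXPLAIN
SELECT *
FROM a
     JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);

ROLLBACK;
</programlisting>
   Running the same <command>EXPLAIN</command> without the
   <command>SET LOCAL</command> line shows the plan the planner chooses
   when it is free to reorder the joins, which is one way to judge
   whether constraining the order is worthwhile for a given query.
  </para>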