<!-- doc/src/sgml/parallel.sgml -->
<chapter id="parallel-query">
<title>Parallel Query</title>
<indexterm zone="parallel-query">
<primary>parallel query</primary>
</indexterm>
<para>
<productname>PostgreSQL</productname> can devise query plans that can leverage
multiple CPUs in order to answer queries faster. This feature is known
as parallel query. Many queries cannot benefit from parallel query, either
due to limitations of the current implementation or because there is no
imaginable query plan that is any faster than the serial query plan.
However, for queries that can benefit, the speedup from parallel query
is often very significant. Many queries can run more than twice as fast
when using parallel query, and some queries can run four times faster or
even more. Queries that touch a large amount of data but return only a
few rows to the user will typically benefit most. This chapter explains
some details of how parallel query works and in which situations it can be
used so that users who wish to make use of it can understand what to expect.
</para>
<sect1 id="how-parallel-query-works">
<title>How Parallel Query Works</title>
<para>
When the optimizer determines that parallel query is the fastest execution
strategy for a particular query, it will create a query plan that includes
a <firstterm>Gather</firstterm> or <firstterm>Gather Merge</firstterm>
node. Here is a simple example:
<screen>
EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
                                     QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------
 Gather  (cost=1000.00..217018.43 rows=1 width=97)
   Workers Planned: 2
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..216018.33 rows=1 width=97)
         Filter: (filler ~~ '%x%'::text)
(4 rows)
</screen>
</para>
<para>
In all cases, the <literal>Gather</literal> or
<literal>Gather Merge</literal> node will have exactly one
child plan, which is the portion of the plan that will be executed in
parallel. If the <literal>Gather</literal> or <literal>Gather Merge</literal> node is
at the very top of the plan tree, then the entire query will execute in
parallel. If it is somewhere else in the plan tree, then only the portion
of the plan below it will run in parallel. In the example above, the
query accesses only one table, so there is only one plan node other than
the <literal>Gather</literal> node itself; since that plan node is a child of the
<literal>Gather</literal> node, it will run in parallel.
</para>
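<para>
For instance, if an aggregate is computed over the same table, the
<literal>Gather</literal> node will typically appear below a
<literal>Finalize Aggregate</literal> node, so that only the partial
aggregation and the scan run in parallel. Assuming the planner again chooses
a parallel plan, a plan of roughly the following shape might be produced
(costs are omitted here, and the exact plan depends on the configuration and
on the table's size and statistics):
<screen>
EXPLAIN (COSTS OFF) SELECT count(*) FROM pgbench_accounts;
                       QUERY PLAN
---------------------------------------------------------
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Seq Scan on pgbench_accounts
(5 rows)
</screen>
</para>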
<para>
<link linkend="using-explain">Using EXPLAIN</link>, you can see the number of
workers chosen by the planner. When the <literal>Gather</literal> node is reached
during query execution, the process that is implementing the user's
session will request a number of <link linkend="bgworker">background
worker processes</link> equal to the number
of workers chosen by the planner. The number of background workers that
the planner will consider using is limited to at most
<xref linkend="guc-max-parallel-workers-per-gather"/>. The total number
of background workers that can exist at any one time is limited by both
<xref linkend="guc-max-worker-processes"/> and
<xref linkend="guc-max-parallel-workers"/>. Therefore, it is possible for a
parallel query to run with fewer workers than planned, or even with
no workers at all. The optimal plan may depend on the number of workers
that are available, so this can result in poor query performance. If this
occurrence is frequent, consider increasing
<varname>max_worker_processes</varname> and <varname>max_parallel_workers</varname>
so that more workers can be run simultaneously, or alternatively reducing
<varname>max_parallel_workers_per_gather</varname> so that the planner
requests fewer workers.
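For example, the relevant settings can be inspected and adjusted as follows
(the values shown are only illustrative):
<programlisting>
-- Inspect the current limits.
SHOW max_worker_processes;
SHOW max_parallel_workers;
SHOW max_parallel_workers_per_gather;

-- Allow the planner to consider more workers per Gather node in the
-- current session only.
SET max_parallel_workers_per_gather = 4;

-- Raise the cluster-wide limits persistently; note that a change to
-- max_worker_processes takes effect only after a server restart.
ALTER SYSTEM SET max_worker_processes = 16;
ALTER SYSTEM SET max_parallel_workers = 12;
</programlisting>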