Ordered-Set Aggregates and Partial Aggregation

CREATE FUNCTION ordered_set_transition(internal, anyelement) RETURNS internal ...; CREATE FUNCTION percentile_disc_final(internal, float8, anyelement) RETURNS anyelement ...; CREATE AGGREGATE percentile_disc (float8 ORDER BY anyelement) ( sfunc = ordered_set_transition, stype = internal, finalfunc = percentile_disc_final, finalfunc_extra ); </programlisting> This aggregate takes a <type>float8</type> direct argument (the percentile fraction) and an aggregated input that can be of any sortable data type. It could be used to obtain a median household income like this: <programlisting> SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY income) FROM households; percentile_disc ----------------- 50489 </programlisting> Here, <literal>0.5</literal> is a direct argument; it would make no sense for the percentile fraction to be a value varying across rows. </para> <para> Unlike the case for normal aggregates, the sorting of input rows for an ordered-set aggregate is <emphasis>not</emphasis> done behind the scenes, but is the responsibility of the aggregate's support functions. The typical implementation approach is to keep a reference to a <quote>tuplesort</quote> object in the aggregate's state value, feed the incoming rows into that object, and then complete the sorting and read out the data in the final function. This design allows the final function to perform special operations such as injecting additional <quote>hypothetical</quote> rows into the data to be sorted. While normal aggregates can often be implemented with support functions written in <application>PL/pgSQL</application> or another PL language, ordered-set aggregates generally have to be written in C, since their state values aren't definable as any SQL data type. (In the above example, notice that the state value is declared as type <type>internal</type> — this is typical.) Also, because the final function performs the sort, it is not possible to continue adding input rows by executing the transition function again later. This means the final function is not <literal>READ_ONLY</literal>; it must be declared in <link linkend="sql-createaggregate"><command>CREATE AGGREGATE</command></link> as <literal>READ_WRITE</literal>, or as <literal>SHAREABLE</literal> if it's possible for additional final-function calls to make use of the already-sorted state. </para> <para> The state transition function for an ordered-set aggregate receives the current state value plus the aggregated input values for each row, and returns the updated state value. This is the same definition as for normal aggregates, but note that the direct arguments (if any) are not provided. The final function receives the last state value, the values of the direct arguments if any, and (if <literal>finalfunc_extra</literal> is specified) null values corresponding to the aggregated input(s). As with normal aggregates, <literal>finalfunc_extra</literal> is only really useful if the aggregate is polymorphic; then the extra dummy argument(s) are needed to connect the final function's result type to the aggregate's input type(s). </para> <para> Currently, ordered-set aggregates cannot be used as window functions, and therefore there is no need for them to support moving-aggregate mode. </para> </sect2> <sect2 id="xaggr-partial-aggregates"> <title>Partial Aggregation</title> <indexterm> <primary>aggregate function</primary> <secondary>partial aggregation</secondary> </indexterm> <para> Optionally, an aggregate function can support <firstterm>partial aggregation</firstterm>. The idea of partial aggregation is to run the aggregate's state transition function over different subsets of the input data independently, and then to combine the state values resulting from those subsets to produce the same

Ordered-set aggregates require manual sorting of input rows and are typically implemented in C, with the final function performing the sort and possibly injecting hypothetical rows, and they cannot be used as window functions, whereas partial aggregation allows an aggregate function to run over different subsets of the input data independently and then combine the state values to produce the same result as running the aggregate over the entire dataset