 the rows chosen may be correlated with other data such as IDs from
      a sequence or the physical row ordering, which may skew performance
      measurements.
    </para>
    <para>
      To avoid this, you may wish to use the <function>permute</function>
      function, or some other additional step with similar effect, to shuffle
      the selected rows and remove such correlations.
    </para>
   </note>

  <para>
    Hash functions <literal>hash</literal>, <literal>hash_murmur2</literal>, and
    <literal>hash_fnv1a</literal> accept an input value and an optional seed parameter.
    If the seed is not provided, the value of <literal>:default_seed</literal>
    is used, which is initialized randomly unless set by the command-line
    <literal>-D</literal> option.
  </para>
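    As an illustrative sketch (not <application>pgbench</application>'s exact C
    implementation), the following Python models a seeded 64-bit FNV-1a hash over
    the eight little-endian bytes of an integer input; folding the seed into the
    value by XOR is an assumption made for the sketch:

```python
# Illustrative sketch of a seeded 64-bit FNV-1a hash; the way the seed is
# combined with the input here is an assumption, not pgbench's exact code.
FNV64_PRIME = 0x100000001b3
FNV64_OFFSET = 0xcbf29ce484222325
MASK64 = (1 << 64) - 1

def hash_fnv1a(value, seed=0):
    """Hash a 64-bit integer with FNV-1a; the seed perturbs the input."""
    x = (value ^ seed) & MASK64          # fold the seed into the value
    h = FNV64_OFFSET
    for _ in range(8):                   # consume the 8 bytes, low to high
        h = ((h ^ (x & 0xFF)) * FNV64_PRIME) & MASK64
        x >>= 8
    return h

# The same input and seed always hash identically, while different
# seeds decorrelate the outputs (mirroring the :default_seed behavior).
assert hash_fnv1a(42, seed=1) == hash_fnv1a(42, seed=1)
assert hash_fnv1a(42, seed=1) != hash_fnv1a(42, seed=2)
```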

  <para>
    <literal>permute</literal> accepts an input value, a size, and an optional
    seed parameter.  It generates a pseudorandom permutation of integers in
    the range <literal>[0, size)</literal>, and returns the index of the input
    value in the permuted values.  The permutation chosen is parameterized by
    the seed, which defaults to <literal>:default_seed</literal> if not
    specified.  Unlike the hash functions, <function>permute</function> ensures
    that there are no collisions or holes in the output values.  Input values
    outside the interval are interpreted modulo the size.  The function raises
    an error if the size is not positive.  <function>permute</function> can be
    used to scatter the distribution of non-uniform random functions such as
    <literal>random_zipfian</literal> or <literal>random_exponential</literal>
    so that values drawn more often are not trivially correlated.  For
    instance, the following <application>pgbench</application> script
    simulates a possible real-world workload typical of social media and
    blogging platforms, where a few accounts generate excessive load:

<programlisting>
\set size 1000000
\set r random_zipfian(1, :size, 1.07)
\set k 1 + permute(:r, :size)
</programlisting>

    In some cases, several distinct distributions are needed that don't
    correlate with each other; this is when the optional seed parameter
    comes in handy:

<programlisting>
\set k1 1 + permute(:r, :size, :default_seed + 123)
\set k2 1 + permute(:r, :size, :default_seed + 321)
</programlisting>

    A similar behavior can also be approximated with <function>hash</function>:

<programlisting>
\set size 1000000
\set r random_zipfian(1, 100 * :size, 1.07)
\set k 1 + abs(hash(:r)) % :size
</programlisting>

    However, since <function>hash</function> generates collisions, some values
    will not be reachable and others will be more frequent than expected from
    the original distribution.
  </para>
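    The difference can be demonstrated with a toy model (this is not
    <application>pgbench</application>'s actual permutation algorithm, just a
    seeded shuffle with the same interface): a permutation of
    <literal>[0, size)</literal> is a bijection, so every output value occurs
    exactly once, whereas a hash folded modulo the size leaves holes:

```python
# Toy model of permute() semantics: a seeded shuffle of [0, size) is a
# bijection, while abs(hash(...)) % size is not.  Not pgbench's algorithm.
import random

def permute(value, size, seed=0):
    """Index of value within a seeded pseudorandom permutation of [0, size)."""
    perm = list(range(size))
    random.Random(seed).shuffle(perm)    # same seed -> same permutation
    return perm[value % size]            # out-of-range inputs wrap modulo size

size = 1000
outputs = {permute(v, size, seed=42) for v in range(size)}
assert len(outputs) == size              # bijective: no collisions, no holes

# Folding a hash modulo the size doubles some values up and skips others:
hashed = {abs(hash((v, "salt"))) % size for v in range(size)}
assert len(hashed) < size                # collisions leave unreachable values
```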

  <para>
   As an example, the full definition of the built-in TPC-B-like
   transaction is:

<programlisting>
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
</programlisting>

   This script allows each iteration of the transaction to reference
   different, randomly-chosen rows.  (This example also shows why it's
   important for each client session to have its own variables &mdash;
   otherwise they'd not be independently touching different rows.)
  </para>
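    The point about per-session variables can be sketched in Python: two
    simulated clients with independent random state choose rows independently,
    just as each <application>pgbench</application> session evaluates its
    <literal>\set</literal> commands in its own variable namespace (the ranges
    below mirror the built-in script; the helper name is hypothetical):

```python
# Sketch of per-client variable state; mirrors the TPC-B-like \set ranges.
import random

def tpcb_vars(rng, scale=1):
    """One client's \\set assignments for a single transaction."""
    return {
        "aid": rng.randint(1, 100000 * scale),
        "bid": rng.randint(1, 1 * scale),
        "tid": rng.randint(1, 10 * scale),
        "delta": rng.randint(-5000, 5000),
    }

# Each client carries its own RNG, standing in for its own variable set;
# shared state would force every "session" onto the same rows.
client1, client2 = random.Random(1), random.Random(2)
v1, v2 = tpcb_vars(client1), tpcb_vars(client2)
assert 1 <= v1["aid"] <= 100000 and 1 <= v1["tid"] <= 10
```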

 </refsect2>

 <refsect2>
  <title>Per-Transaction Logging</title>

  <para>
   With the <option>-l</option> option (but without
   the <option>--aggregate-interval</option> option),
   <application>pgbench</application> writes information about each transaction
   to
