 the rows chosen may be correlated with other data such as IDs from
      a sequence or the physical row ordering, which may skew performance
      measurements.
    </para>
    <para>
      To avoid this, you may wish to use the <function>permute</function>
      function, or some other additional step with similar effect, to shuffle
      the selected rows and remove such correlations.
    </para>
   </note>

  <para>
    Hash functions <literal>hash</literal>, <literal>hash_murmur2</literal>, and
    <literal>hash_fnv1a</literal> accept an input value and an optional seed parameter.
    If the seed is not provided, the value of <literal>:default_seed</literal>
    is used, which is initialized randomly unless set by the command-line
    <literal>-D</literal> option.
  </para>
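    As an illustrative sketch (not <application>pgbench</application>'s exact C
    implementation), the following Python models a seeded 64-bit FNV-1a hash over
    the eight little-endian bytes of an integer input; folding the seed into the
    value by XOR is an assumption made for the sketch:

```python
# Illustrative sketch of a seeded 64-bit FNV-1a hash; the way the seed is
# combined with the input here is an assumption, not pgbench's exact code.
FNV64_PRIME = 0x100000001b3
FNV64_OFFSET = 0xcbf29ce484222325
MASK64 = (1 << 64) - 1

def hash_fnv1a(value, seed=0):
    """Hash a 64-bit integer with FNV-1a; the seed perturbs the input."""
    x = (value ^ seed) & MASK64          # fold the seed into the value
    h = FNV64_OFFSET
    for _ in range(8):                   # consume the 8 bytes, low to high
        h = ((h ^ (x & 0xFF)) * FNV64_PRIME) & MASK64
        x >>= 8
    return h

# The same input and seed always hash identically, while different
# seeds decorrelate the outputs (mirroring the :default_seed behavior).
assert hash_fnv1a(42, seed=1) == hash_fnv1a(42, seed=1)
assert hash_fnv1a(42, seed=1) != hash_fnv1a(42, seed=2)
```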

  <para>
    <literal>permute</literal> accepts an input value, a size, and an optional
    seed parameter.  It generates a pseudorandom permutation of integers in
    the range <literal>[0, size)</literal>, and returns the index of the input
    value in the permuted values.  The permutation chosen is parameterized by
    the seed, which defaults to <literal>:default_seed</literal> if not
    specified.  Unlike the hash functions, <function>permute</function> ensures
    that there are no collisions or holes in the output values.  Input values
    outside the interval are interpreted modulo the size.  The function raises
    an error if the size is not positive.  <function>permute</function> can be
    used to scatter the distribution of non-uniform random functions such as
    <literal>random_zipfian</literal> or <literal>random_exponential</literal>
    so that values drawn more often are not trivially correlated.  For
    instance, the following <application>pgbench</application> script
    simulates a possible real-world workload typical of social media and
    blogging platforms, where a few accounts generate excessive load:

<programlisting>
\set size 1000000
\set r random_zipfian(1, :size, 1.07)
\set k 1 + permute(:r, :size)
</programlisting>

    In some cases, several distinct distributions are needed that don't
    correlate with each other; this is when the optional seed parameter
    comes in handy:

<programlisting>
\set k1 1 + permute(:r, :size, :default_seed + 123)
\set k2 1 + permute(:r, :size, :default_seed + 321)
</programlisting>

    A similar behavior can also be approximated with <function>hash</function>:

<programlisting>
\set size 1000000
\set r random_zipfian(1, 100 * :size, 1.07)
\set k 1 + abs(hash(:r)) % :size
</programlisting>

    However, since <function>hash</function> generates collisions, some values
    will not be reachable and others will be more frequent than expected from
    the original distribution.
  </para>
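    The difference can be demonstrated with a toy model (this is not
    <application>pgbench</application>'s actual permutation algorithm, just a
    seeded shuffle with the same interface): a permutation of
    <literal>[0, size)</literal> is a bijection, so every output value occurs
    exactly once, whereas a hash folded modulo the size leaves holes:

```python
# Toy model of permute() semantics: a seeded shuffle of [0, size) is a
# bijection, while abs(hash(...)) % size is not.  Not pgbench's algorithm.
import random

def permute(value, size, seed=0):
    """Index of value within a seeded pseudorandom permutation of [0, size)."""
    perm = list(range(size))
    random.Random(seed).shuffle(perm)    # same seed -> same permutation
    return perm[value % size]            # out-of-range inputs wrap modulo size

size = 1000
outputs = {permute(v, size, seed=42) for v in range(size)}
assert len(outputs) == size              # bijective: no collisions, no holes

# Folding a hash modulo the size doubles some values up and skips others:
hashed = {abs(hash((v, "salt"))) % size for v in range(size)}
assert len(hashed) < size                # collisions leave unreachable values
```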

  <para>
   As an example, the full definition of the built-in TPC-B-like
   transaction is:

<programlisting>
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
</programlisting>

   This script allows each iteration of the transaction to reference
   different, randomly-chosen rows.  (This example also shows why it's
   important for each client session to have its own variables &mdash;
   otherwise they'd not be independently touching different rows.)
  </para>
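    The point about per-session variables can be sketched in Python: two
    simulated clients with independent random state choose rows independently,
    just as each <application>pgbench</application> session evaluates its
    <literal>\set</literal> commands in its own variable namespace (the ranges
    below mirror the built-in script; the helper name is hypothetical):

```python
# Sketch of per-client variable state; mirrors the TPC-B-like \set ranges.
import random

def tpcb_vars(rng, scale=1):
    """One client's \\set assignments for a single transaction."""
    return {
        "aid": rng.randint(1, 100000 * scale),
        "bid": rng.randint(1, 1 * scale),
        "tid": rng.randint(1, 10 * scale),
        "delta": rng.randint(-5000, 5000),
    }

# Each client carries its own RNG, standing in for its own variable set;
# shared state would force every "session" onto the same rows.
client1, client2 = random.Random(1), random.Random(2)
v1, v2 = tpcb_vars(client1), tpcb_vars(client2)
assert 1 <= v1["aid"] <= 100000 and 1 <= v1["tid"] <= 10
```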

 </refsect2>

 <refsect2>
  <title>Per-Transaction Logging</title>

  <para>
   With the <option>-l</option> option (but without
   the <option>--aggregate-interval</option> option),
   <application>pgbench</application> writes information about each transaction
   to
