28th chunk of `doc/src/sgml/ref/pgbench.sgml`
 the standard normal distribution, with mean <literal>mu</literal>
      defined as <literal>(max + min) / 2.0</literal>, with
<literallayout>
f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
       (2.0 * PHI(parameter) - 1)
</literallayout>
      then value <replaceable>i</replaceable> between <replaceable>min</replaceable> and
      <replaceable>max</replaceable> inclusive is drawn with probability:
      <literal>f(i + 0.5) - f(i - 0.5)</literal>.
      Intuitively, the larger the <replaceable>parameter</replaceable>, the more
      frequently values close to the middle of the interval are drawn, and the
      less frequently values close to the <replaceable>min</replaceable> and
      <replaceable>max</replaceable> bounds. About 67% of values are drawn from the
      middle <literal>1.0 / parameter</literal> of the interval, that is, within a
      relative distance of <literal>0.5 / parameter</literal> on either side of the
      mean, and 95% from the middle <literal>2.0 / parameter</literal>, that is,
      within a relative distance of <literal>1.0 / parameter</literal> of the mean.
      For instance, if <replaceable>parameter</replaceable> is 4.0, 67% of values are
      drawn from the middle quarter (<literal>1.0 / 4.0</literal>) of the interval
      (i.e., from <literal>3.0 / 8.0</literal> to <literal>5.0 / 8.0</literal>) and 95% from
      the middle half (<literal>2.0 / 4.0</literal>) of the interval (second and third
      quartiles). The minimum allowed <replaceable>parameter</replaceable>
      value is 2.0.
     </para>
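     <para>
      The formula above can be checked numerically. The following is a minimal
      Python sketch, not part of <application>pgbench</application>; the names
      <literal>phi</literal>, <literal>f</literal> and
      <literal>gaussian_probability</literal>, and the range 1 to 1000, are
      assumptions chosen for illustration. It evaluates
      <literal>f(i + 0.5) - f(i - 0.5)</literal> for every value in the range
      and sums the probabilities over the middle quarter and middle half for a
      <replaceable>parameter</replaceable> of 4.0.
     </para>
<programlisting>
```python
import math


def phi(x):
    """CDF of the standard normal distribution (PHI in the text)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def f(x, lo, hi, parameter):
    """The function f from the text; mu is the midpoint of [lo, hi]."""
    mu = (hi + lo) / 2.0
    scaled = 2.0 * parameter * (x - mu) / (hi - lo + 1)
    return phi(scaled) / (2.0 * phi(parameter) - 1.0)


def gaussian_probability(i, lo, hi, parameter):
    """Probability of drawing integer i: f(i + 0.5) - f(i - 0.5)."""
    return f(i + 0.5, lo, hi, parameter) - f(i - 0.5, lo, hi, parameter)


# Illustrative range and parameter (assumptions, not pgbench defaults).
lo, hi, parameter = 1, 1000, 4.0
probs = [gaussian_probability(i, lo, hi, parameter) for i in range(lo, hi + 1)]

total = sum(probs)                # the probabilities should sum to 1
middle_quarter = sum(probs[375:625])   # values 376..625, i.e. 3/8 to 5/8
middle_half = sum(probs[250:750])      # values 251..750, i.e. 1/4 to 3/4
print(f"total={total:.6f} quarter={middle_quarter:.3f} half={middle_half:.3f}")
```
</programlisting>
     <para>
      The two partial sums should come out close to the approximate 67% and 95%
      figures quoted above.
     </para>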
    </listitem>
    <listitem>
     <para>
      <literal>random_zipfian</literal> generates a bounded Zipfian
      distribution.
      <replaceable>parameter</replaceable> defines how skewed the distribution
      is. The larger the <replaceable>parameter</replaceable>, the more
      frequently values closer to the beginning of the interval are drawn.
      The distribution is such that, assuming the range starts from 1,
      the ratio of the probability of drawing <replaceable>k</replaceable>
      versus drawing <replaceable>k+1</replaceable> is
      <literal>((<replaceable>k</replaceable>+1)/<replaceable>k</replaceable>)**<replaceable>parameter</replaceable></literal>.
      For example, <literal>random_zipfian(1, ..., 2.5)</literal> produces
      the value <literal>1</literal> about <literal>(2/1)**2.5 =
      5.66</literal> times more frequently than <literal>2</literal>, which
      itself is produced <literal>(3/2)**2.5 = 2.76</literal> times more
      frequently than <literal>3</literal>, and so on.
     </para>
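     <para>
      The ratio rule implies that the probability of drawing
      <replaceable>k</replaceable> is proportional to
      <literal>1 / k**parameter</literal>. The following minimal Python sketch
      illustrates this by normalizing those weights directly; it is not the
      algorithm used by <application>pgbench</application>, and the function
      name <literal>zipfian_probabilities</literal> and the upper bound of 100
      are assumptions made for the example.
     </para>
<programlisting>
```python
def zipfian_probabilities(n, parameter):
    """Probabilities of 1..n when P(k) is proportional to 1 / k**parameter."""
    weights = [1.0 / k ** parameter for k in range(1, n + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


# Hypothetical upper bound of 100; the ratios do not depend on it.
probs = zipfian_probabilities(100, 2.5)
ratio_1_2 = probs[0] / probs[1]   # P(1) / P(2), expected (2/1)**2.5
ratio_2_3 = probs[1] / probs[2]   # P(2) / P(3), expected (3/2)**2.5
print(f"{ratio_1_2:.2f} {ratio_2_3:.2f}")
```
</programlisting>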
     <para>
      <application>pgbench</application>'s implementation is based on
      "Non-Uniform Random Variate Generation", Luc Devroye, pp. 550-551,
      Springer 1986.  Due to limitations of that algorithm,
      the <replaceable>parameter</replaceable> value is restricted to
      the range [1.001, 1000].
     </para>
    </listitem>
   </itemizedlist>

   <note>
    <para>
      When designing a benchmark that selects rows non-uniformly, be aware
      that the rows chosen may be correlated with other data, such as IDs from
      a sequence or the physical row ordering, and that this correlation may
      skew performance measurements.
    </para>
    <para>
      To avoid this, you may wish to use the <function>permute</function>
      function, or some other additional step with similar effect, to shuffle
      the selected rows and remove such correlations.
    </para>
   </note>
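  <para>
    The effect described in the note can be pictured with a small Python
    sketch. It uses a shuffled lookup table as a stand-in for a pseudorandom
    permutation; this is an assumption made for illustration and is not the
    algorithm behind <function>permute</function>. The name
    <literal>make_permutation</literal>, the size of 1000 and the seed of 42
    are likewise hypothetical.
  </para>
<programlisting>
```python
import random


def make_permutation(size, seed):
    """Return a fixed pseudorandom permutation of range(size) as a table.

    Stand-in only: this is not how pgbench's permute function works.
    """
    table = list(range(size))
    random.Random(seed).shuffle(table)
    return table


size = 1000
table = make_permutation(size, seed=42)

# Suppose a skewed generator mostly returns the smallest values, which
# would otherwise map to neighbouring rows.
hot_values = list(range(10))
shuffled_hot = [table[v] for v in hot_values]

# The mapping is a bijection, so how often each row is hit is unchanged;
# only which rows are hot changes.
print(sorted(table) == list(range(size)), shuffled_hot)
```
</programlisting>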

  <para>
    Hash functions <literal>hash</literal>, <literal>hash_murmur2</literal> and
    <literal>hash_fnv1a</literal> accept an input value and an optional seed parameter.
    If the seed is not provided, the value of <literal>:default_seed</literal>
    is used; that variable is initialized randomly unless set by the command-line
    <literal>-D</literal> option.
  </para>
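  <para>
    To show how a seed changes a hash result, the following minimal Python
    sketch implements the generic 64-bit FNV-1a scheme over the eight
    low-order bytes of an integer. The offset basis and prime are the standard
    FNV constants, but mixing the seed in by XOR with the offset basis is an
    assumption of this sketch, and the function name
    <literal>fnv1a_64</literal> is hypothetical. The sketch returns an unsigned
    64-bit integer, and its results should not be expected to match those of
    <literal>hash_fnv1a</literal>.
  </para>
<programlisting>
```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MODULUS = 2 ** 64                      # keep all arithmetic within 64 bits


def fnv1a_64(value, seed=0):
    """Generic 64-bit FNV-1a over the eight low-order bytes of value.

    Assumption: the seed is mixed in by XOR with the offset basis.
    """
    state = (FNV_OFFSET_BASIS ^ seed) % MODULUS
    value %= MODULUS
    for _ in range(8):
        octet = value % 256          # lowest byte first
        value //= 256
        state = ((state ^ octet) * FNV_PRIME) % MODULUS
    return state


# Same input and seed give the same result; a different seed changes it.
print(fnv1a_64(42, seed=1) == fnv1a_64(42, seed=1))
print(fnv1a_64(42, seed=1) != fnv1a_64(42, seed=2))
```
</programlisting>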

  <para>
    <literal>permute</literal> accepts an input value, a size, and an optional
    seed parameter.  It generates a pseudorandom permutation of integers in
    the range <literal>[0, size)</literal>,

Title: Detailed Explanation of Gaussian and Zipfian Distributions in pgbench
Summary
This section provides a more in-depth explanation of the Gaussian and Zipfian distributions used in pgbench's random number generation. It includes the mathematical formulas for each distribution and explains how the parameter affects the distribution's shape. For Gaussian, it explains how values are clustered around the mean based on the parameter. For Zipfian, it details the relationship between the parameter and the probability of drawing values closer to the beginning of the interval and notes the limitations of the algorithm used. It also includes a note on designing benchmarks to avoid correlations with other data and introduces hash and permute functions.