Detailed Explanation of Gaussian and Zipfian Distributions in pgbench

the standard normal distribution, with mean <literal>mu</literal> defined as <literal>(max + min) / 2.0</literal>, with
<literallayout>
f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) / (2.0 * PHI(parameter) - 1)
</literallayout>
then value <replaceable>i</replaceable> between <replaceable>min</replaceable> and <replaceable>max</replaceable> inclusive is drawn with probability: <literal>f(i + 0.5) - f(i - 0.5)</literal>.
Intuitively, the larger the <replaceable>parameter</replaceable>, the more frequently values close to the middle of the interval are drawn, and the less frequently values close to the <replaceable>min</replaceable> and <replaceable>max</replaceable> bounds.
About 67% of values are drawn from the middle <literal>1.0 / parameter</literal>, that is a relative <literal>0.5 / parameter</literal> around the mean, and 95% in the middle <literal>2.0 / parameter</literal>, that is a relative <literal>1.0 / parameter</literal> around the mean; for instance, if <replaceable>parameter</replaceable> is 4.0, 67% of values are drawn from the middle quarter (<literal>1.0 / 4.0</literal>) of the interval (i.e., from <literal>3.0 / 8.0</literal> to <literal>5.0 / 8.0</literal>) and 95% from the middle half (<literal>2.0 / 4.0</literal>) of the interval (second and third quartiles).
The minimum allowed <replaceable>parameter</replaceable> value is 2.0.
</para>
</listitem>
<listitem>
<para>
<literal>random_zipfian</literal> generates a bounded Zipfian distribution.
<replaceable>parameter</replaceable> defines how skewed the distribution is.
The larger the <replaceable>parameter</replaceable>, the more frequently values closer to the beginning of the interval are drawn.
The distribution is such that, assuming the range starts from 1, the ratio of the probability of drawing <replaceable>k</replaceable> versus drawing <replaceable>k+1</replaceable> is <literal>((<replaceable>k</replaceable>+1)/<replaceable>k</replaceable>)**<replaceable>parameter</replaceable></literal>.
For example, <literal>random_zipfian(1, ..., 2.5)</literal> produces the value <literal>1</literal> about <literal>(2/1)**2.5 = 5.66</literal> times more frequently than <literal>2</literal>, which itself is produced <literal>(3/2)**2.5 = 2.76</literal> times more frequently than <literal>3</literal>, and so on.
</para>
<para>
<application>pgbench</application>'s implementation is based on "Non-Uniform Random Variate Generation", Luc Devroye, pp. 550-551, Springer 1986.
Due to limitations of that algorithm, the <replaceable>parameter</replaceable> value is restricted to the range [1.001, 1000].
</para>
</listitem>
</itemizedlist>
<note>
<para>
When designing a benchmark that selects rows non-uniformly, be aware that the rows chosen may be correlated with other data such as IDs from a sequence or the physical row ordering, which may skew performance measurements.
</para>
<para>
To avoid this, you may wish to use the <function>permute</function> function, or some other additional step with similar effect, to shuffle the selected rows and remove such correlations.
</para>
</note>
<para>
Hash functions <literal>hash</literal>, <literal>hash_murmur2</literal> and <literal>hash_fnv1a</literal> accept an input value and an optional seed parameter.
If the seed is not provided, the value of <literal>:default_seed</literal> is used, which is initialized randomly unless set by the command-line <literal>-D</literal> option.
</para>
<para>
<literal>permute</literal> accepts an input value, a size, and an optional seed parameter.
It generates a pseudorandom permutation of integers in the range <literal>[0, size)</literal>,

This section explains the Gaussian and Zipfian distributions used by pgbench's random number functions in more depth. For the Gaussian distribution, it gives the formula for the probability of drawing each value and shows how the parameter controls clustering around the mean. For the Zipfian distribution, it gives the probability ratio between consecutive values, shows how the parameter skews draws toward the beginning of the interval, and notes the parameter range imposed by the underlying algorithm. It closes with a note on avoiding correlations with other data when designing benchmarks and introduces the hash and permute functions.
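
The two probability rules described above can be checked numerically. The following Python sketch is an illustration only, not pgbench's code: the function names phi, gaussian_probabilities and zipfian_probabilities are hypothetical, PHI is assumed to be the cumulative distribution function of the standard normal distribution, and the Zipfian helper simply normalizes weights proportional to k ** -parameter (the distribution implied by the stated ratio) rather than reproducing the Devroye-based sampling algorithm.

```python
import math


def phi(x):
    """Cumulative distribution function of the standard normal (PHI in the text)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_probabilities(lo, hi, parameter):
    """Map each integer i in [lo, hi] to f(i + 0.5) - f(i - 0.5).

    Hypothetical helper: it evaluates the documented formula directly
    and does not draw random values the way pgbench does.
    """
    mu = (hi + lo) / 2.0
    norm = 2.0 * phi(parameter) - 1.0

    def f(x):
        return phi(2.0 * parameter * (x - mu) / (hi - lo + 1)) / norm

    return {i: f(i + 0.5) - f(i - 0.5) for i in range(lo, hi + 1)}


def zipfian_probabilities(n, parameter):
    """Map each k in [1, n] to a probability proportional to k ** -parameter.

    This is the distribution implied by the stated ratio
    P(k) / P(k + 1) = ((k + 1) / k) ** parameter; it is an assumption
    made for illustration, not the sampling algorithm itself.
    """
    weights = {k: float(k) ** -parameter for k in range(1, n + 1)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


if __name__ == "__main__":
    # Gaussian with parameter 4.0 over 1..1000: mass in the middle quarter
    # (values 376..625) and in the middle half (values 251..750).
    gauss = gaussian_probabilities(1, 1000, 4.0)
    middle_quarter = sum(p for i, p in gauss.items() if 376 <= i <= 625)
    middle_half = sum(p for i, p in gauss.items() if 251 <= i <= 750)
    print(f"middle quarter: {middle_quarter:.3f}, middle half: {middle_half:.3f}")

    # Zipfian with parameter 2.5: ratios between consecutive values.
    zipf = zipfian_probabilities(1000, 2.5)
    print(f"P(1)/P(2) = {zipf[1] / zipf[2]:.2f}, P(2)/P(3) = {zipf[2] / zipf[3]:.2f}")
```

Run as a script, the two Gaussian sums should land close to the 67% and 95% figures quoted above, and the two Zipfian ratios should match the (2/1)**2.5 and (3/2)**2.5 values from the example. The range 1..1000 and the cutoffs 376..625 and 251..750 are arbitrary choices made to line up with the "middle quarter" and "middle half" wording.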