Query Normalization and Hashing in pg_stat

appearing in the query. </para> <note> <para> The following details about constant replacement and <structfield>queryid</structfield> only apply when <xref linkend="guc-compute-query-id"/> is enabled. If you use an external module instead to compute <structfield>queryid</structfield>, you should refer to its documentation for details. </para> </note> <para> When a constant's value has been ignored for purposes of matching the query to other queries, the constant is replaced by a parameter symbol, such as <literal>$1</literal>, in the <structname>pg_stat_statements</structname> display. The rest of the query text is that of the first query that had the particular <structfield>queryid</structfield> hash value associated with the <structname>pg_stat_statements</structname> entry. </para> <para> Queries on which normalization can be applied may be observed with constant values in <structname>pg_stat_statements</structname>, especially when there is a high rate of entry deallocations. To reduce the likelihood of this happening, consider increasing <varname>pg_stat_statements.max</varname>. The <structname>pg_stat_statements_info</structname> view, discussed below in <xref linkend="pgstatstatements-pg-stat-statements-info"/>, provides statistics about entry deallocations. </para> <para> In some cases, queries with visibly different texts might get merged into a single <structname>pg_stat_statements</structname> entry; as explained above, this is expected to happen for semantically equivalent queries. In addition, if the only difference between queries is the number of elements in a list of constants, the list will get squashed down to a single element but shown with a commented-out list indicator: <screen> =# SELECT pg_stat_statements_reset(); =# SELECT * FROM test WHERE a IN (1, 2, 3, 4, 5, 6, 7); =# SELECT * FROM test WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8); =# SELECT query, calls FROM pg_stat_statements WHERE query LIKE 'SELECT%'; -[ RECORD 1 ]------------------------------ query | SELECT * FROM test WHERE a IN ($1 /*, ... */) calls | 2 </screen> In addition to these cases, there is a small chance of hash collisions causing unrelated queries to be merged into one entry. (This cannot happen for queries belonging to different users or databases, however.) </para> <para> Since the <structfield>queryid</structfield> hash value is computed on the post-parse-analysis representation of the queries, the opposite is also possible: queries with identical texts might appear as separate entries, if they have different meanings as a result of factors such as different <varname>search_path</varname> settings. </para> <para> Consumers of <structname>pg_stat_statements</structname> may wish to use <structfield>queryid</structfield> (perhaps in combination with <structfield>dbid</structfield> and <structfield>userid</structfield>) as a more stable and reliable identifier for each entry than its query text. However, it is important to understand that there are only limited guarantees around the stability of the <structfield>queryid</structfield> hash value. Since the identifier is derived from the post-parse-analysis tree, its value is a function of, among other things, the internal object identifiers appearing in this representation. This has some counterintuitive implications. For example, <filename>pg_stat_statements</filename> will consider two apparently-identical queries to be distinct, if they reference for example a function that was dropped and recreated between the executions of the two queries. Conversely, if a table is dropped and recreated between the executions of queries, two apparently-identical queries may be considered the same. However, if the alias for a table is different for otherwise-similar queries, these queries will be considered

The pg_stat_statements view normalizes and hashes queries to group semantically equivalent queries together, replacing constants with parameter symbols and considering factors like search path settings, but this process is not foolproof and can lead to unexpected results due to hash collisions, object identifier changes, and other factors, making queryid a more stable identifier than query text.