Details and Usage of Unicode Escape Syntax in PostgreSQL

<indexterm zone="sql-syntax-strings-uescape">
 <primary>Unicode escape</primary>
 <secondary>in string constants</secondary>
</indexterm>

<para>
 <productname>PostgreSQL</productname> also supports another type
 of escape syntax for strings that allows specifying arbitrary
 Unicode characters by code point. A Unicode escape string
 constant starts with <literal>U&</literal> (upper or lower case
 letter U followed by ampersand) immediately before the opening
 quote, without any spaces in between, for
 example <literal>U&'foo'</literal>. (Note that this creates an
 ambiguity with the operator <literal>&</literal>. Use spaces
 around the operator to avoid this problem.) Inside the quotes,
 Unicode characters can be specified in escaped form by writing a
 backslash followed by the four-digit hexadecimal code point
 number or alternatively a backslash followed by a plus sign
 followed by a six-digit hexadecimal code point number. For
 example, the string <literal>'data'</literal> could be written as
<programlisting>
U&'d\0061t\+000061'
</programlisting>
 The following less trivial example writes the Russian
 word <quote>slon</quote> (elephant) in Cyrillic letters:
<programlisting>
U&'\0441\043B\043E\043D'
</programlisting>
</para>

<para>
 If a different escape character than backslash is desired, it
 can be specified using
 the <literal>UESCAPE</literal><indexterm><primary>UESCAPE</primary></indexterm>
 clause after the string, for example:
<programlisting>
U&'d!0061t!+000061' UESCAPE '!'
</programlisting>
 The escape character can be any single character other than a
 hexadecimal digit, the plus sign, a single quote, a double quote,
 or a whitespace character.
</para>

<para>
 To include the escape character in the string literally, write
 it twice.
</para>

<para>
 Either the 4-digit or the 6-digit escape form can be used to
 specify UTF-16 surrogate pairs to compose characters with code
 points larger than U+FFFF, although the availability of the
 6-digit form technically makes this unnecessary.
 (Surrogate pairs are not stored directly, but are combined into a
 single code point.)
</para>

<para>
 If the server encoding is not UTF-8, the Unicode code point identified
 by one of these escape sequences is converted to the actual server
 encoding; an error is reported if that's not possible.
</para>

<para>
 Also, the Unicode escape syntax for string constants only works
 when the configuration
 parameter <xref linkend="guc-standard-conforming-strings"/> is
 turned on. This is because otherwise this syntax could confuse
 clients that parse the SQL statements to the point that it could
 lead to SQL injections and similar security issues. If the
 parameter is set to off, this syntax will be rejected with an
 error message.
</para>
</sect3>

<sect3 id="sql-syntax-dollar-quoting">
 <title>Dollar-Quoted String Constants</title>

 <indexterm>
  <primary>dollar quoting</primary>
 </indexterm>

 <para>
  While the standard syntax for specifying string constants is usually
  convenient, it can be difficult to understand when the desired string
  contains many single quotes, since each of those must be doubled.
  To allow more readable queries in such situations,
  <productname>PostgreSQL</productname> provides another way, called
  <quote>dollar quoting</quote>, to write string constants.
  A dollar-quoted string constant
  consists of a dollar sign (<literal>$</literal>), an optional
  <quote>tag</quote> of zero or more characters, another dollar
  sign, an arbitrary sequence of characters that makes up the
  string content, a dollar sign, the same tag that began this
  dollar quote, and a dollar sign. For example, here are

This section elaborates on the Unicode escape syntax in PostgreSQL: how to specify Unicode characters by four-digit or six-digit hexadecimal code point, and how to choose a custom escape character with the UESCAPE clause. It covers writing the escape character literally by doubling it, using surrogate pairs for characters beyond U+FFFF, and the conversion of code points to the server encoding when that encoding is not UTF-8. It emphasizes that the syntax is accepted only when standard_conforming_strings is turned on, a restriction that guards against SQL injection through clients that parse SQL statements. Finally, it introduces dollar-quoted string constants as an alternative to the standard syntax, useful when a string contains many single quotes.
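
The decoding rules described in this section can be illustrated outside the database. The following is a minimal Python sketch, not PostgreSQL's actual implementation: the function name decode_unicode_escapes and its parameters are hypothetical, chosen only for illustration. It assumes the input is the text between the quotes of a U&'...' constant, and it ignores SQL quote doubling, server-encoding conversion, and any further checks the server may perform.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def decode_unicode_escapes(body: str, escape: str = "\\") -> str:
    """Decode the text between the quotes of a U&'...' constant (sketch only)."""
    # The escape character may not be a hex digit, '+', a quote, or whitespace.
    if (len(escape) != 1 or escape in HEX_DIGITS
            or escape in "+'\"" or escape.isspace()):
        raise ValueError("invalid escape character")

    out = []
    pending_high = None  # first half of a surrogate pair, if one was just read
    i = 0
    while i < len(body):
        ch = body[i]

        # Ordinary character: copy it through unchanged.
        if ch != escape:
            if pending_high is not None:
                raise ValueError("invalid surrogate pair")
            out.append(ch)
            i += 1
            continue

        # Doubled escape character: stands for one literal escape character.
        if body[i + 1:i + 2] == escape:
            if pending_high is not None:
                raise ValueError("invalid surrogate pair")
            out.append(escape)
            i += 2
            continue

        # Six-digit form (escape, '+', six hex digits) or four-digit form.
        if body[i + 1:i + 2] == "+":
            digits, width, step = body[i + 2:i + 8], 6, 8
        else:
            digits, width, step = body[i + 1:i + 5], 4, 5
        if len(digits) != width or any(d not in HEX_DIGITS for d in digits):
            raise ValueError("invalid Unicode escape")
        code_point = int(digits, 16)
        i += step

        if 0xD800 <= code_point <= 0xDBFF:
            # High surrogate: remember it and wait for the low half.
            if pending_high is not None:
                raise ValueError("invalid surrogate pair")
            pending_high = code_point
        elif 0xDC00 <= code_point <= 0xDFFF:
            # Low surrogate: combine with the pending high half
            # into a single code point.
            if pending_high is None:
                raise ValueError("invalid surrogate pair")
            combined = (0x10000 + ((pending_high - 0xD800) << 10)
                        + (code_point - 0xDC00))
            out.append(chr(combined))
            pending_high = None
        else:
            if pending_high is not None:
                raise ValueError("invalid surrogate pair")
            out.append(chr(code_point))

    if pending_high is not None:
        raise ValueError("invalid surrogate pair")
    return "".join(out)


# The examples from this section, written as raw Python strings.
print(decode_unicode_escapes(r"d\0061t\+000061"))
print(decode_unicode_escapes(r"\0441\043B\043E\043D"))
print(decode_unicode_escapes("d!0061t!+000061", escape="!"))
```

Under these assumptions, the four-digit surrogate pair \D83D\DE00 and the six-digit escape \+01F600 decode to the same single code point, which is why the six-digit form makes explicit surrogate pairs unnecessary.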