Customizing Collations with ICU

See also <ulink url="https://www.unicode.org/reports/tr10">Unicode Technical Standard 10</ulink> for more information on the terminology.
</para>

<para>
 To create a nondeterministic collation, specify the property
 <literal>deterministic = false</literal> to
 <command>CREATE COLLATION</command>, for example:
<programlisting>
CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
</programlisting>
 This example would use the standard Unicode collation in a
 nondeterministic way. In particular, this would allow strings in
 different normal forms to be compared correctly. More interesting
 examples make use of the ICU customization facilities explained above.
 For example:
<programlisting>
CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
</programlisting>
</para>

<para>
 All standard and predefined collations are deterministic, and all
 user-defined collations are deterministic by default. While
 nondeterministic collations give a more <quote>correct</quote> behavior,
 especially when considering the full power of Unicode and its many
 special cases, they also have some drawbacks. Foremost, their use leads
 to a performance penalty. Note, in particular, that B-tree cannot use
 deduplication with indexes that use a nondeterministic collation. Also,
 certain operations are not possible with nondeterministic collations,
 such as some pattern matching operations. Therefore, they should be
 used only in cases where they are specifically wanted.
</para>

<tip>
 <para>
  To deal with text in different Unicode normalization forms, it is also
  possible to use the functions/expressions
  <function>normalize</function> and
  <literal>is normalized</literal> to preprocess or check the strings,
  instead of using nondeterministic collations. Each approach has
  different trade-offs.
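  As a minimal sketch of the preprocessing approach (the string literals
  are illustrative only, and a UTF-8 database encoding is assumed), the
  following compares a decomposed and a precomposed spelling of the same
  character under an ordinary deterministic collation:
<programlisting>
-- E'A\u030A' is 'A' followed by U+030A COMBINING RING ABOVE (decomposed form)
-- E'\u00C5' is the precomposed character U+00C5 (the NFC form)
SELECT normalize(E'A\u030A', NFC) = E'\u00C5'; -- true
SELECT E'A\u030A' IS NFC NORMALIZED; -- false
</programlisting>
  Normalizing values once, for instance when they are stored, keeps later
  comparisons deterministic, whereas a nondeterministic collation applies
  the equivalence at comparison time.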
 </para>
</tip>
</sect3>
</sect2>

<sect2 id="icu-custom-collations">
 <title>ICU Custom Collations</title>

 <para>
  ICU allows extensive control over collation behavior by defining new
  collations with collation settings as a part of the language tag. These
  settings can modify the collation order to suit a variety of needs. For
  instance:
<programlisting>
-- ignore differences in accents and case
CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true

-- upper case letters sort before lower case.
CREATE COLLATION upper_first (provider = icu, locale = 'und-u-kf-upper');
SELECT 'B' &lt; 'b' COLLATE upper_first; -- true

-- treat digits numerically and ignore punctuation
CREATE COLLATION num_ignore_punct (provider = icu, deterministic = false, locale = 'und-u-ka-shifted-kn');
SELECT 'id-45' &lt; 'id-123' COLLATE num_ignore_punct; -- true
SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
</programlisting>
  Many of the available options are described in
  <xref linkend="icu-collation-settings"/>, or see
  <xref linkend="icu-external-references"/> for more details.
 </para>

 <sect3 id="icu-collation-comparison-levels">
  <title>ICU Comparison Levels</title>

  <para>
   Comparison of two strings (collation) in ICU is determined by a
   multi-level process, where textual features are grouped into
   "levels". Treatment of each level is controlled by the
   <link linkend="icu-collation-settings-table">collation settings</link>.
   Higher levels correspond to finer textual features.
  </para>

  <para>
   <xref linkend="icu-collation-levels"/> shows which textual feature
   differences are considered significant when

This section explains how to create custom collations with ICU. It covers how to specify collation settings as part of the language tag and how those settings modify the collation order, for example by ignoring accents and case, sorting upper case letters before lower case, or treating digits numerically while ignoring punctuation.