17th chunk of `doc/src/sgml/charset.sgml`
 See also <ulink url="https://www.unicode.org/reports/tr10">Unicode Technical
     Standard 10</ulink> for more information on the terminology.
    </para>

    <para>
     To create a nondeterministic collation, specify the property
     <literal>deterministic = false</literal> to <command>CREATE
     COLLATION</command>, for example:
<programlisting>
CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
</programlisting>
     This example would use the standard Unicode collation in a
     nondeterministic way.  In particular, this would allow strings in
     different normal forms to be compared correctly.  More interesting
     examples make use of the ICU customization facilities explained above.
     For example:
<programlisting>
CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
</programlisting>
    </para>
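
    <para>
     As a rough illustration of the collations defined above, the following
     queries compare strings that differ only in normalization form or in
     case.  This is a sketch that assumes a server built with ICU support
     and a database using the <literal>UTF8</literal> encoding; the Unicode
     escape literals are used only to make the two forms of the character
     explicit:
<programlisting>
-- U+00E9 (precomposed) versus "e" followed by U+0301 (combining acute accent)
SELECT U&amp;'\00E9' = U&amp;'e\0301' COLLATE ndcoll; -- true

-- the same comparison under a deterministic collation compares bytewise
SELECT U&amp;'\00E9' = U&amp;'e\0301' COLLATE "C"; -- false

-- case differences are ignored by the case_insensitive collation
SELECT 'abc' = 'ABC' COLLATE case_insensitive; -- true
</programlisting>
    </para>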

    <para>
     All standard and predefined collations are deterministic, and all
     user-defined collations are deterministic by default.  While
     nondeterministic collations give a more <quote>correct</quote> behavior,
     especially when considering the full power of Unicode and its many
     special cases, they also have some drawbacks.  Foremost, their use leads
     to a performance penalty.  Note, in particular, that B-tree cannot use
     deduplication with indexes that use a nondeterministic collation.  Also,
     certain operations are not possible with nondeterministic collations,
     such as some pattern matching operations.  Therefore, they should be used
     only in cases where they are specifically wanted.
    </para>
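
    <para>
     One way to keep the cost contained is to attach a nondeterministic
     collation only to the columns that need it, rather than making it the
     default.  The following sketch uses a hypothetical table
     <literal>accounts</literal> and the <literal>case_insensitive</literal>
     collation created earlier; it assumes that case-insensitive uniqueness
     is the behavior that is specifically wanted:
<programlisting>
-- hypothetical table; only the email column uses the nondeterministic collation
CREATE TABLE accounts (
    id    integer PRIMARY KEY,
    email text COLLATE case_insensitive UNIQUE
);

INSERT INTO accounts VALUES (1, 'Alice@example.com');
-- rejected with a unique violation, because the two values compare as equal
INSERT INTO accounts VALUES (2, 'alice@example.com');
</programlisting>
    </para>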

    <tip>
     <para>
      To deal with text in different Unicode normalization forms, you can
      also use the functions/expressions
      <function>normalize</function> and <literal>is normalized</literal> to
      preprocess or check the strings, instead of using nondeterministic
      collations.  Each approach has different trade-offs.
     </para>
    </tip>
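
    <para>
     A minimal sketch of the normalization approach mentioned in the tip,
     assuming a database using the <literal>UTF8</literal> encoding (the
     Unicode escape literals are chosen here only to make the decomposed
     and precomposed forms visible):
<programlisting>
-- check whether a decomposed string is already in NFC
SELECT U&amp;'e\0301' IS NFC NORMALIZED; -- false

-- normalize to NFC before comparing under an ordinary deterministic collation
SELECT normalize(U&amp;'e\0301', NFC) = U&amp;'\00E9'; -- true
</programlisting>
     Normalizing on input keeps comparisons and indexes deterministic, at
     the cost of having to apply the normalization consistently wherever
     the strings are written or compared.
    </para>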
   </sect3>
  </sect2>

  <sect2 id="icu-custom-collations">
   <title>ICU Custom Collations</title>

   <para>
    ICU allows extensive control over collation behavior by defining new
    collations with collation settings as a part of the language tag. These
    settings can modify the collation order to suit a variety of needs. For
    instance:

<programlisting>
-- ignore differences in accents and case
CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true

-- upper case letters sort before lower case.
CREATE COLLATION upper_first (provider = icu, locale = 'und-u-kf-upper');
SELECT 'B' &lt; 'b' COLLATE upper_first; -- true

-- treat digits numerically and ignore punctuation
CREATE COLLATION num_ignore_punct (provider = icu, deterministic = false, locale = 'und-u-ka-shifted-kn');
SELECT 'id-45' &lt; 'id-123' COLLATE num_ignore_punct; -- true
SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
</programlisting>

    Many of the available options are described in <xref
    linkend="icu-collation-settings"/>, or see <xref
    linkend="icu-external-references"/> for more details.
   </para>
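
   <para>
    The same collations can also be applied in <literal>ORDER BY</literal>.
    As a small sketch, reusing the <literal>num_ignore_punct</literal>
    collation defined above on an inline list of illustrative values:
<programlisting>
-- digits are compared by numeric value, so this should return
-- id-7, id-45, id-123 rather than the plain character-by-character order
SELECT x
FROM (VALUES ('id-123'), ('id-45'), ('id-7')) AS t(x)
ORDER BY x COLLATE num_ignore_punct;
</programlisting>
   </para>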

   <sect3 id="icu-collation-comparison-levels">
    <title>ICU Comparison Levels</title>

    <para>
     Comparison of two strings (collation) in ICU is determined by a
     multi-level process, where textual features are grouped into
     <quote>levels</quote>. Treatment of each level is controlled by the <link
     linkend="icu-collation-settings-table">collation settings</link>. Higher
     levels correspond to finer textual features.
    </para>
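
    <para>
     To make the effect of the levels concrete, the following sketch defines
     two collations that differ only in the <literal>ks</literal> (strength)
     setting.  The names <literal>level1_demo</literal> and
     <literal>level2_demo</literal> are made up for this illustration, and
     the example assumes an ICU-enabled server with a <literal>UTF8</literal>
     database:
<programlisting>
CREATE COLLATION level1_demo (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
CREATE COLLATION level2_demo (provider = icu, deterministic = false, locale = 'und-u-ks-level2');

-- level 1 compares base characters only, so the accent is ignored
SELECT 'a' = 'á' COLLATE level1_demo; -- true

-- level 2 also considers accents, but still ignores case
SELECT 'a' = 'á' COLLATE level2_demo; -- false
SELECT 'a' = 'A' COLLATE level2_demo; -- true
</programlisting>
    </para>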

    <para>
     <xref linkend="icu-collation-levels"/> shows which textual feature
     differences are considered significant when

Title: Customizing Collations with ICU
Summary
This section explains how to create custom collations using ICU, including how to define collation settings as part of the language tag, and how to use various options to modify the collation order, such as ignoring accents and case, sorting upper case letters before lower case, and treating digits numerically while ignoring punctuation.