<!-- doc/src/sgml/unaccent.sgml -->

<sect1 id="unaccent" xreflabel="unaccent">
 <title>unaccent &mdash; a text search dictionary which removes diacritics</title>

 <indexterm zone="unaccent">
  <primary>unaccent</primary>
 </indexterm>

 <para>
  <filename>unaccent</filename> is a text search dictionary that removes accents
  (diacritic signs) from lexemes.
  It is a filtering dictionary: its output is always passed on to the next
  dictionary (if any), whereas a normal dictionary ends processing once it
  recognizes a token.  This allows accent-insensitive processing
  for full text search.
 </para>

 <para>
  The current implementation of <filename>unaccent</filename> cannot be used as a
  normalizing dictionary for the <filename>thesaurus</filename> dictionary.
 </para>

 <para>
  This module is considered <quote>trusted</quote>, that is, it can be
  installed by non-superusers who have <literal>CREATE</literal> privilege
  on the current database.
 </para>
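
 <para>
  As an illustration, assuming the module's files are present on the server
  and the default <literal>unaccent</literal> dictionary and rules are in
  use, a user with that privilege could install the module and test the
  dictionary directly:
<programlisting>
CREATE EXTENSION unaccent;

SELECT ts_lexize('unaccent', 'H&ocirc;tel');
 ts_lexize
-----------
 {Hotel}
(1 row)
</programlisting>
 </para>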

 <sect2 id="unaccent-configuration">
  <title>Configuration</title>

  <para>
   An <literal>unaccent</literal> dictionary accepts the following options:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     <literal>RULES</literal> is the base name of the file containing the list of
     translation rules.  This file must be stored in
     <filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
     the <productname>PostgreSQL</productname> installation's shared-data directory).
     Its name must end in <literal>.rules</literal> (which is not to be included in
     the <literal>RULES</literal> parameter).
    </para>
   </listitem>
  </itemizedlist>
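  <para>
   A minimal sketch of setting this option, assuming a hypothetical rules file
   <filename>$SHAREDIR/tsearch_data/my_rules.rules</filename> already exists
   (the names <literal>my_rules</literal> and <literal>my_unaccent</literal>
   are illustrative only):
<programlisting>
-- Create a separate dictionary from the unaccent template
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'   -- base name only, without the .rules suffix
);

-- Or point the dictionary installed by the extension at the same file
ALTER TEXT SEARCH DICTIONARY unaccent (RULES = 'my_rules');
</programlisting>
  </para>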
  <para>
   The rules file has the following format:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     Each line represents one translation rule, consisting of a character with
     accent followed by a character without accent.  The first is translated
     into the second.  For example,
<programlisting>
&Agrave;        A
&Aacute;        A
&Acirc;        A
&Atilde;        A
&Auml;        A
&Aring;        A
&AElig;        AE
</programlisting>
     The two characters must be separated by whitespace, and any leading or
     trailing whitespace on a line is ignored.
    </para>
   </listitem>

   <listitem>
    <para>
     Alternatively, if only one character is given on a line, instances of
     that character are deleted; this is useful in languages where accents
     are represented by separate characters.
    </para>
   </listitem>

   <listitem>
    <para>
     Actually, each <quote>character</quote> can be any string not containing
     whitespace, so <filename>unaccent</filename> dictionaries could be used for
     other sorts of substring substitutions besides diacritic removal.
    </para>
   </listitem>

   <listitem>
    <para>
     Some characters, such as numeric symbols, may require whitespace in their
     translation.  In this case, the translated string can be enclosed in
     double quotes.  To include a double quote within a quoted translation,
     write it as two double quotes.  For example:
<programlisting>
&frac14;      " 1/4"
&frac12;      " 1/2"
&frac34;      " 3/4"
&ldquo;       """"
&rdquo;       """"
</programlisting>
    </para>
   </listitem>

   <listitem>
    <para>
     As with other <productname>PostgreSQL</productname> text search configuration files,
     the rules file must be stored in UTF-8 encoding.  The data is
     automatically translated
