<!-- doc/src/sgml/unaccent.sgml -->
<sect1 id="unaccent" xreflabel="unaccent">
<title>unaccent — a text search dictionary which removes diacritics</title>
<indexterm zone="unaccent">
<primary>unaccent</primary>
</indexterm>
<para>
<filename>unaccent</filename> is a text search dictionary that removes accents
(diacritic signs) from lexemes.
It is a filtering dictionary: instead of ending the lookup when it
recognizes a token, as a normal dictionary does, it always passes its
output to the next dictionary (if any). This allows accent-insensitive
processing for full text search.
</para>
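<para>
For illustration, a filtering dictionary is typically placed ahead of a
stemmer in a text search configuration. The following is a minimal sketch,
assuming the <filename>unaccent</filename> extension is already installed in
the database; the configuration name <literal>fr</literal> is hypothetical:
<programlisting>
-- Copy the built-in French configuration under a new, illustrative name.
CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );

-- Run unaccent first, then hand its output to the French stemmer.
ALTER TEXT SEARCH CONFIGURATION fr
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;

SELECT to_tsvector('fr', 'Hôtels de la Mer');
</programlisting>
Because <filename>unaccent</filename> passes its output on,
<literal>french_stem</literal> receives <literal>Hotels</literal> rather than
<literal>Hôtels</literal>.
</para>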
<para>
The current implementation of <filename>unaccent</filename> cannot be used as a
normalizing dictionary for the <filename>thesaurus</filename> dictionary.
</para>
<para>
This module is considered <quote>trusted</quote>, that is, it can be
installed by non-superusers who have <literal>CREATE</literal> privilege
on the current database.
</para>
<sect2 id="unaccent-configuration">
<title>Configuration</title>
<para>
An <literal>unaccent</literal> dictionary accepts the following options:
</para>
<itemizedlist>
<listitem>
<para>
<literal>RULES</literal> is the base name of the file containing the list of
translation rules. This file must be stored in
<filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
the <productname>PostgreSQL</productname> installation's shared-data directory).
The file name must end in <literal>.rules</literal>; omit that extension when
specifying the <literal>RULES</literal> parameter.
</para>
</listitem>
</itemizedlist>
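<para>
As a sketch of how this option might be set, the following statement points
an existing dictionary at a custom rules file. It assumes the extension has
been installed, so that a dictionary named <literal>unaccent</literal> exists,
and that a hypothetical file <filename>my_rules.rules</filename> has been
placed in <filename>$SHAREDIR/tsearch_data/</filename>:
<programlisting>
-- 'my_rules' is an illustrative name; note that the .rules extension is omitted.
ALTER TEXT SEARCH DICTIONARY unaccent (RULES = 'my_rules');
</programlisting>
</para>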
<para>
The rules file has the following format:
</para>
<itemizedlist>
<listitem>
<para>
Each line represents one translation rule, consisting of a character with
accent followed by a character without accent. The first is translated
into the second. For example,
<programlisting>
À A
Á A
Â A
Ã A
Ä A
Å A
Æ AE
</programlisting>
The two characters must be separated by whitespace, and any leading or
trailing whitespace on a line is ignored.
</para>
</listitem>
<listitem>
<para>
Alternatively, if only one character is given on a line, instances of
that character are deleted; this is useful in languages where accents
are represented by separate characters.
</para>
</listitem>
<listitem>
<para>
Actually, each <quote>character</quote> can be any string not containing
whitespace, so <filename>unaccent</filename> dictionaries could be used for
other sorts of substring substitutions besides diacritic removal.
</para>
</listitem>
<listitem>
<para>
Some characters, like numeric symbols, may require whitespace in their
translation. In this case, enclose the translated string in double quotes.
To include a double quote in the translated string, escape it by writing
a second double quote. For example:
<programlisting>
¼ " 1/4"
½ " 1/2"
¾ " 3/4"
“ """"
” """"
</programlisting>
</para>
</listitem>
<listitem>
<para>
As with other <productname>PostgreSQL</productname> text search configuration files,
the rules file must be stored in UTF-8 encoding. The data is
automatically translated