Unaccent: A Text Search Dictionary for Diacritic Removal

<sect1 id="unaccent" xreflabel="unaccent">
 <title>unaccent — a text search dictionary which removes diacritics</title>

 <indexterm zone="unaccent">
  <primary>unaccent</primary>
 </indexterm>

 <para>
  <filename>unaccent</filename> is a text search dictionary that removes accents
  (diacritic signs) from lexemes.
  It's a filtering dictionary, which means its output is
  always passed to the next dictionary (if any), unlike the normal
  behavior of dictionaries. This allows accent-insensitive processing
  for full text search.
 </para>

 <para>
  The current implementation of <filename>unaccent</filename>
  cannot be used as a normalizing dictionary for the
  <filename>thesaurus</filename> dictionary.
 </para>

 <para>
  This module is considered <quote>trusted</quote>, that is, it can be
  installed by non-superusers who have <literal>CREATE</literal> privilege
  on the current database.
 </para>

 <sect2 id="unaccent-configuration">
  <title>Configuration</title>

  <para>
   An <literal>unaccent</literal> dictionary accepts the following options:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     <literal>RULES</literal> is the base name of the file containing the list of
     translation rules. This file must be stored in
     <filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
     the <productname>PostgreSQL</productname> installation's shared-data directory).
     Its name must end in <literal>.rules</literal> (which is not to be included in
     the <literal>RULES</literal> parameter).
    </para>
   </listitem>
  </itemizedlist>
  <para>
   The rules file has the following format:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     Each line represents one translation rule, consisting of a character with
     accent followed by a character without accent. The first is translated
     into the second. For example,
<programlisting>
À        A
Á        A
Â        A
Ã        A
Ä        A
Å        A
Æ        AE
</programlisting>
     The two characters must be separated by whitespace, and any leading or
     trailing whitespace on a line is ignored.
    </para>
   </listitem>
   <listitem>
    <para>
     Alternatively, if only one character is given on a line, instances
     of that character are deleted; this is useful in languages where
     accents are represented by separate characters.
    </para>
   </listitem>
   <listitem>
    <para>
     Actually, each <quote>character</quote> can be any string not
     containing whitespace, so <filename>unaccent</filename> dictionaries
     could be used for other sorts of substring substitutions besides
     diacritic removal.
    </para>
   </listitem>
   <listitem>
    <para>
     Some characters, such as numeric symbols, may require whitespace in
     their translation. In this case, enclose the translated characters in
     double quotes. To include a double quote in the translated characters,
     escape it with a second double quote. For example:
<programlisting>
¼      " 1/4"
½      " 1/2"
¾      " 3/4"
“      """"
”      """"
</programlisting>
    </para>
   </listitem>
   <listitem>
    <para>
     As with other <productname>PostgreSQL</productname> text search configuration files,
     the rules file must be stored in UTF-8 encoding. The data is automatically translated

The unaccent dictionary is a filtering dictionary that removes diacritics from lexemes, allowing for accent-insensitive full text search, and can be configured using a rules file with translation rules for character substitutions.
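
The rules-file format described above can be illustrated with a small Python sketch. This is not the module's actual parser; it only mirrors the documented line format (a source string, whitespace, an optional target that may be double-quoted, with a doubled quote standing for one literal quote). The function names parse_rules and apply_rules are hypothetical, and trying longer source strings first is an assumption of this sketch rather than documented behavior.

```python
import re


def parse_rules(text):
    """Parse rules-file text into a {source: replacement} dict (illustrative only)."""
    rules = {}
    for raw in text.splitlines():
        line = raw.strip()  # leading and trailing whitespace is ignored
        if not line:
            continue
        parts = line.split(None, 1)  # source string, then the rest of the line
        src = parts[0]
        if len(parts) == 1:
            rules[src] = ""  # a lone string means "delete every occurrence"
            continue
        target = parts[1]
        if len(target) >= 2 and target.startswith('"') and target.endswith('"'):
            # Quoted target: may contain whitespace; "" stands for one quote.
            target = target[1:-1].replace('""', '"')
        rules[src] = target
    return rules


def apply_rules(rules, word):
    """Replace every occurrence of a source string in word."""
    if not rules:
        return word
    # Assumption of this sketch: try longer source strings first.
    sources = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in sources))
    return pattern.sub(lambda m: rules[m.group(0)], word)


sample = "\n".join([
    "À        A",
    "Æ        AE",
    '¼      " 1/4"',
    '“      """"',
    "\u0301",  # combining acute accent alone on a line: deleted
])
rules = parse_rules(sample)
print(apply_rules(rules, "ÀÆ"))  # AAE
```

With the sample rules, the quoted target for ¼ keeps its leading space, the doubled quotes for “ collapse to a single double quote, and the combining accent listed alone on its line is removed from any input that contains it.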