<para>
A thesaurus dictionary replaces each non-preferred term with a single
preferred term and can optionally preserve the original terms for indexing
as well. <productname>PostgreSQL</productname>'s current implementation of the
thesaurus dictionary is an extension of the synonym dictionary with added
<firstterm>phrase</firstterm> support. A thesaurus dictionary requires
a configuration file of the following format:
<programlisting>
# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...
</programlisting>
where the colon (<symbol>:</symbol>) symbol acts as a delimiter between a
phrase and its replacement.
</para>
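<para>
As an illustrative sketch of this format, a small thesaurus file might
look like the following. The entries are invented for this example, and
whether they load without error depends on the subdictionary recognizing
every sample word:
<programlisting>
# astronomy abbreviations (hypothetical entries)
supernovae stars : sn
crab nebulae : crab
</programlisting>
Each line maps the phrase to the left of the colon to the replacement on
its right.
</para>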
<para>
A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which
is specified in the dictionary's configuration) to normalize the input
text before checking for phrase matches. Only one subdictionary can be
selected. An error is reported if the subdictionary fails to
recognize a word; in that case, either remove the word from the thesaurus or
teach the subdictionary about it. You can place an asterisk
(<symbol>*</symbol>) at the beginning of an indexed word to skip applying
the subdictionary to it, but all sample words <emphasis>must</emphasis> be known
to the subdictionary.
</para>
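<para>
For instance, the following hypothetical entry uses the asterisk so that
the replacement <literal>sn</literal> is not passed through the
subdictionary, while the sample words on the left are still normalized by
it:
<programlisting>
# the leading * keeps the indexed word out of the subdictionary
supernovae stars : *sn
</programlisting>
</para>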
<para>
The thesaurus dictionary chooses the longest match if there are multiple
phrases matching the input, and ties are broken by using the last
definition.
</para>
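<para>
A minimal sketch of the longest-match rule, using invented entries and
assuming the subdictionary does not treat any of these words as stop
words:
<programlisting>
one two : 12
one two three : 123
</programlisting>
For the input <literal>one two three</literal>, both entries match at the
first word, and the second entry would be chosen because it is the longer
match. If the same sample phrase were defined twice, the later definition
would apply.
</para>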
<para>
Specific stop words recognized by the subdictionary cannot be
specified; instead use <literal>?</literal> to mark the location where any
stop word can appear. For example, assuming that <literal>a</literal> and
<literal>the</literal> are stop words according to the subdictionary:
<programlisting>
? one ? two : swsw
</programlisting>
matches <literal>a one the two</literal> and <literal>the one a two</literal>;
both would be replaced by <literal>swsw</literal>.
</para>
<para>
Since a thesaurus dictionary can recognize phrases, it must remember its
state and interact with the parser. It uses the token-type assignments in
the text search configuration to decide whether it should handle the next
word or stop accumulation, so the thesaurus dictionary must be configured
carefully. For example, if the thesaurus dictionary is assigned to handle
only the <literal>asciiword</literal> token, then a thesaurus dictionary
definition like <literal>one 7</literal> will not work since token type
<literal>uint</literal> is not assigned to the thesaurus dictionary.
</para>
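<para>
One way to avoid this problem is to map every token type that appears in
the thesaurus phrases to the thesaurus dictionary. The following is a
sketch only: <literal>my_config</literal> is a hypothetical configuration
name, and it assumes that a thesaurus dictionary named
<literal>thesaurus_simple</literal> has already been created:
<programlisting>
-- my_config is a hypothetical configuration copied from the built-in one
CREATE TEXT SEARCH CONFIGURATION my_config (COPY = pg_catalog.english);

-- send both word and unsigned-integer tokens to the thesaurus first,
-- falling back to english_stem for tokens the thesaurus does not match
ALTER TEXT SEARCH CONFIGURATION my_config
ALTER MAPPING FOR asciiword, uint
WITH thesaurus_simple, english_stem;
</programlisting>
With such a mapping, a definition like <literal>one 7</literal> can see
both of its tokens.
</para>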
<caution>
<para>
Thesauruses are used during indexing, so any change in the thesaurus
dictionary's parameters <emphasis>requires</emphasis> reindexing.
For most other dictionary types, small changes such as adding or
removing stop words do not force reindexing.
</para>
</caution>
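<para>
As a rough sketch of what this can involve after the thesaurus file is
edited, assuming a dictionary named <literal>thesaurus_simple</literal>
whose file is <literal>mythesaurus</literal> and a hypothetical index
<literal>docs_fts_idx</literal> built on a
<function>to_tsvector</function> expression:
<programlisting>
-- a "dummy" update makes existing sessions reload the dictionary file
ALTER TEXT SEARCH DICTIONARY thesaurus_simple (DictFile = mythesaurus);

-- rebuild the (hypothetical) expression index with the new thesaurus
REINDEX INDEX docs_fts_idx;
</programlisting>
If the <type>tsvector</type> values are stored in a table column rather
than computed by the index expression, that column would need to be
recomputed as well.
</para>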
<sect3 id="textsearch-thesaurus-config">
<title>Thesaurus Configuration</title>
<para>
To define a new thesaurus dictionary, use the <literal>thesaurus</literal>
template. For example:
<programlisting>
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
TEMPLATE = thesaurus,
DictFile = mythesaurus,
Dictionary = pg_catalog.english_stem
);
</programlisting>
Here:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<literal>thesaurus_simple</literal> is the new dictionary's name.
</para>
</listitem>
<listitem>
<para>
<literal>mythesaurus</literal> is the base name of the thesaurus
configuration file.
(Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</filename>,
where <literal>$SHAREDIR</literal> means the installation shared-data
directory.)
</para>
</listitem>
<listitem>
<para>
<literal>pg_catalog.english_stem</literal>