Thesaurus Dictionary Details and Configuration

<para>
 A thesaurus dictionary replaces all non-preferred terms with one
 preferred term and, optionally, preserves the original terms for indexing
 as well. <productname>PostgreSQL</productname>'s current implementation of
 the thesaurus dictionary is an extension of the synonym dictionary with
 added <firstterm>phrase</firstterm> support. A thesaurus dictionary
 requires a configuration file of the following format:
<programlisting>
# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...
</programlisting>
 where the colon (<symbol>:</symbol>) symbol acts as a delimiter between a
 phrase and its replacement.
</para>

<para>
 A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which
 is specified in the dictionary's configuration) to normalize the input
 text before checking for phrase matches. Only one subdictionary can be
 selected. An error is reported if the subdictionary fails to recognize a
 word. In that case, you should remove the use of the word or teach the
 subdictionary about it. You can place an asterisk
 (<symbol>*</symbol>) at the beginning of an indexed word to skip applying
 the subdictionary to it, but all sample words <emphasis>must</emphasis> be
 known to the subdictionary.
</para>

<para>
 The thesaurus dictionary chooses the longest match if there are multiple
 phrases matching the input, and ties are broken by using the last
 definition.
</para>

<para>
 Specific stop words recognized by the subdictionary cannot be specified;
 instead use <literal>?</literal> to mark the location where any stop word
 can appear. For example, assuming that <literal>a</literal> and
 <literal>the</literal> are stop words according to the subdictionary:
<programlisting>
? one ? two : swsw
</programlisting>
 matches <literal>a one the two</literal> and <literal>the one a two</literal>;
 both would be replaced by <literal>swsw</literal>.
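 As a further sketch, assuming (hypothetically) that <literal>of</literal>
 is also a stop word for the subdictionary and that <literal>sol</literal>
 is the desired preferred term, the <literal>?</literal> placeholder can be
 combined with the asterisk prefix described earlier:
<programlisting>
# hypothetical entry, for illustration only
statue ? liberty : *sol
</programlisting>
 Under those assumptions, this entry would match
 <literal>statue of liberty</literal>, and <literal>sol</literal> would be
 emitted without being passed through the subdictionary.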
</para>

<para>
 Since a thesaurus dictionary has the capability to recognize phrases, it
 must remember its state and interact with the parser. A thesaurus
 dictionary uses the token-type assignments of the text search
 configuration to check whether it should handle the next word or stop
 accumulation. The thesaurus dictionary must therefore be configured
 carefully. For example, if the thesaurus dictionary is assigned to handle
 only the <literal>asciiword</literal> token, then a thesaurus dictionary
 definition like <literal>one 7</literal> will not work, since token type
 <literal>uint</literal> is not assigned to the thesaurus dictionary.
</para>

<caution>
 <para>
  Thesauruses are used during indexing, so any change in the thesaurus
  dictionary's parameters <emphasis>requires</emphasis> reindexing.
  For most other dictionary types, small changes such as adding or
  removing stop words do not force reindexing.
 </para>
</caution>

<sect3 id="textsearch-thesaurus-config">
 <title>Thesaurus Configuration</title>

 <para>
  To define a new thesaurus dictionary, use the <literal>thesaurus</literal>
  template. For example:
<programlisting>
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);
</programlisting>
  Here:
  <itemizedlist spacing="compact" mark="bullet">
   <listitem>
    <para>
     <literal>thesaurus_simple</literal> is the new dictionary's name
    </para>
   </listitem>
   <listitem>
    <para>
     <literal>mythesaurus</literal> is the base name of the thesaurus
     configuration file.
     (Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</filename>,
     where <literal>$SHAREDIR</literal> means the installation shared-data
     directory.)
    </para>
   </listitem>
   <listitem>
    <para>
     <literal>pg_catalog.english_stem</literal>

The thesaurus dictionary replaces non-preferred terms with preferred ones, extending the synonym dictionary with phrase support. It uses a single subdictionary to normalize input and reports an error if the subdictionary does not recognize a sample word. An asterisk at the beginning of an indexed word skips subdictionary processing for that word. When several phrases match, the longest match wins, and ties are broken by using the last definition. Specific stop words cannot be named in a sample phrase; instead, '?' marks a position where any stop word can appear. Because the dictionary remembers its state and interacts with the parser, it must be assigned to every token type that its phrases contain. Changes to the thesaurus dictionary's parameters require reindexing. A thesaurus dictionary is created from the 'thesaurus' template by specifying the base name of the configuration file and the subdictionary.
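
The token-type requirement can be sketched in SQL. This is a minimal illustration rather than a recommended setup: the configuration name my_config is hypothetical, and the sketch assumes that the thesaurus_simple dictionary from the example has already been created and that its thesaurus file loads without errors.
<programlisting>
-- Hypothetical configuration, copied from the built-in english configuration
CREATE TEXT SEARCH CONFIGURATION my_config ( COPY = pg_catalog.english );

-- Assign both ASCII words and unsigned integers to the thesaurus, so that
-- an entry such as "one 7" is offered both of its tokens; english_stem
-- handles whatever the thesaurus does not recognize
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, uint
    WITH thesaurus_simple, pg_catalog.english_stem;
</programlisting>
Listing the thesaurus first gives it the chance to recognize a phrase before the fallback dictionary consumes the token.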