35th chunk of `doc/src/sgml/textsearch.sgml`
 <para>
    Basically, a thesaurus dictionary replaces all non-preferred terms with one
    preferred term and, optionally, preserves the original terms for indexing
    as well.  <productname>PostgreSQL</productname>'s current implementation of the
    thesaurus dictionary is an extension of the synonym dictionary with added
    <firstterm>phrase</firstterm> support.  A thesaurus dictionary requires
    a configuration file of the following format:

<programlisting>
# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...
</programlisting>

    where the colon (<symbol>:</symbol>) acts as a delimiter between a
    phrase and its replacement.
   </para>

   <para>
    A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which
    is specified in the dictionary's configuration) to normalize the input
    text before checking for phrase matches. Only one subdictionary can be
    selected.  An error is reported if the subdictionary fails to
    recognize a word. In that case, either remove the word from the
    thesaurus or teach the subdictionary about it.  You can place an asterisk
    (<symbol>*</symbol>) at the beginning of an indexed word to skip applying
    the subdictionary to it, but all sample words <emphasis>must</emphasis> be known
    to the subdictionary.
   </para>
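
   <para>
    As a minimal sketch of the asterisk syntax, consider the following
    illustrative entries (the words themselves are only examples):

<programlisting>
# sn is stored as written; the subdictionary is not applied to it
supernovae stars : *sn
# these indexed words are normalized by the subdictionary
booking tickets : order invitation cards
</programlisting>

    In the first entry, the sample words <literal>supernovae</literal> and
    <literal>stars</literal> still have to be recognized by the subdictionary;
    only the indexed word <literal>sn</literal> bypasses it.
   </para>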

   <para>
    The thesaurus dictionary chooses the longest match if there are multiple
    phrases matching the input, and ties are broken by using the last
    definition.
   </para>
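
   <para>
    To illustrate the rule, assume a thesaurus file containing the following
    hypothetical entries:

<programlisting>
one two : *12
one two three : *123
# two definitions with the same sample phrase tie on length
four five : *first
four five : *second
</programlisting>

    Given the input <literal>one two three</literal>, both of the first two
    phrases match, and the three-word phrase would be chosen because it is the
    longer one.  For <literal>four five</literal>, the two definitions tie, so
    the later one (<literal>second</literal>) would be expected to apply.
   </para>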

   <para>
    Stop words recognized by the subdictionary cannot be named individually
    in a sample phrase; instead, use <literal>?</literal> to mark the location where any
    stop word can appear.  For example, assuming that <literal>a</literal> and
    <literal>the</literal> are stop words according to the subdictionary:

<programlisting>
? one ? two : swsw
</programlisting>

    matches <literal>a one the two</literal> and <literal>the one a two</literal>;
    both would be replaced by <literal>swsw</literal>.
   </para>

   <para>
    Since a thesaurus dictionary can recognize phrases, it must remember
    its state and interact with the parser. It uses the token-type
    assignments in the text search configuration to check whether it should
    handle the next word or stop accumulating a phrase.  The thesaurus
    dictionary must therefore be configured carefully. For example, if the
    thesaurus dictionary is assigned to handle
    only the <literal>asciiword</literal> token, then a thesaurus dictionary
    definition like <literal>one 7</literal> will not work since token type
    <literal>uint</literal> is not assigned to the thesaurus dictionary.
   </para>
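
   <para>
    A minimal sketch of a mapping that covers both token types follows.  It
    assumes that a thesaurus dictionary named
    <literal>thesaurus_simple</literal> has already been created, and it uses
    a hypothetical configuration name, <literal>my_config</literal>; adjust
    both to your setup:

<programlisting>
-- my_config is a hypothetical configuration copied from the built-in one
CREATE TEXT SEARCH CONFIGURATION my_config (COPY = pg_catalog.english);

-- let the thesaurus see plain ASCII words ...
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword
    WITH thesaurus_simple, english_stem;

-- ... and unsigned integers, so that a phrase such as "one 7" can match
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR uint
    WITH thesaurus_simple, simple;

-- inspect which token type the parser assigns to each token
SELECT alias, token FROM ts_debug('my_config', 'one 7');
</programlisting>

    The thesaurus dictionary is listed first for each token type so that it
    sees the tokens before the fallback dictionary does.
   </para>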

   <caution>
    <para>
     Thesauruses are used during indexing, so any change in the thesaurus
     dictionary's parameters <emphasis>requires</emphasis> reindexing.
     For most other dictionary types, small changes such as adding or
     removing stop words do not force reindexing.
    </para>
   </caution>
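
   <para>
    As a rough sketch of what reindexing can involve, assume a hypothetical
    table <literal>docs</literal> with a text column <literal>body</literal>,
    an expression index <literal>docs_body_idx</literal>, and a stored
    <type>tsvector</type> column <literal>body_tsv</literal>.  None of these
    names come from this documentation; they are placeholders:

<programlisting>
-- rebuild an expression index that was created along the lines of
--   CREATE INDEX docs_body_idx ON docs
--       USING GIN (to_tsvector('my_config', body));
REINDEX INDEX docs_body_idx;

-- a stored tsvector column has to be recomputed as well
UPDATE docs SET body_tsv = to_tsvector('my_config', body);
</programlisting>
   </para>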

  <sect3 id="textsearch-thesaurus-config">
   <title>Thesaurus Configuration</title>

   <para>
    To define a new thesaurus dictionary, use the <literal>thesaurus</literal>
    template.  For example:

<programlisting>
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);
</programlisting>

    Here:
    <itemizedlist  spacing="compact" mark="bullet">
     <listitem>
      <para>
       <literal>thesaurus_simple</literal> is the new dictionary's name.
      </para>
     </listitem>
     <listitem>
      <para>
       <literal>mythesaurus</literal> is the base name of the thesaurus
       configuration file.
       (Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</filename>,
       where <literal>$SHAREDIR</literal> means the installation shared-data
       directory.)
      </para>
     </listitem>
     <listitem>
      <para>
       <literal>pg_catalog.english_stem</literal>

Title: Thesaurus Dictionary Details and Configuration
Summary
The thesaurus dictionary replaces non-preferred terms with a preferred one, extending the synonym dictionary with phrase support. It uses a single subdictionary to normalize input and reports an error if the subdictionary does not recognize a sample word. An asterisk before an indexed word skips subdictionary processing for that word. When several phrases match, the longest match wins, and ties are broken by using the last definition. Specific stop words cannot be named in a phrase; instead, '?' marks a position where any stop word may appear. Because it recognizes phrases, the dictionary remembers its state and interacts with the parser, so it must be assigned to every token type that its phrases contain. Changes to a thesaurus dictionary require reindexing. A thesaurus dictionary is created from the 'thesaurus' template by specifying the base file name and the subdictionary.