Home Explore Blog CI



postgresql

31th chunk of `doc/src/sgml/textsearch.sgml`
b19b5e92554e4fb882f5cbd4e958fbf3a3e065400363b9030000000100000fa1
 predefined
   dictionary template is described below.  If no existing
   template is suitable, it is possible to create new ones; see the
   <filename>contrib/</filename> area of the <productname>PostgreSQL</productname> distribution
   for examples.
  </para>

  <para>
   A text search configuration binds a parser together with a set of
   dictionaries to process the parser's output tokens.  For each token
   type that the parser can return, a separate list of dictionaries is
   specified by the configuration.  When a token of that type is found
   by the parser, each dictionary in the list is consulted in turn,
   until some dictionary recognizes it as a known word.  If it is identified
   as a stop word, or if no dictionary recognizes the token, it will be
   discarded and not indexed or searched for.
   Normally, the first dictionary that returns a non-<literal>NULL</literal>
   output determines the result, and any remaining dictionaries are not
   consulted; but a filtering dictionary can replace the given word
   with a modified word, which is then passed to subsequent dictionaries.
  </para>

  <para>
   The general rule for configuring a list of dictionaries
   is to place first the most narrow, most specific dictionary, then the more
   general dictionaries, finishing with a very general dictionary, like
   a <application>Snowball</application> stemmer or <literal>simple</literal>, which
   recognizes everything.  For example, for an astronomy-specific search
   (<literal>astro_en</literal> configuration) one could bind token type
   <type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical
   terms, a general English dictionary and a <application>Snowball</application> English
   stemmer:

<programlisting>
ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
</programlisting>
  </para>

  <para>
   A filtering dictionary can be placed anywhere in the list, except at the
   end where it'd be useless.  Filtering dictionaries are useful to partially
   normalize words to simplify the task of later dictionaries.  For example,
   a filtering dictionary could be used to remove accents from accented
   letters, as is done by the <xref linkend="unaccent"/> module.
  </para>

  <sect2 id="textsearch-stopwords">
   <title>Stop Words</title>

   <para>
    Stop words are words that are very common, appear in almost every
    document, and have no discrimination value. Therefore, they can be ignored
    in the context of full text searching. For example, every English text
    contains words like <literal>a</literal> and <literal>the</literal>, so it is
    useless to store them in an index.  However, stop words do affect the
    positions in <type>tsvector</type>, which in turn affect ranking:

<screen>
SELECT to_tsvector('english', 'in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6
</screen>

    The missing positions 1,2,4 are because of stop words.  Ranks
    calculated for documents with and without stop words are quite different:

<screen>
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list &amp; stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list &amp; stop'));
 ts_rank_cd
------------
        0.1
</screen>

   </para>

   <para>
    It is up to the specific dictionary how it treats stop words. For example,
    <literal>ispell</literal> dictionaries first normalize words and then
    look at the list of stop words, while <literal>Snowball</literal> stemmers
    first check the list of stop words. The reason for the different
    behavior is an attempt to decrease noise.
   </para>

  </sect2>

  <sect2 id="textsearch-simple-dictionary">
   <title>Simple Dictionary</title>

   <para>
    The <literal>simple</literal> dictionary template operates by converting the
    input token

Title: Text Search Configuration and Stop Words
Summary
The text explains how text search configurations bind parsers and dictionaries to process tokens, using lists of dictionaries for each token type. It advises ordering dictionaries from specific to general, with Snowball stemmers or "simple" dictionaries at the end. Filtering dictionaries can be used to normalize words. Stop words, being common and lacking discrimination value, are ignored in indexing but affect tsvector positions and ranking. Dictionaries handle stop words differently, such as ispell normalizing before checking and Snowball stemmers checking first.