postgresql

32nd chunk of `doc/src/sgml/textsearch.sgml`
 positions 1,2,4 are because of stop words.  Ranks
    calculated for documents with and without stop words are quite different:

<screen>
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list &amp; stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list &amp; stop'));
 ts_rank_cd
------------
        0.1
</screen>

   </para>

   <para>
    It is up to the specific dictionary how it treats stop words. For example,
    <literal>ispell</literal> dictionaries first normalize words and then
    look at the list of stop words, while <literal>Snowball</literal> stemmers
    first check the list of stop words. The reason for the different
    behavior is an attempt to decrease noise.
   </para>
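
   <para>
    As a small illustration of stop-word handling inside a dictionary, the
    built-in <literal>english_stem</literal> Snowball dictionary can be
    probed directly with <function>ts_lexize</function>.  This sketch assumes
    a default installation in which <literal>english_stem</literal> uses the
    standard English stop word list; an empty array means the word was
    recognized as a stop word, while a non-empty array holds the stemmed
    lexeme:

<screen>
SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}

SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}
</screen>
   </para>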

  </sect2>

  <sect2 id="textsearch-simple-dictionary">
   <title>Simple Dictionary</title>

   <para>
    The <literal>simple</literal> dictionary template operates by converting the
    input token to lower case and checking it against a file of stop words.
    If it is found in the file then an empty array is returned, causing
    the token to be discarded.  If not, the lower-cased form of the word
    is returned as the normalized lexeme.  Alternatively, the dictionary
    can be configured to report non-stop-words as unrecognized, allowing
    them to be passed on to the next dictionary in the list.
   </para>

   <para>
    Here is an example of a dictionary definition using the <literal>simple</literal>
    template:

<programlisting>
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);
</programlisting>

    Here, <literal>english</literal> is the base name of a file of stop words.
    The file's full name will be
    <filename>$SHAREDIR/tsearch_data/english.stop</filename>,
    where <literal>$SHAREDIR</literal> means the
    <productname>PostgreSQL</productname> installation's shared-data directory,
    often <filename>/usr/local/share/postgresql</filename> (use <command>pg_config
    --sharedir</command> to determine it if you're not sure).
    The file format is simply a list
    of words, one per line.  Blank lines and trailing spaces are ignored,
    and upper case is folded to lower case, but no other processing is done
    on the file contents.
   </para>

   <para>
    Now we can test our dictionary:

<screen>
SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}
</screen>
   </para>

   <para>
    We can also choose to return <literal>NULL</literal>, instead of the lower-cased
    word, if it is not found in the stop words file.  This behavior is
    selected by setting the dictionary's <literal>Accept</literal> parameter to
    <literal>false</literal>.  Continuing the example:

<screen>
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );

SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------


SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}
</screen>
   </para>

   <para>
    With the default setting of <literal>Accept</literal> = <literal>true</literal>,
    it is only useful to place a <literal>simple</literal> dictionary at the end
    of a list of dictionaries, since it will never pass on any token to
    a following dictionary.  Conversely, <literal>Accept</literal> = <literal>false</literal>
    is only useful when there is at least one following dictionary.
   </para>
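
   <para>
    The following is a minimal sketch of such a chain, not a recommended
    setup.  The names <literal>public.stop_filter</literal> and
    <literal>public.my_config</literal> are hypothetical and chosen only for
    illustration.  A <literal>simple</literal> dictionary with
    <literal>Accept</literal> = <literal>false</literal> is placed first, so
    that it discards stop words and reports every other token as
    unrecognized, leaving it to the Snowball stemmer that follows:

<programlisting>
-- Hypothetical dictionary: drops English stop words and passes
-- all other tokens on to the next dictionary in the list.
CREATE TEXT SEARCH DICTIONARY public.stop_filter (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english,
    ACCEPT = false
);

-- Hypothetical configuration, copied from the built-in english one.
CREATE TEXT SEARCH CONFIGURATION public.my_config (
    COPY = pg_catalog.english
);

-- For plain ASCII words, consult stop_filter first, then the stemmer.
ALTER TEXT SEARCH CONFIGURATION public.my_config
    ALTER MAPPING FOR asciiword
    WITH public.stop_filter, pg_catalog.english_stem;

-- 'The' should be discarded by stop_filter; 'Stars' should fall
-- through to english_stem and be indexed in its stemmed form.
SELECT to_tsvector('public.my_config', 'The Stars');
</programlisting>

    If <literal>public.stop_filter</literal> were instead left at the default
    <literal>Accept</literal> = <literal>true</literal>, it would return a
    lexeme for every non-stop word itself, and
    <literal>english_stem</literal> would never be consulted for
    <literal>asciiword</literal> tokens.
   </para>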

   <caution>
    <para>
     Most types of dictionaries rely on configuration files, such as files of
     stop words.  These files <emphasis>must</emphasis> be stored in UTF-8 encoding.
     They will be translated to the actual database encoding, if that is
     different, when they are read into the server.
    </para>
   </caution>

   <caution>
    <para>
     Normally, a database session will read a dictionary configuration file
 

Title: Simple Dictionary for Text Search
Summary
The simple dictionary template converts input tokens to lowercase and checks them against a stop word file. If found, the token is discarded; otherwise, the lowercased word is returned or NULL is returned if the "Accept" parameter is set to false. The example demonstrates creating and testing a simple dictionary, including altering its behavior to return NULL for non-stop words. Simple dictionaries with default settings are best placed at the end of a dictionary list, while those with Accept = false are useful with following dictionaries. Configuration files, such as stop word files, must be stored in UTF-8 encoding.