39th chunk of `doc/src/sgml/textsearch.sgml`
 Norwegian language:

<programlisting>
SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
   {over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
   {sjokoladefabrikk,sjokolade,fabrikk}
</programlisting>
   </para>

   <para>
    The <application>MySpell</application> format is a subset of the
    <application>Hunspell</application> format.
    The <filename>.affix</filename> file of <application>Hunspell</application> has the following
    structure:
<programlisting>
PFX A Y 1
PFX A   0     re         .
SFX T N 4
SFX T   0     st         e
SFX T   y     iest       [^aeiou]y
SFX T   0     est        [aeiou]y
SFX T   0     est        [^ey]
</programlisting>
   </para>

   <para>
    The first line of an affix class is the header. The affix rules are
    listed after the header, and each rule has the following fields:
   </para>
   <itemizedlist spacing="compact" mark="bullet">
    <listitem>
     <para>
      parameter name (PFX or SFX)
     </para>
    </listitem>
    <listitem>
     <para>
      flag (name of the affix class)
     </para>
    </listitem>
    <listitem>
     <para>
      characters to strip from the beginning (for a prefix) or the end (for a
      suffix) of the word
     </para>
    </listitem>
    <listitem>
     <para>
      affix to add
     </para>
    </listitem>
    <listitem>
     <para>
      condition, in a format similar to that of regular expressions
     </para>
    </listitem>
   </itemizedlist>
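
   <para>
    As an illustration, the rule <literal>SFX T y iest [^aeiou]y</literal>
    from the listing above can be read field by field as follows. The
    annotations are explanatory only and are not part of the file format.
    In Hunspell's format, a <literal>0</literal> in the stripping field
    means that nothing is stripped, and the header line
    (for example <literal>SFX T N 4</literal>) gives the parameter name,
    the flag, a cross-product indicator, and the number of rules that follow.
<programlisting>
SFX          parameter name: this is a suffix rule
T            flag: the rule belongs to affix class T
y            strip "y" from the end of the word
iest         then append "iest"
[^aeiou]y    condition: the word must end in a non-vowel followed by "y"
</programlisting>
   </para>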

   <para>
    The <filename>.dict</filename> file looks like the <filename>.dict</filename> file of
    <application>Ispell</application>:
<programlisting>
larder/M
lardy/RT
large/RSPMYT
largehearted
</programlisting>
   </para>
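
   <para>
    The flags after the slash name the affix classes that may be applied to
    a word. As a rough worked example, assuming the rules for flag
    <literal>T</literal> shown earlier, two of the entries above would
    produce the following forms. A dictionary built from these files would
    be expected to work in the opposite direction, reducing the derived
    form back to the listed word.
<programlisting>
lardy/RT       ends in "dy", matches [^aeiou]y:  strip "y", add "iest"   gives lardiest
large/RSPMYT   ends in "e",  matches e:          strip nothing, add "st" gives largest
</programlisting>
   </para>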

   <note>
    <para>
     <application>MySpell</application> does not support compound words.
     <application>Hunspell</application> has sophisticated support for compound words. At
     present, <productname>PostgreSQL</productname> implements only the basic
     compound word operations of Hunspell.
    </para>
   </note>

  </sect2>

  <sect2 id="textsearch-snowball-dictionary">
   <title><application>Snowball</application> Dictionary</title>

   <para>
    The <application>Snowball</application> dictionary template is based on a project
    by Martin Porter, inventor of the popular Porter stemming algorithm
    for the English language.  Snowball now provides stemming algorithms for
    many languages (see the <ulink url="https://snowballstem.org/">Snowball
    site</ulink> for more information).  Each algorithm understands how to
    reduce common variant forms of words to a base, or stem, spelling within
    its language.  A Snowball dictionary requires a <literal>language</literal>
    parameter to identify which stemmer to use, and optionally can specify a
    <literal>stopword</literal> file name that gives a list of words to eliminate.
    (<productname>PostgreSQL</productname>'s standard stopword lists are also
    provided by the Snowball project.)
    For example, there is a built-in definition equivalent to

<programlisting>
CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);
</programlisting>

    The stopword file format is the same as already explained.
   </para>
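
   <para>
    As a minimal sketch of the optional <literal>StopWords</literal>
    parameter, the following defines a second English stemmer without a
    stopword list and compares it with the built-in
    <literal>english_stem</literal>. The dictionary name
    <literal>english_stem_nostop</literal> is hypothetical and chosen only
    for this illustration. The results shown are the expected ones.
<programlisting>
-- Hypothetical dictionary: same stemmer, but no stopword file
CREATE TEXT SEARCH DICTIONARY english_stem_nostop (
    TEMPLATE = snowball,
    Language = english
);

SELECT ts_lexize('english_stem', 'stars');
   {star}
SELECT ts_lexize('english_stem', 'a');
   {}
SELECT ts_lexize('english_stem_nostop', 'a');
   {a}
</programlisting>
    An empty array means that the word was recognized as a stop word and
    will be discarded.
   </para>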

   <para>
    A <application>Snowball</application> dictionary recognizes everything, whether
    or not it is able to simplify the word, so it should be placed
    at the end of the dictionary list. It is useless to have it
    before any other dictionary because a token will never pass through it to
    the next dictionary.
   </para>
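
   <para>
    For instance, a mapping that tries a more specific dictionary first and
    falls back to the stemmer could be sketched as follows. The
    configuration name <literal>my_config</literal> is hypothetical, and
    the sketch assumes that an <literal>english_ispell</literal> dictionary
    has already been created.
<programlisting>
-- english_stem is listed last because it recognizes every token
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, word
    WITH english_ispell, english_stem;
</programlisting>
   </para>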

  </sect2>

 </sect1>

 <sect1 id="textsearch-configuration">
  <title>Configuration Example</title>

   <para>
    A text search configuration specifies all options necessary to transform a
    document into a <type>tsvector</type>: the parser to use to break text
    into tokens, and the dictionaries to use to transform each token into a
    lexeme.  Every call of
   

Title: Hunspell Details and Snowball Dictionaries
Summary
This section provides examples of using `ts_lexize` with a Norwegian Ispell dictionary. It describes the structure of Hunspell's `.affix` and `.dict` files, noting that MySpell does not support compound words but Hunspell does. PostgreSQL currently implements only the basic Hunspell compound word operations. The section then turns to Snowball dictionaries, which are based on Martin Porter's stemming algorithms and require a language parameter and, optionally, a stopword file. Because a Snowball dictionary recognizes every token, it should be placed last in the dictionary list; any dictionary listed after it would never receive a token.