Hunspell Details and Snowball Dictionaries

Norwegian language:
<programlisting>
SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
   {over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
   {sjokoladefabrikk,sjokolade,fabrikk}
</programlisting>
</para>

<para>
 The <application>MySpell</application> format is a subset of
 <application>Hunspell</application>. The <filename>.affix</filename> file of
 <application>Hunspell</application> has the following structure:
<programlisting>
PFX A Y 1
PFX A   0     re         .
SFX T N 4
SFX T   0     st         e
SFX T   y     iest       [^aeiou]y
SFX T   0     est        [aeiou]y
SFX T   0     est        [^ey]
</programlisting>
</para>

<para>
 The first line of an affix class is the header. The fields of each affix
 rule are listed after the header:
</para>
<itemizedlist spacing="compact" mark="bullet">
 <listitem>
  <para>
   parameter name (PFX or SFX)
  </para>
 </listitem>
 <listitem>
  <para>
   flag (name of the affix class)
  </para>
 </listitem>
 <listitem>
  <para>
   characters to strip from the beginning (for a prefix) or the end (for a
   suffix) of the word
  </para>
 </listitem>
 <listitem>
  <para>
   the affix to add
  </para>
 </listitem>
 <listitem>
  <para>
   a condition, in a format similar to that of regular expressions.
  </para>
 </listitem>
</itemizedlist>

<para>
 The <filename>.dict</filename> file looks like the <filename>.dict</filename>
 file of <application>Ispell</application>:
<programlisting>
larder/M
lardy/RT
large/RSPMYT
largehearted
</programlisting>
</para>

<note>
 <para>
  <application>MySpell</application> does not support compound words.
  <application>Hunspell</application> has sophisticated support for compound
  words. At present, <productname>PostgreSQL</productname> implements only the
  basic compound word operations of Hunspell.
 </para>
</note>

</sect2>

<sect2 id="textsearch-snowball-dictionary">
 <title><application>Snowball</application> Dictionary</title>

 <para>
  The <application>Snowball</application> dictionary template is based on a
  project by Martin Porter, inventor of the popular Porter's stemming algorithm
  for the English language. Snowball now provides stemming algorithms for many
  languages (see the <ulink url="https://snowballstem.org/">Snowball site</ulink>
  for more information). Each algorithm understands how to reduce common
  variant forms of words to a base, or stem, spelling within its language. A
  Snowball dictionary requires a <literal>language</literal> parameter to
  identify which stemmer to use, and optionally can specify a
  <literal>stopword</literal> file name that gives a list of words to
  eliminate. (<productname>PostgreSQL</productname>'s standard stopword lists
  are also provided by the Snowball project.) For example, there is a built-in
  definition equivalent to
<programlisting>
CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);
</programlisting>
  The stopword file format is the same as already explained.
 </para>

 <para>
  A <application>Snowball</application> dictionary recognizes everything,
  whether or not it is able to simplify the word, so it should be placed at the
  end of the dictionary list. It is useless to have it before any other
  dictionary because a token will never pass through it to the next dictionary.
 </para>

</sect2>

</sect1>

<sect1 id="textsearch-configuration">
 <title>Configuration Example</title>

 <para>
  A text search configuration specifies all options necessary to transform a
  document into a <type>tsvector</type>: the parser to use to break text into
  tokens, and the dictionaries to use to transform each token into a lexeme.
  Every call of

This section gives examples of using `ts_lexize` with a Norwegian Ispell dictionary to split compound words. It describes the structure of Hunspell's `.affix` and `.dict` files and notes that MySpell does not support compound words, whereas Hunspell does; PostgreSQL currently implements only the basic Hunspell compound word operations. The section then turns to Snowball dictionaries, which are based on Martin Porter's stemming project and require a `language` parameter, with an optional stopword file. A Snowball dictionary recognizes every token, so it should be placed last in the dictionary list; any dictionary listed after it would never receive a token.
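
The affix rule fields described above (type, flag, characters to strip, affix to add, condition) can be made concrete with a small sketch. The Python below is an illustration only, not how PostgreSQL or Hunspell implement affix handling: the names `AFFIX_RULES` and `apply_affixes` are hypothetical, the header lines of each affix class are omitted, and the word `do` with flag `A` is an assumed entry that does not appear in the sample `.dict` excerpt. It applies the rules in the generating direction (base word to inflected form); a dictionary used for normalization would presumably apply the same rules in reverse to map an inflected form back to its base word.

```python
import re

# Rules copied from the .affix excerpt in this section, without the header
# lines: (type, flag, strip, add, condition). "0" means "strip nothing".
AFFIX_RULES = [
    ("PFX", "A", "0", "re", "."),
    ("SFX", "T", "0", "st", "e"),
    ("SFX", "T", "y", "iest", "[^aeiou]y"),
    ("SFX", "T", "0", "est", "[aeiou]y"),
    ("SFX", "T", "0", "est", "[^ey]"),
]


def apply_affixes(word, flags):
    """Return the forms generated from `word` by rules whose flag is in `flags`."""
    forms = []
    for kind, flag, strip, add, cond in AFFIX_RULES:
        if flag not in flags:
            continue
        strip = "" if strip == "0" else strip
        if kind == "SFX":
            # For a suffix, the condition is matched against the end of the word.
            if re.search(cond + "$", word) and word.endswith(strip):
                forms.append(word[: len(word) - len(strip)] + add)
        else:
            # For a prefix, the condition is matched against the start of the word.
            if re.match(cond, word) and word.startswith(strip):
                forms.append(add + word[len(strip):])
    return forms


# Entries from the sample .dict file: "lardy/RT" and "large/RSPMYT".
# Only classes A and T are defined in the excerpt, so other flags are ignored.
print(apply_affixes("lardy", "RT"))      # ['lardiest']
print(apply_affixes("large", "RSPMYT"))  # ['largest']
print(apply_affixes("do", "A"))          # ['redo'] (assumed entry)
```

Each word matches exactly one of the four `T` rules because their conditions are mutually exclusive: `lardy` ends in a consonant followed by `y`, so the `y` is stripped before `iest` is added, while `large` ends in `e` and simply gains `st`.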