postgresql

38th chunk of `doc/src/sgml/textsearch.sgml`
 the <filename>$SHAREDIR/tsearch_data</filename> directory
     </para>
    </listitem>
    <listitem>
     <para>
      load files into PostgreSQL with the following command:
<programlisting>
CREATE TEXT SEARCH DICTIONARY english_hunspell (
    TEMPLATE = ispell,
    DictFile = en_us,
    AffFile = en_us,
    Stopwords = english);
</programlisting>
     </para>
    </listitem>
   </itemizedlist>

   <para>
    Here, <literal>DictFile</literal>, <literal>AffFile</literal>, and <literal>StopWords</literal>
    specify the base names of the dictionary, affixes, and stop-words files.
    The stop-words file has the same format explained above for the
    <literal>simple</literal> dictionary type.  The format of the other files is
    not specified here but is available from the above-mentioned web sites.
   </para>

   <para>
    Ispell dictionaries usually recognize a limited set of words, so in a
    text search configuration they should be followed by another, broader
    dictionary, for example a Snowball dictionary, which recognizes everything.
   </para>
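
   <para>
    As a minimal sketch of such chaining, the commands below map the word
    token types to the <literal>english_hunspell</literal> dictionary defined
    above, followed by the built-in <literal>english_stem</literal> Snowball
    dictionary.  The configuration name <literal>my_english</literal> is
    hypothetical and used only for illustration:
<programlisting>
-- my_english is an assumed name; it starts as a copy of the built-in
-- english configuration
CREATE TEXT SEARCH CONFIGURATION my_english ( COPY = pg_catalog.english );

-- consult english_hunspell first; words it does not recognize
-- fall through to english_stem
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_hunspell, english_stem;
</programlisting>
    Dictionaries are consulted in the order listed, so the broader
    dictionary should come last.
   </para>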

   <para>
    The <filename>.affix</filename> file of <application>Ispell</application> has the following
    structure:
<programlisting>
prefixes
flag *A:
    .           >   RE      # As in enter > reenter
suffixes
flag T:
    E           >   ST      # As in late > latest
    [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
    [AEIOU]Y    >   EST     # As in gray > grayest
    [^EY]       >   EST     # As in small > smallest
</programlisting>
   </para>
   <para>
    And the <filename>.dict</filename> file has the following structure:
<programlisting>
lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS
</programlisting>
   </para>

   <para>
    The format of the <filename>.dict</filename> file is:
<programlisting>
basic_form/affix_class_name
</programlisting>
   </para>

   <para>
    In the <filename>.affix</filename> file every affix flag is described in the
    following format:
<programlisting>
condition > [-stripping_letters,] adding_affix
</programlisting>
   </para>

   <para>
    Here, the condition has a format similar to that of regular expressions.
    It can use groupings <literal>[...]</literal> and <literal>[^...]</literal>.
    For example, <literal>[AEIOU]Y</literal> means that the last letter of the word
    is <literal>"y"</literal> and the penultimate letter is <literal>"a"</literal>,
    <literal>"e"</literal>, <literal>"i"</literal>, <literal>"o"</literal> or <literal>"u"</literal>.
    <literal>[^EY]</literal> means that the last letter is neither <literal>"e"</literal>
    nor <literal>"y"</literal>.
   </para>
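
   <para>
    One way to see such rules in action is to pass an inflected form to
    <function>ts_lexize</function>.  The sketch below assumes that the
    <literal>english_hunspell</literal> dictionary defined above is loaded,
    and that its files contain an entry equivalent to
    <literal>large/PRTY</literal> together with a suffix flag
    <literal>T</literal> like the one shown.  The actual result depends on
    the installed files, so no output is shown:
<programlisting>
-- with a rule equivalent to "E > ST" under flag T, "largest" would be
-- expected to normalize to its basic form "large" (an assumption about
-- the loaded files, not a guaranteed result)
SELECT ts_lexize('english_hunspell', 'largest');
</programlisting>
   </para>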

   <para>
    Ispell dictionaries support splitting compound words,
    which is a useful feature.
    Notice that the affix file should specify a special flag using the
    <literal>compoundwords controlled</literal> statement that marks dictionary
    words that can participate in compound formation:

<programlisting>
compoundwords  controlled z
</programlisting>

    Here are some examples for the Norwegian language:

<programlisting>
SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
   {over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
   {sjokoladefabrikk,sjokolade,fabrikk}
</programlisting>
   </para>
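
   <para>
    The examples above presume that a dictionary named
    <literal>norwegian_ispell</literal> already exists.  A minimal sketch of
    how it might be defined follows.  The base name <literal>nn_no</literal>
    is an assumption standing in for whatever Norwegian dictionary and affix
    files are installed in <filename>$SHAREDIR/tsearch_data</filename>:
<programlisting>
-- nn_no is an assumed base name; substitute the files actually installed.
-- The affix file must contain a "compoundwords controlled" statement
-- for compound splitting to take effect.
CREATE TEXT SEARCH DICTIONARY norwegian_ispell (
    TEMPLATE = ispell,
    DictFile = nn_no,
    AffFile = nn_no);
</programlisting>
   </para>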

   <para>
    The <application>MySpell</application> format is a subset of the <application>Hunspell</application> format.
    The <filename>.affix</filename> file of <application>Hunspell</application> has the following
    structure:
<programlisting>
PFX A Y 1
PFX A   0     re         .
SFX T N 4
SFX T   0     st         e
SFX T   y     iest       [^aeiou]y
SFX T   0     est        [aeiou]y
SFX T   0     est        [^ey]
</programlisting>
   </para>

   <para>
    The first line of an affix class is the header. The fields of an affix rule are
    listed after the header:
   </para>
   <itemizedlist spacing="compact" mark="bullet">
    <listitem>
     <para>
      parameter name (PFX or SFX)
     </para>
    </listitem>
  
