predefined
dictionary template is described below. If no existing
template is suitable, it is possible to create new ones; see the
<filename>contrib/</filename> area of the <productname>PostgreSQL</productname> distribution
for examples.
</para>
<para>
A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token
type that the parser can return, a separate list of dictionaries is
specified by the configuration. When a token of that type is found
by the parser, each dictionary in the list is consulted in turn,
until some dictionary recognizes it as a known word. If the token is identified
as a stop word, or if no dictionary recognizes it, it is
discarded and is neither indexed nor searched for.
Normally, the first dictionary that returns a non-<literal>NULL</literal>
output determines the result, and any remaining dictionaries are not
consulted; but a filtering dictionary can replace the given word
with a modified word, which is then passed to subsequent dictionaries.
</para>
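<para>
One way to observe this process, as a minimal sketch, is the
<function>ts_debug</function> function. For each token it reports the list of
dictionaries assigned to the token's type, the dictionary that recognized the
token, and the lexemes produced. The built-in <literal>english</literal>
configuration is assumed here, and the sample sentence is only an illustration:
<programlisting>
-- For each token: its type, the dictionaries consulted for that type,
-- the dictionary that recognized the token, and the resulting lexemes.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'a list of stop words');
</programlisting>
Rows for stop words such as <literal>a</literal> should show an empty lexeme
array, meaning that the token is discarded.
</para>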
<para>
The general rule for configuring a list of dictionaries
is to place the narrowest, most specific dictionary first, then the more
general dictionaries, finishing with a very general dictionary, such as
a <application>Snowball</application> stemmer or <literal>simple</literal>, which
recognizes everything. For example, for an astronomy-specific search
(<literal>astro_en</literal> configuration) one could bind token type
<type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical
terms, a general English dictionary and a <application>Snowball</application> English
stemmer:
<programlisting>
ALTER TEXT SEARCH CONFIGURATION astro_en
ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
</programlisting>
</para>
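<para>
The mapping above presupposes that the configuration and its dictionaries
already exist. A minimal sketch of the setup that would be run beforehand
could look like the following. The file names
(<filename>astrosyn.syn</filename>, <filename>english.dict</filename>,
<filename>english.affix</filename>) are assumptions for illustration and would
have to be present in the installation's <filename>tsearch_data</filename>
directory; <literal>english_stem</literal> is predefined:
<programlisting>
-- Synonym dictionary of astronomical terms (assumes astrosyn.syn exists)
CREATE TEXT SEARCH DICTIONARY astrosyn (
    TEMPLATE = synonym,
    SYNONYMS = astrosyn
);

-- General English dictionary (assumes english.dict and english.affix exist)
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

-- Start from a configuration with no mappings, so that ADD MAPPING
-- does not conflict with an existing mapping for asciiword
CREATE TEXT SEARCH CONFIGURATION astro_en (
    PARSER = pg_catalog."default"
);
</programlisting>
If the configuration were instead created with <literal>COPY = english</literal>,
<type>asciiword</type> would already have a mapping, and
<literal>ALTER MAPPING</literal> would be needed in place of
<literal>ADD MAPPING</literal>.
</para>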
<para>
A filtering dictionary can be placed anywhere in the list, except at the
end, where it would be useless. Filtering dictionaries are useful to partially
normalize words to simplify the task of later dictionaries. For example,
a filtering dictionary could be used to remove accents from accented
letters, as is done by the <xref linkend="unaccent"/> module.
</para>
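<para>
As a sketch of such an arrangement, assuming the
<literal>unaccent</literal> extension is available for installation, a
configuration could place the filtering dictionary ahead of a stemmer. The
configuration name <literal>fr</literal> is hypothetical:
<programlisting>
-- Provides the unaccent filtering dictionary
CREATE EXTENSION unaccent;

-- Copy the built-in French configuration under a hypothetical name
CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );

-- unaccent strips accents first, then passes the word on to french_stem
ALTER TEXT SEARCH CONFIGURATION fr
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;
</programlisting>
</para>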
<sect2 id="textsearch-stopwords">
<title>Stop Words</title>
<para>
Stop words are words that are very common, appear in almost every
document, and have no discrimination value. Therefore, they can be ignored
in the context of full text searching. For example, every English text
contains words like <literal>a</literal> and <literal>the</literal>, so it is
useless to store them in an index. However, stop words do affect the
positions in <type>tsvector</type>, which in turn affect ranking:
<screen>
SELECT to_tsvector('english', 'in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6
</screen>
Positions 1, 2, and 4 are missing because the words at those positions are
stop words. Ranks
calculated for documents with and without stop words are quite different:
<screen>
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.1
</screen>
</para>
<para>
It is up to the specific dictionary how it treats stop words. For example,
<literal>ispell</literal> dictionaries first normalize words and then
look at the list of stop words, while <literal>Snowball</literal> stemmers
first check the list of stop words. The reason for the different
behavior is an attempt to decrease noise.
</para>
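<para>
A quick way to check how a particular dictionary treats a given word is
<function>ts_lexize</function>, which runs a single dictionary on a single
token. The following sketch assumes the predefined
<literal>english_stem</literal> dictionary with its default stop word list:
<programlisting>
-- A stop word is recognized but yields an empty array: {}
SELECT ts_lexize('english_stem', 'the');

-- An ordinary word yields its normalized lexeme: {star}
SELECT ts_lexize('english_stem', 'stars');
</programlisting>
</para>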
</sect2>
<sect2 id="textsearch-simple-dictionary">
<title>Simple Dictionary</title>
<para>
The <literal>simple</literal> dictionary template operates by converting the
input token