positions 1,2,4 are because of stop words. Ranks
calculated for documents with and without stop words are quite different:
<screen>
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop'));
ts_rank_cd
------------
0.05
SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop'));
ts_rank_cd
------------
0.1
</screen>
</para>
<para>
Each dictionary decides for itself how to treat stop words. For example,
<literal>ispell</literal> dictionaries first normalize words and then
look at the list of stop words, while <literal>Snowball</literal> stemmers
check the list of stop words first, before stemming. The difference in
behavior reflects an attempt to decrease noise.
</para>
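<para>
As a minimal illustration of the stop-word check, <literal>ts_lexize</literal>
can be called against the built-in <literal>english_stem</literal> Snowball
dictionary (this assumes a standard installation in which that dictionary
exists). A stop word should come back as an empty array, while other words
are stemmed:
<programlisting>
-- 'the' is in the English stop word list, so an empty array is returned
SELECT ts_lexize('english_stem', 'the');

-- 'stars' is not a stop word, so it is stemmed to a lexeme
SELECT ts_lexize('english_stem', 'stars');
</programlisting>
</para>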
</sect2>
<sect2 id="textsearch-simple-dictionary">
<title>Simple Dictionary</title>
<para>
The <literal>simple</literal> dictionary template operates by converting the
input token to lower case and checking it against a file of stop words.
If it is found in the file then an empty array is returned, causing
the token to be discarded. If not, the lower-cased form of the word
is returned as the normalized lexeme. Alternatively, the dictionary
can be configured to report non-stop-words as unrecognized, allowing
them to be passed on to the next dictionary in the list.
</para>
<para>
Here is an example of a dictionary definition using the <literal>simple</literal>
template:
<programlisting>
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
TEMPLATE = pg_catalog.simple,
STOPWORDS = english
);
</programlisting>
Here, <literal>english</literal> is the base name of a file of stop words.
The file's full name will be
<filename>$SHAREDIR/tsearch_data/english.stop</filename>,
where <literal>$SHAREDIR</literal> means the
<productname>PostgreSQL</productname> installation's shared-data directory,
often <filename>/usr/local/share/postgresql</filename> (use <command>pg_config
--sharedir</command> to determine it if you're not sure).
The file format is simply a list
of words, one per line. Blank lines and trailing spaces are ignored,
and upper case is folded to lower case, but no other processing is done
on the file contents.
</para>
<para>
Now we can test our dictionary:
<screen>
SELECT ts_lexize('public.simple_dict', 'YeS');
ts_lexize
-----------
{yes}
SELECT ts_lexize('public.simple_dict', 'The');
ts_lexize
-----------
{}
</screen>
</para>
<para>
We can also choose to return <literal>NULL</literal>, instead of the lower-cased
word, if it is not found in the stop words file. This behavior is
selected by setting the dictionary's <literal>Accept</literal> parameter to
<literal>false</literal>. Continuing the example:
<screen>
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
SELECT ts_lexize('public.simple_dict', 'YeS');
ts_lexize
-----------


SELECT ts_lexize('public.simple_dict', 'The');
ts_lexize
-----------
{}
</screen>
</para>
<para>
With the default setting of <literal>Accept</literal> = <literal>true</literal>,
it is only useful to place a <literal>simple</literal> dictionary at the end
of a list of dictionaries, since it will never pass on any token to
a following dictionary. Conversely, <literal>Accept</literal> = <literal>false</literal>
is only useful when there is at least one following dictionary.
</para>
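<para>
As a sketch of the second arrangement, the following assumes that
<literal>public.simple_dict</literal> has been set to
<literal>Accept = false</literal> as above. The configuration name
<literal>public.my_config</literal> is hypothetical, chosen only for this
illustration, and <literal>english_stem</literal> is the built-in Snowball
dictionary. Here the <literal>simple</literal> dictionary acts only as a
stop-word filter and hands every other token to the stemmer:
<programlisting>
-- public.my_config is a hypothetical name chosen for this illustration
CREATE TEXT SEARCH CONFIGURATION public.my_config ( COPY = pg_catalog.english );

-- simple_dict is consulted first; tokens it reports as unrecognized
-- fall through to english_stem
ALTER TEXT SEARCH CONFIGURATION public.my_config
    ALTER MAPPING FOR asciiword
    WITH public.simple_dict, english_stem;

-- 'The' is discarded by simple_dict; 'stars' is passed on and stemmed
SELECT to_tsvector('public.my_config', 'The stars');
</programlisting>
</para>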
<caution>
<para>
Most types of dictionaries rely on configuration files, such as files of
stop words. These files <emphasis>must</emphasis> be stored in UTF-8 encoding.
They will be translated to the actual database encoding, if that is
different, when they are read into the server.
</para>
</caution>
<caution>
<para>
Normally, a database session will read a dictionary configuration file