<firstterm>lexeme</firstterm>. Aside from
improving search quality, normalization and removal of stop words reduce the
size of the <type>tsvector</type> representation of a document, thereby
improving performance. Normalization does not always have linguistic meaning
and usually depends on application semantics.
</para>
<para>
Some examples of normalization:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Linguistic: Ispell dictionaries try to reduce input words to a
normalized form; stemmer dictionaries remove word endings.
</para>
</listitem>
<listitem>
<para>
<acronym>URL</acronym> locations can be canonicalized to make
equivalent URLs match:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
http://www.pgsql.ru/db/mw/index.html
</para>
</listitem>
<listitem>
<para>
http://www.pgsql.ru/db/mw/
</para>
</listitem>
<listitem>
<para>
http://www.pgsql.ru/db/../db/mw/index.html
</para>
</listitem>
</itemizedlist>
</para>
</listitem>
<listitem>
<para>
Color names can be replaced by their hexadecimal values, e.g.,
<literal>red, green, blue, magenta -&gt; FF0000, 00FF00, 0000FF, FF00FF</literal>.
</para>
</listitem>
<listitem>
<para>
When indexing numbers, we can
remove some fractional digits to reduce the range of possible
numbers; for example, <emphasis>3.14</emphasis>159265359,
<emphasis>3.14</emphasis>15926, and <emphasis>3.14</emphasis> will be the same
after normalization if only two digits are kept after the decimal point.
</para>
</listitem>
</itemizedlist>
</para>
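<para>
As a rough illustration of how normalization and stop-word removal shrink a
<type>tsvector</type>, the following query (a sketch that assumes the
built-in <literal>english</literal> configuration) converts a short
sentence. Stop words such as <literal>a</literal>, <literal>on</literal>,
and <literal>it</literal> are dropped, and <literal>rats</literal> is
stemmed to <literal>rat</literal>:
<screen>
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</screen>
</para>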
<para>
A dictionary is a program that accepts a token as
input and returns:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
an array of lexemes if the input token is known to the dictionary
(notice that one token can produce more than one lexeme)
</para>
</listitem>
<listitem>
<para>
a single lexeme with the <literal>TSL_FILTER</literal> flag set, to replace
the original token with a new token to be passed to subsequent
dictionaries (a dictionary that does this is called a
<firstterm>filtering dictionary</firstterm>)
</para>
</listitem>
<listitem>
<para>
an empty array if the dictionary knows the token, but it is a stop word
</para>
</listitem>
<listitem>
<para>
<literal>NULL</literal> if the dictionary does not recognize the input token
</para>
</listitem>
</itemizedlist>
</para>
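<para>
These return values can be inspected with the
<function>ts_lexize</function> function. The following sketch assumes the
built-in <literal>english_stem</literal> dictionary; the first call returns
a lexeme array, and the second returns an empty array because
<literal>a</literal> is a stop word:
<screen>
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
</screen>
A stemmer recognizes every token, so the <literal>NULL</literal> case
appears only with dictionaries that recognize a limited vocabulary, such
as Ispell or synonym dictionaries.
</para>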
<para>
<productname>PostgreSQL</productname> provides predefined dictionaries for
many languages. There are also several predefined templates that can be
used to create new dictionaries with custom parameters. Each predefined
dictionary template is described below. If no existing
template is suitable, it is possible to create new ones; see the
<filename>contrib/</filename> area of the <productname>PostgreSQL</productname> distribution
for examples.
</para>
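<para>
For instance, a new dictionary might be built from the built-in
<literal>simple</literal> template as sketched here. The name
<literal>public.simple_dict</literal> is chosen for illustration only, and
the example assumes that the <literal>english</literal> stop-word file
shipped with the standard distribution is installed:
<programlisting>
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);
</programlisting>
The new dictionary can then be tested with
<function>ts_lexize</function>:
<screen>
SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}
</screen>
</para>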
<para>
A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token
type that the parser can return, a separate list of dictionaries is
specified by the configuration. When a token of that type is found
by the parser, each dictionary in the list is consulted in turn,
until some dictionary recognizes it as a known word. If it is identified
as a stop word, or if no dictionary recognizes the token, it will be
discarded and not indexed or searched for.
Normally, the first dictionary that returns a non-<literal>NULL</literal>
output determines the result, and any remaining dictionaries are not
consulted; but a filtering dictionary