principle, token classes depend on the specific
application, but for most purposes it is adequate to use a predefined
set of classes.
<productname>PostgreSQL</productname> uses a <firstterm>parser</firstterm> to
perform this step. A standard parser is provided, and custom parsers
can be created for specific needs.
</para>
</listitem>
<listitem>
<para>
<emphasis>Converting tokens into <firstterm>lexemes</firstterm></emphasis>.
A lexeme is a string, just like a token, but it has been
<firstterm>normalized</firstterm> so that different forms of the same word
are made alike. For example, normalization almost always includes
folding upper-case letters to lower-case, and often involves removal
of suffixes (such as <literal>s</literal> or <literal>es</literal> in English).
This allows searches to find variant forms of the
same word, without tediously entering all the possible variants.
Also, this step typically eliminates <firstterm>stop words</firstterm>, which
are words that are so common that they are useless for searching.
(In short, then, tokens are raw fragments of the document text, while
lexemes are words that are believed useful for indexing and searching.)
<productname>PostgreSQL</productname> uses <firstterm>dictionaries</firstterm> to
perform this step. Various standard dictionaries are provided, and
custom ones can be created for specific needs.
</para>
</listitem>
<listitem>
<para>
<emphasis>Storing preprocessed documents optimized for
searching</emphasis>. For example, each document can be represented
as a sorted array of normalized lexemes. Along with the lexemes it is
often desirable to store positional information to use for
<firstterm>proximity ranking</firstterm>, so that a document that
contains a more <quote>dense</quote> region of query words is
assigned a higher rank than one with scattered query words.
</para>
</listitem>
</itemizedlist>
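<para>
As a rough illustration of the preprocessing steps above, the following
sketch assumes the built-in <literal>english</literal> configuration and
passes a sample sentence to <function>to_tsvector</function>, which parses
the text, normalizes the tokens, and returns the stored form in one call:
<programlisting>
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</programlisting>
Here the stop words (<literal>a</literal>, <literal>on</literal>,
<literal>it</literal>) and the punctuation are dropped,
<literal>rats</literal> is reduced to <literal>rat</literal>, and each
lexeme keeps the positions at which it occurred. Results with other
configurations will differ.
</para>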
<para>
Dictionaries allow fine-grained control over how tokens are normalized.
With appropriate dictionaries, you can:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Define stop words that should not be indexed.
</para>
</listitem>
<listitem>
<para>
Map synonyms to a single word using <application>Ispell</application>.
</para>
</listitem>
<listitem>
<para>
Map phrases to a single word using a thesaurus.
</para>
</listitem>
<listitem>
<para>
Map different variations of a word to a canonical form using
an <application>Ispell</application> dictionary.
</para>
</listitem>
<listitem>
<para>
Map different variations of a word to a canonical form using
<application>Snowball</application> stemmer rules.
</para>
</listitem>
</itemizedlist>
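<para>
To see what a single dictionary does with a token, a minimal sketch is to
call <function>ts_lexize</function> directly. The example assumes the
built-in <literal>english_stem</literal> Snowball dictionary; other
dictionaries may return different lexemes:
<programlisting>
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
</programlisting>
The first call maps a variant form to its canonical lexeme; the second
returns an empty array because <literal>a</literal> is a stop word in
that dictionary.
</para>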
<para>
A data type <type>tsvector</type> is provided for storing preprocessed
documents, along with a type <type>tsquery</type> for representing processed
queries (<xref linkend="datatype-textsearch"/>). There are many
functions and operators available for these data types
(<xref linkend="functions-textsearch"/>), the most important of which is
the match operator <literal>@@</literal>, which we introduce in
<xref linkend="textsearch-matching"/>. Full text searches can be accelerated
using indexes (<xref linkend="textsearch-indexes"/>).
</para>
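<para>
As a brief sketch of how the two data types work together, the queries
below match a <type>tsvector</type> against a <type>tsquery</type> with
<literal>@@</literal>. The second query assumes the built-in
<literal>english</literal> configuration for normalization:
<programlisting>
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
 ?column?
----------
 t

SELECT to_tsvector('english', 'fat cats ate fat rats') @@ to_tsquery('english', 'fat &amp; rat');
 ?column?
----------
 t
</programlisting>
The direct casts in the first query do not normalize their input, whereas
<function>to_tsvector</function> and <function>to_tsquery</function>
reduce <literal>rats</literal> and <literal>rat</literal> to the same
lexeme before matching.
</para>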
<sect2 id="textsearch-document">
<title>What Is a Document?</title>
<indexterm zone="textsearch-document">
<primary>document</primary>
<secondary>text search</secondary>
</indexterm>
<para>
A <firstterm>document</firstterm> is the unit of searching in a full text search
system; for example, a magazine article or email message. The text search
engine must be able to parse documents and store associations of lexemes