<!-- doc/src/sgml/textsearch.sgml -->
<chapter id="textsearch">
<title>Full Text Search</title>
<indexterm zone="textsearch">
<primary>full text search</primary>
</indexterm>
<indexterm zone="textsearch">
<primary>text search</primary>
</indexterm>
<sect1 id="textsearch-intro">
<title>Introduction</title>
<para>
Full Text Searching (or just <firstterm>text search</firstterm>) provides
the capability to identify natural-language <firstterm>documents</firstterm> that
satisfy a <firstterm>query</firstterm>, and optionally to sort them by
relevance to the query. The most common type of search
is to find all documents containing given <firstterm>query terms</firstterm>
and return them in order of their <firstterm>similarity</firstterm> to the
query. Notions of <varname>query</varname> and
<varname>similarity</varname> are very flexible and depend on the specific
application. The simplest search considers <varname>query</varname> as a
set of words and <varname>similarity</varname> as the frequency of query
words in the document.
</para>
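<para>
As a brief preview of the facilities described in the rest of this
chapter, a document can be matched against a query using the
<type>tsvector</type> and <type>tsquery</type> types with the
<literal>@@</literal> match operator (these functions and types are
explained in detail in later sections):
</para>
<programlisting>
SELECT to_tsvector('english', 'a fat cat sat on a mat') @@
       to_tsquery('english', 'cat &amp; mat');
 ?column?
----------
 t
</programlisting>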
<para>
Textual search operators have existed in databases for years.
<productname>PostgreSQL</productname> has
<literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
<literal>ILIKE</literal> operators for textual data types, but they lack
many essential properties required by modern information systems:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
There is no linguistic support, even for English. Regular expressions
are not sufficient because they cannot easily handle derived words, e.g.,
<literal>satisfies</literal> and <literal>satisfy</literal>. You might
miss documents that contain <literal>satisfies</literal>, although you
probably would like to find them when searching for
<literal>satisfy</literal>. It is possible to use <literal>OR</literal>
to search for multiple derived forms, but this is tedious and error-prone
(some words can have several thousand derivatives).
</para>
</listitem>
<listitem>
<para>
They provide no ordering (ranking) of search results, which makes them
ineffective when thousands of matching documents are found.
</para>
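<para>
For comparison, the full text search facilities described later in this
chapter can compute such an ordering; for example, the
<function>ts_rank</function> function assigns a relevance score to a
document for a given query:
</para>
<programlisting>
SELECT ts_rank(to_tsvector('english', 'a fat cat sat on a mat'),
               to_tsquery('english', 'cat'));
</programlisting>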
</listitem>
<listitem>
<para>
They tend to be slow because there is no index support, so they must
process all documents for every search.
</para>
</listitem>
</itemizedlist>
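<para>
The difference in linguistic handling can be seen directly: a
<literal>LIKE</literal> comparison treats <literal>satisfies</literal>
and <literal>satisfy</literal> as unrelated strings, whereas full text
search normalizes both words to the same lexeme:
</para>
<programlisting>
SELECT 'satisfies' LIKE 'satisfy' AS like_match,
       to_tsvector('english', 'satisfies') @@
       to_tsquery('english', 'satisfy') AS fts_match;
 like_match | fts_match
------------+-----------
 f          | t
</programlisting>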
<para>
Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
and an index saved for later rapid searching. Preprocessing includes:
</para>
<itemizedlist mark="none">
<listitem>
<para>
<emphasis>Parsing documents into <firstterm>tokens</firstterm></emphasis>. It is
useful to identify various classes of tokens, e.g., numbers, words,
complex words, email addresses, so that they can be processed
differently. In principle token classes depend on the specific
application, but for most purposes it is adequate to use a predefined
set of classes.
<productname>PostgreSQL</productname> uses a <firstterm>parser</firstterm> to
perform this step. A standard parser is provided, and custom parsers
can be created for specific needs.
</para>
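<para>
For example, the <function>ts_debug</function> function, described later
in this chapter, shows which class the standard parser assigns to each
token:
</para>
<programlisting>
SELECT alias, token
FROM ts_debug('english', 'cat@example.com sat on 4 mats')
WHERE alias &lt;&gt; 'blank';
   alias   |      token
-----------+-----------------
 email     | cat@example.com
 asciiword | sat
 asciiword | on
 uint      | 4
 asciiword | mats
</programlisting>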
</listitem>
<listitem>
<para>
<emphasis>Converting tokens into <firstterm>lexemes</firstterm></emphasis>.
A lexeme is a string, just like a token, but it has been
<firstterm>normalized</firstterm> so that different forms of the same word
are made alike. For example, normalization almost always includes
folding upper-case letters to lower-case, and often involves removal
of suffixes (such as <literal>s</literal> or <literal>es</literal> in English).
This allows searches to find variant forms of the
same word, without tediously entering all the possible variants.
Also, this step typically