<!-- doc/src/sgml/textsearch.sgml -->
<chapter id="textsearch">
<title>Full Text Search</title>
<indexterm zone="textsearch">
<primary>full text search</primary>
</indexterm>
<indexterm zone="textsearch">
<primary>text search</primary>
</indexterm>
<sect1 id="textsearch-intro">
<title>Introduction</title>
<para>
Full Text Searching (or just <firstterm>text search</firstterm>) provides
the capability to identify natural-language <firstterm>documents</firstterm> that
satisfy a <firstterm>query</firstterm>, and optionally to sort them by
relevance to the query. The most common type of search
is to find all documents containing given <firstterm>query terms</firstterm>
and return them in order of their <firstterm>similarity</firstterm> to the
query. Notions of <varname>query</varname> and
<varname>similarity</varname> are very flexible and depend on the specific
application. The simplest search considers <varname>query</varname> as a
set of words and <varname>similarity</varname> as the frequency of query
words in the document.
</para>
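<para>
As a brief preview of the facilities described in the rest of this
chapter, a document can be matched against a query using the
<type>tsvector</type> and <type>tsquery</type> types with the
<literal>@@</literal> match operator (these functions and types are
explained in detail in later sections):
</para>
<programlisting>
SELECT to_tsvector('english', 'a fat cat sat on a mat') @@
       to_tsquery('english', 'cat &amp; mat');
 ?column?
----------
 t
</programlisting>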
<para>
Textual search operators have existed in databases for years.
<productname>PostgreSQL</productname> has
<literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
<literal>ILIKE</literal> operators for textual data types, but they lack
many essential properties required by modern information systems:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
There is no linguistic support, even for English. Regular expressions
are not sufficient because they cannot easily handle derived words, e.g.,
<literal>satisfies</literal> and <literal>satisfy</literal>. You might
miss documents that contain <literal>satisfies</literal>, although you
probably would like to find them when searching for
<literal>satisfy</literal>. It is possible to use <literal>OR</literal>
to search for multiple derived forms, but this is tedious and error-prone
(some words can have several thousand derivatives).
</para>
</listitem>
<listitem>
<para>
They provide no ordering (ranking) of search results, which makes them
ineffective when thousands of matching documents are found.
</para>
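<para>
For comparison, the full text search facilities described later in this
chapter can compute such an ordering; for example, the
<function>ts_rank</function> function assigns a relevance score to a
document for a given query:
</para>
<programlisting>
SELECT ts_rank(to_tsvector('english', 'a fat cat sat on a mat'),
               to_tsquery('english', 'cat'));
</programlisting>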
</listitem>
<listitem>
<para>
They tend to be slow because there is no index support, so they must
process all documents for every search.
</para>
</listitem>
</itemizedlist>
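<para>
The difference in linguistic handling can be seen directly: a
<literal>LIKE</literal> comparison treats <literal>satisfies</literal>
and <literal>satisfy</literal> as unrelated strings, whereas full text
search normalizes both words to the same lexeme:
</para>
<programlisting>
SELECT 'satisfies' LIKE 'satisfy' AS like_match,
       to_tsvector('english', 'satisfies') @@
       to_tsquery('english', 'satisfy') AS fts_match;
 like_match | fts_match
------------+-----------
 f          | t
</programlisting>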
<para>
Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
and an index saved for later rapid searching. Preprocessing includes:
</para>
<itemizedlist mark="none">
<listitem>
<para>
<emphasis>Parsing documents into <firstterm>tokens</firstterm></emphasis>. It is
useful to identify various classes of tokens, e.g., numbers, words,
complex words, email addresses, so that they can be processed
differently. In principle token classes depend on the specific
application, but for most purposes it is adequate to use a predefined
set of classes.
<productname>PostgreSQL</productname> uses a <firstterm>parser</firstterm> to
perform this step. A standard parser is provided, and custom parsers
can be created for specific needs.
</para>
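<para>
For example, the <function>ts_debug</function> function, described later
in this chapter, shows which class the standard parser assigns to each
token:
</para>
<programlisting>
SELECT alias, token
FROM ts_debug('english', 'cat@example.com sat on 4 mats')
WHERE alias &lt;&gt; 'blank';
   alias   |      token
-----------+-----------------
 email     | cat@example.com
 asciiword | sat
 asciiword | on
 uint      | 4
 asciiword | mats
</programlisting>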
</listitem>
<listitem>
<para>
<emphasis>Converting tokens into <firstterm>lexemes</firstterm></emphasis>.
A lexeme is a string, just like a token, but it has been
<firstterm>normalized</firstterm> so that different forms of the same word
are made alike. For example, normalization almost always includes
folding upper-case letters to lower-case, and often involves removal
of suffixes (such as <literal>s</literal> or <literal>es</literal> in English).
This allows searches to find variant forms of the
same word, without tediously entering all the possible variants.
Also, this step typically