postgresql

1st chunk of `doc/src/sgml/textsearch.sgml`
<!-- doc/src/sgml/textsearch.sgml -->

<chapter id="textsearch">
 <title>Full Text Search</title>

  <indexterm zone="textsearch">
   <primary>full text search</primary>
  </indexterm>

  <indexterm zone="textsearch">
   <primary>text search</primary>
  </indexterm>

 <sect1 id="textsearch-intro">
  <title>Introduction</title>

  <para>
   Full Text Searching (or just <firstterm>text search</firstterm>) provides
   the capability to identify natural-language <firstterm>documents</firstterm> that
   satisfy a <firstterm>query</firstterm>, and optionally to sort them by
   relevance to the query.  The most common type of search
   is to find all documents containing given <firstterm>query terms</firstterm>
   and return them in order of their <firstterm>similarity</firstterm> to the
   query.  The notions of <varname>query</varname> and
   <varname>similarity</varname> are very flexible and depend on the specific
   application.  The simplest search treats the <varname>query</varname> as a
   set of words and <varname>similarity</varname> as the frequency of query
   words in the document.
  </para>
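
  <para>
   As a rough illustration of that simplest notion of similarity (a sketch
   only, not how <productname>PostgreSQL</productname> text search is
   implemented), the following query counts how often the words of a
   hypothetical query, <literal>cat</literal> and <literal>mat</literal>,
   occur in a made-up sample document.  The sample text and the word list
   are assumptions chosen for the example.
  </para>

<programlisting>
-- Split the sample document into lower-case words on whitespace,
-- then count the words that belong to the query's word set.
SELECT count(*) AS similarity
FROM regexp_split_to_table(
         lower('The cat sat on the mat with another cat'),
         '\s+') AS word
WHERE word = ANY (ARRAY['cat', 'mat']);
-- similarity is 3 here: "cat" occurs twice and "mat" once
</programlisting>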

  <para>
   Textual search operators have existed in databases for years.
   <productname>PostgreSQL</productname> has
   <literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
   <literal>ILIKE</literal> operators for textual data types, but they lack
   many essential properties required by modern information systems:
  </para>

  <itemizedlist  spacing="compact" mark="bullet">
   <listitem>
    <para>
     There is no linguistic support, even for English.  Regular expressions
     are not sufficient because they cannot easily handle derived words, e.g.,
     <literal>satisfies</literal> and <literal>satisfy</literal>. You might
     miss documents that contain <literal>satisfies</literal>, although you
     probably would like to find them when searching for
     <literal>satisfy</literal>. It is possible to use <literal>OR</literal>
     to search for multiple derived forms, but this is tedious and error-prone
     (some words can have several thousand derivatives).
    </para>
   </listitem>

   <listitem>
    <para>
     They provide no ordering (ranking) of search results, which makes them
     ineffective when thousands of matching documents are found.
    </para>
   </listitem>

   <listitem>
    <para>
     They tend to be slow because there is no index support, so they must
     process all documents for every search.
    </para>
   </listitem>
  </itemizedlist>
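
  <para>
   The first limitation can be seen in a small sketch.  The sample sentence
   is made up for illustration, and the functions
   <function>to_tsvector</function> and <function>to_tsquery</function> and
   the <literal>@@</literal> match operator are the text search facilities
   described later in this chapter, used here with the built-in
   <literal>english</literal> configuration.
  </para>

<programlisting>
-- LIKE performs a literal substring match, so the derived form
-- "satisfies" is missed when searching for "satisfy"; this returns false.
SELECT 'This design satisfies the requirement' LIKE '%satisfy%' AS like_match;

-- Text search normalizes both the document and the query before matching,
-- so this is expected to return true, assuming the english configuration
-- reduces both word forms to the same lexeme.
SELECT to_tsvector('english', 'This design satisfies the requirement')
       @@ to_tsquery('english', 'satisfy') AS fts_match;
</programlisting>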

  <para>
   Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
   and an index saved for later rapid searching. Preprocessing includes:
  </para>

  <itemizedlist  mark="none">
   <listitem>
    <para>
     <emphasis>Parsing documents into <firstterm>tokens</firstterm></emphasis>. It is
     useful to identify various classes of tokens, e.g., numbers, words,
     complex words, email addresses, so that they can be processed
     differently.  In principle, token classes depend on the specific
     application, but for most purposes it is adequate to use a predefined
     set of classes.
     <productname>PostgreSQL</productname> uses a <firstterm>parser</firstterm> to
     perform this step.  A standard parser is provided, and custom parsers
     can be created for specific needs.
    </para>
   </listitem>

   <listitem>
    <para>
     <emphasis>Converting tokens into <firstterm>lexemes</firstterm></emphasis>.
     A lexeme is a string, just like a token, but it has been
     <firstterm>normalized</firstterm> so that different forms of the same word
     are made alike.  For example, normalization almost always includes
     folding upper-case letters to lower-case, and often involves removal
     of suffixes (such as <literal>s</literal> or <literal>es</literal> in English).
     This allows searches to find variant forms of the
     same word, without tediously entering all the possible variants.
     Also, this step typically

Title: Introduction to Full Text Search
Summary
This chapter introduces Full Text Search (text search), which enables identifying and sorting natural-language documents based on their relevance to a query. Traditional database text operators lack linguistic support, ranking capabilities, and indexing for speed. Full text indexing preprocesses documents by parsing them into tokens and converting tokens into normalized lexemes for efficient and accurate searching.