   <listitem>
    <para>
     <emphasis>Parsing documents into <firstterm>tokens</firstterm></emphasis>.
     It is useful to identify various classes of tokens, e.g., numbers,
     words, complex words, email addresses, so that they can be processed
     differently.  In principle token classes depend on the specific
     application, but for most purposes it is adequate to use a predefined
     set of classes.
     <productname>PostgreSQL</productname> uses a <firstterm>parser</firstterm> to
     perform this step.  A standard parser is provided, and custom parsers
     can be created for specific needs.
    </para>
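    <para>
     For example, the raw output of the standard parser can be examined
     with the <function>ts_parse</function> function (token type
     <literal>22</literal> is <literal>uint</literal>,
     <literal>12</literal> is <literal>blank</literal>, and
     <literal>1</literal> is <literal>asciiword</literal>):

<screen>
SELECT * FROM ts_parse('default', '123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number
</screen>
    </para>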
   </listitem>

   <listitem>
    <para>
     <emphasis>Converting tokens into <firstterm>lexemes</firstterm></emphasis>.
     A lexeme is a string, just like a token, but it has been
     <firstterm>normalized</firstterm> so that different forms of the same word
     are made alike.  For example, normalization almost always includes
     folding upper-case letters to lower-case, and often involves removal
     of suffixes (such as <literal>s</literal> or <literal>es</literal> in English).
     This allows searches to find variant forms of the
     same word, without tediously entering all the possible variants.
     Also, this step typically eliminates <firstterm>stop words</firstterm>, which
     are words that are so common that they are useless for searching.
     (In short, then, tokens are raw fragments of the document text, while
     lexemes are words that are believed useful for indexing and searching.)
     <productname>PostgreSQL</productname> uses <firstterm>dictionaries</firstterm> to
     perform this step.  Various standard dictionaries are provided, and
     custom ones can be created for specific needs.
    </para>
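    <para>
     For example, processing the sentence below with the
     <literal>english</literal> configuration folds case, stems
     <literal>rats</literal> to <literal>rat</literal>, and discards stop
     words such as <literal>a</literal>, <literal>on</literal>, and
     <literal>it</literal>:

<screen>
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</screen>
    </para>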
   </listitem>

   <listitem>
    <para>
     <emphasis>Storing preprocessed documents optimized for
     searching</emphasis>.  For example, each document can be represented
     as a sorted array of normalized lexemes. Along with the lexemes it is
     often desirable to store positional information to use for
     <firstterm>proximity ranking</firstterm>, so that a document that
     contains a more <quote>dense</quote> region of query words is
     assigned a higher rank than one with scattered query words.
    </para>
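    <para>
     For instance, the cover-density ranking function
     <function>ts_rank_cd</function> makes use of these stored positions;
     in this illustrative comparison the first document, in which the
     query terms are adjacent, receives the higher score:

<programlisting>
SELECT ts_rank_cd(to_tsvector('english', 'the fat rat sat'),
                  to_tsquery('english', 'fat &amp; rat')) AS dense,
       ts_rank_cd(to_tsvector('english', 'the fat cat chased a mouse and then a rat'),
                  to_tsquery('english', 'fat &amp; rat')) AS scattered;
</programlisting>
    </para>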
   </listitem>
  </itemizedlist>

  <para>
   Dictionaries allow fine-grained control over how tokens are normalized.
   With appropriate dictionaries, you can:
  </para>

  <itemizedlist  spacing="compact" mark="bullet">
   <listitem>
    <para>
     Define stop words that should not be indexed.
    </para>
   </listitem>

   <listitem>
    <para>
     Map synonyms to a single word using <application>Ispell</application>.
    </para>
   </listitem>

   <listitem>
    <para>
     Map phrases to a single word using a thesaurus.
    </para>
   </listitem>

   <listitem>
    <para>
     Map different variations of a word to a canonical form using
     an <application>Ispell</application> dictionary.
    </para>
   </listitem>

   <listitem>
    <para>
     Map different variations of a word to a canonical form using
     <application>Snowball</application> stemmer rules.
    </para>
   </listitem>
  </itemizedlist>
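
  <para>
   The behavior of an individual dictionary can be checked with the
   <function>ts_lexize</function> function.  For example, the built-in
   <application>Snowball</application> stemmer maps inflected forms to a
   canonical form, and returns an empty array for a stop word:

<screen>
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
</screen>
  </para>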

  <para>
   A data type <type>tsvector</type> is provided for storing preprocessed
   documents, along with a type <type>tsquery</type> for representing processed
   queries (<xref linkend="datatype-textsearch"/>).  There are many
   functions and operators available for these data types
   (<xref linkend="functions-textsearch"/>), the most important of which is
   the match operator <literal>@@</literal>, which we introduce in
   <xref linkend="textsearch-matching"/>.  Full text searches can be accelerated
   using indexes (<xref linkend="textsearch-indexes"/>).
  </para>
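
  <para>
   For example, the match operator returns true when the
   <type>tsvector</type> contains all the lexemes required by the
   <type>tsquery</type>:

<screen>
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
 ?column?
----------
 t
</screen>
  </para>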


  <sect2 id="textsearch-document">
   <title>What Is a Document?</title>

   <indexterm zone="textsearch-document">
    <primary>document</primary>
    <secondary>text search</secondary>
   </indexterm>

   <para>
    A <firstterm>document</firstterm> is the unit of searching in a full text search
    system; for example, a magazine article or email message.  The text search
    engine must be able to parse documents and store associations of lexemes
    (key words) with their parent document.  Later, these associations are
    used to search for documents containing query words.
   </para>
