postgresql

30th chunk of `doc/src/sgml/textsearch.sgml`
 <firstterm>lexeme</firstterm>.  Aside from
   improving search quality, normalization and removal of stop words reduce the
   size of the <type>tsvector</type> representation of a document, thereby
   improving performance.  Normalization does not always have linguistic meaning
   and usually depends on application semantics.
  </para>
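
  <para>
   As a rough illustration of both effects, assuming the built-in
   <literal>english</literal> configuration with its default stop word
   list, <function>to_tsvector</function> drops stop words such as
   <literal>a</literal> and <literal>on</literal> and reduces
   <literal>rats</literal> to the lexeme <literal>rat</literal>:

<programlisting>
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</programlisting>
  </para>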

  <para>
   Some examples of normalization:

   <itemizedlist  spacing="compact" mark="bullet">

    <listitem>
     <para>
      Linguistic &mdash; Ispell dictionaries try to reduce input words to a
      normalized form; stemmer dictionaries remove word endings
     </para>
    </listitem>
    <listitem>
     <para>
      <acronym>URL</acronym> locations can be canonicalized to make
      equivalent URLs match:

      <itemizedlist  spacing="compact" mark="bullet">
       <listitem>
        <para>
         http://www.pgsql.ru/db/mw/index.html
        </para>
       </listitem>
       <listitem>
        <para>
         http://www.pgsql.ru/db/mw/
        </para>
       </listitem>
       <listitem>
        <para>
         http://www.pgsql.ru/db/../db/mw/index.html
        </para>
       </listitem>
      </itemizedlist>
     </para>
    </listitem>
    <listitem>
     <para>
      Color names can be replaced by their hexadecimal values, e.g.,
      <literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
     </para>
    </listitem>
    <listitem>
     <para>
      When indexing numbers, some fractional digits can be removed to
      reduce the range of possible values; for example,
      <emphasis>3.14</emphasis>159265359, <emphasis>3.14</emphasis>15926,
      and <emphasis>3.14</emphasis> all become the same value
      after normalization if only two digits are kept after the decimal point.
     </para>
    </listitem>
   </itemizedlist>

  </para>

  <para>
   A dictionary is a program that accepts a token as
   input and returns:
   <itemizedlist  spacing="compact" mark="bullet">
    <listitem>
     <para>
      an array of lexemes if the input token is known to the dictionary
      (notice that one token can produce more than one lexeme)
     </para>
    </listitem>
    <listitem>
     <para>
      a single lexeme with the <literal>TSL_FILTER</literal> flag set, to replace
      the original token with a new token to be passed to subsequent
      dictionaries (a dictionary that does this is called a
      <firstterm>filtering dictionary</firstterm>)
     </para>
    </listitem>
    <listitem>
     <para>
      an empty array if the dictionary knows the token, but it is a stop word
     </para>
    </listitem>
    <listitem>
     <para>
      <literal>NULL</literal> if the dictionary does not recognize the input token
     </para>
    </listitem>
   </itemizedlist>
  </para>
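
  <para>
   Two of these return values can be observed by calling a single dictionary
   directly with <function>ts_lexize</function>.  The following is only a
   sketch; it assumes the built-in <literal>english_stem</literal>
   dictionary with its default stop word list:

<programlisting>
-- a known word: an array of lexemes is returned
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

-- a stop word: an empty array is returned
SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
</programlisting>
  </para>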

  <para>
   <productname>PostgreSQL</productname> provides predefined dictionaries for
   many languages.  There are also several predefined templates that can be
   used to create new dictionaries with custom parameters.  Each predefined
   dictionary template is described below.  If no existing
   template is suitable, it is possible to create new ones; see the
   <filename>contrib/</filename> area of the <productname>PostgreSQL</productname> distribution
   for examples.
  </para>
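
  <para>
   As a minimal sketch of creating a dictionary from a predefined template,
   the following uses the built-in <literal>simple</literal> template with
   the <literal>english</literal> stop word file.  The dictionary name
   <literal>public.simple_dict</literal> is only an illustrative choice:

<programlisting>
-- public.simple_dict is a hypothetical name chosen for this example
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);
</programlisting>
  </para>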

  <para>
   A text search configuration binds a parser together with a set of
   dictionaries to process the parser's output tokens.  For each token
   type that the parser can return, a separate list of dictionaries is
   specified by the configuration.  When a token of that type is found
   by the parser, each dictionary in the list is consulted in turn,
   until some dictionary recognizes it as a known word.  If the token is
   identified as a stop word, or if no dictionary recognizes it, it is
   discarded and is neither indexed nor searched for.
   Normally, the first dictionary that returns a non-<literal>NULL</literal>
   output determines the result, and any remaining dictionaries are not
   consulted; but a filtering dictionary

Title: Normalization Examples and Dictionary Functionality
Summary
The text provides examples of normalization, including linguistic normalization with Ispell and stemmers, URL canonicalization, color name replacement with hexadecimal values, and fractional digit removal. It explains that a dictionary takes a token as input and can return an array of lexemes, a single lexeme with the TSL_FILTER flag, an empty array for stop words, or NULL if the token is not recognized. PostgreSQL offers predefined dictionaries and templates, and custom templates can be created. A text search configuration combines a parser with dictionaries, processing tokens and discarding them if identified as stop words or unrecognized.