postgresql

29th chunk of `doc/src/sgml/textsearch.sgml`
 reported as a separate token type,
    since it is sometimes useful to distinguish them.  In most European
    languages, token types <literal>word</literal> and <literal>asciiword</literal>
    should be treated alike.
   </para>

   <para>
    <literal>email</literal> does not support all valid email characters as
    defined by <ulink url="https://datatracker.ietf.org/doc/html/rfc5322">RFC 5322</ulink>.
    Specifically, the only non-alphanumeric characters supported for
    email user names are period, dash, and underscore.
   </para>

   <para>
    <literal>tag</literal> does not support all valid tag names as defined by
    <ulink url="https://www.w3.org/TR/xml/">W3C Recommendation, XML</ulink>.
    Specifically, the only tag names supported are those starting with an
    ASCII letter, underscore, or colon, and containing only letters, digits,
    hyphens, underscores, periods, and colons. <literal>tag</literal> also
    includes XML comments starting with <literal>&lt;!--</literal> and ending
    with <literal>--&gt;</literal>, and XML declarations (but note that this
    includes anything starting with <literal>&lt;?x</literal> and ending with
    <literal>&gt;</literal>).
   </para>
  </note>
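
  <para>
   To check how these restrictions affect a particular input, the text can
   be passed through <function>ts_debug</function>.  The following is a
   minimal sketch using made-up addresses: the first user name contains only
   a period, which is supported, while the second contains a plus sign, so
   it is not expected to be reported as a single
   <literal>email</literal> token:

<programlisting>
-- Hypothetical addresses, chosen only for illustration.
SELECT alias, token FROM ts_debug('first.last@example.com');
SELECT alias, token FROM ts_debug('first+tag@example.com');
</programlisting>
  </para>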

  <para>
   It is possible for the parser to produce overlapping tokens from the same
   piece of text.  As an example, a hyphenated word will be reported both
   as the entire word and as each component:

<screen>
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1
</screen>

   This behavior is desirable because it lets searches match both
   the whole compound word and its individual components.  Here is another
   instructive example:

<screen>
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
</screen>
  </para>
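
  <para>
   The practical effect of overlapping tokens can be sketched with a match
   test.  The example below assumes the built-in <literal>simple</literal>
   configuration, so that no stemming or stop-word removal interferes; both
   columns are expected to be true, since the compound word and each of its
   parts are indexed:

<programlisting>
SELECT to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'bar')
         AS matches_component,
       to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'foo-bar-beta1')
         AS matches_whole_word;
</programlisting>
  </para>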

 </sect1>

 <sect1 id="textsearch-dictionaries">
  <title>Dictionaries</title>

  <para>
   Dictionaries are used to eliminate words that should not be considered in a
   search (<firstterm>stop words</firstterm>), and to <firstterm>normalize</firstterm> words so
   that different derived forms of the same word will match.  A successfully
   normalized word is called a <firstterm>lexeme</firstterm>.  Aside from
   improving search quality, normalization and removal of stop words reduce the
   size of the <type>tsvector</type> representation of a document, thereby
   improving performance.  Normalization is not always linguistic in nature;
   what counts as equivalent usually depends on application semantics.
  </para>
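
  <para>
   Both roles of a dictionary can be observed with
   <function>ts_lexize</function>.  The sketch below assumes the
   <literal>english_stem</literal> dictionary is installed, as it is in a
   default installation: the first call should return the lexeme
   <literal>star</literal>, and the second should return an empty array
   because <literal>a</literal> is a stop word:

<programlisting>
-- Normalization: a derived form is reduced to its lexeme.
SELECT ts_lexize('english_stem', 'stars');

-- Stop word: the dictionary recognizes the word and discards it.
SELECT ts_lexize('english_stem', 'a');
</programlisting>
  </para>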

  <para>
   Some examples of normalization:

   <itemizedlist  spacing="compact" mark="bullet">

    <listitem>
     <para>
      Linguistic &mdash; Ispell dictionaries try to reduce input words to a
      normalized form; stemmer dictionaries remove word endings
     </para>
    </listitem>
    <listitem>
     <para>
      <acronym>URL</acronym> locations can be canonicalized to make
      equivalent URLs match:

      <itemizedlist  spacing="compact" mark="bullet">
       <listitem>
        <para>
         http://www.pgsql.ru/db/mw/index.html
        </para>
       </listitem>
       <listitem>
        <para>
         http://www.pgsql.ru/db/mw/
       

Title: Overlapping Tokens and Dictionary Introduction
Summary
The parser can produce overlapping tokens from the same text, such as for hyphenated words and URLs, allowing searches for both the whole compound and its components. The section then transitions to discussing dictionaries in PostgreSQL, which are used to eliminate stop words and normalize words into lexemes, improving search quality, reducing tsvector size, and enhancing performance. Normalization can be linguistic or application-specific, such as canonicalizing URLs.