postgresql

27th chunk of `doc/src/sgml/textsearch.sgml`

<programlisting>
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
</programlisting>

    The same, but counting only word occurrences with weight <literal>A</literal>
    or <literal>B</literal>:

<programlisting>
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
</programlisting>
   </para>

  </sect2>

 </sect1>

 <sect1 id="textsearch-parsers">
  <title>Parsers</title>

  <para>
   Text search parsers are responsible for splitting raw document text
   into <firstterm>tokens</firstterm> and identifying each token's type, where
   the set of possible types is defined by the parser itself.
   Note that a parser does not modify the text at all &mdash; it simply
   identifies plausible word boundaries.  Because of this limited scope,
   there is less need for application-specific custom parsers than there is
   for custom dictionaries.  At present <productname>PostgreSQL</productname>
   provides just one built-in parser, which has been found to be useful for a
   wide range of applications.
  </para>
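
  <para>
   As a rough illustration of this division of labor, the standard
   <function>ts_parse</function> function can be used to look at the tokens
   a parser identifies in a string.  The sample text below is made up for
   this example; each result row pairs a numeric token type with the token
   text exactly as it appears in the input.

<programlisting>
-- Sample text is illustrative only; 'default' is the built-in parser.
SELECT tokid, token
FROM ts_parse('default', 'Contact foo@example.com about postgresql-beta1');
</programlisting>

   Any normalization of the tokens, such as case folding or stemming,
   is left to the dictionaries that process them afterwards.
  </para>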

  <para>
   The built-in parser is named <literal>pg_catalog.default</literal>.
   It recognizes 23 token types, shown in <xref linkend="textsearch-default-parser"/>.
  </para>
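
  <para>
   One way to check this list against a running server, sketched here on
   the assumption that the standard <function>ts_token_type</function>
   function is available, is to ask the parser itself for its token types:

<programlisting>
-- Lists one row per token type known to the built-in parser.
SELECT alias, description
FROM ts_token_type('default')
ORDER BY tokid;
</programlisting>

   Each row returned corresponds to one token type, so counting the rows
   is a quick sanity check on the number quoted above.
  </para>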

  <table id="textsearch-default-parser">
   <title>Default Parser's Token Types</title>
   <tgroup cols="3">
    <colspec colname="col1" colwidth="2*"/>
    <colspec colname="col2" colwidth="2*"/>
    <colspec colname="col3" colwidth="3*"/>
    <thead>
     <row>
      <entry>Alias</entry>
      <entry>Description</entry>
      <entry>Example</entry>
     </row>
    </thead>
    <tbody>
     <row>
      <entry><literal>asciiword</literal></entry>
      <entry>Word, all ASCII letters</entry>
      <entry><literal>elephant</literal></entry>
     </row>
     <row>
      <entry><literal>word</literal></entry>
      <entry>Word, all letters</entry>
      <entry><literal>ma&ntilde;ana</literal></entry>
     </row>
     <row>
      <entry><literal>numword</literal></entry>
      <entry>Word, letters and digits</entry>
      <entry><literal>beta1</literal></entry>
     </row>
     <row>
      <entry><literal>asciihword</literal></entry>
      <entry>Hyphenated word, all ASCII</entry>
      <entry><literal>up-to-date</literal></entry>
     </row>
     <row>
      <entry><literal>hword</literal></entry>
      <entry>Hyphenated word, all letters</entry>
      <entry><literal>l&oacute;gico-matem&aacute;tica</literal></entry>
     </row>
     <row>
      <entry><literal>numhword</literal></entry>
      <entry>Hyphenated word, letters and digits</entry>
      <entry><literal>postgresql-beta1</literal></entry>
     </row>
     <row>
      <entry><literal>hword_asciipart</literal></entry>
      <entry>Hyphenated word part, all ASCII</entry>
      <entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry>
     </row>
     <row>
      <entry><literal>hword_part</literal></entry>
      <entry>Hyphenated word part, all letters</entry>
      <entry><literal>l&oacute;gico</literal> or <literal>matem&aacute;tica</literal>
       in the context <literal>l&oacute;gico-matem&aacute;tica</literal></entry>
     </row>
     <row>
      <entry><literal>hword_numpart</literal></entry>
      <entry>Hyphenated word part, letters and digits</entry>
      <entry><literal>beta1</literal> in the context
       <literal>postgresql-beta1</literal></entry>
     </row>
     <row>
      <entry><literal>email</literal></entry>
      <entry>Email address</entry>
      <entry><literal>foo@example.com</literal></entry>
     </row>
     <row>
      <entry><literal>protocol</literal></entry>
      <entry>Protocol head</entry>
      <entry><literal>http://</literal></entry>
     </row>
     <row>
      <entry><literal>url</literal></entry>
      <entry>URL</entry>
      <entry><literal>example.com/stuff/index.html</literal></entry>
     </row>
     <row>
      <entry><literal>host</literal></entry>

Title: Text Search Parsers in PostgreSQL
Summary
This section discusses text search parsers, which split raw document text into tokens and identify each token's type. The built-in parser in PostgreSQL, 'pg_catalog.default', recognizes 23 token types, including asciiword, word, numword, hyphenated words, email, protocol, URL, host, and various numeric and special character types. A table provides descriptions and examples for each token type.