postgresql

10th chunk of `doc/src/sgml/textsearch.sgml`
 id="textsearch-parsing-documents">
   <title>Parsing Documents</title>

   <para>
    <productname>PostgreSQL</productname> provides the
    function <function>to_tsvector</function> for converting a document to
    the <type>tsvector</type> data type.
   </para>

   <indexterm>
    <primary>to_tsvector</primary>
   </indexterm>

<synopsis>
to_tsvector(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>) returns <type>tsvector</type>
</synopsis>

   <para>
    <function>to_tsvector</function> parses a textual document into tokens,
    reduces the tokens to lexemes, and returns a <type>tsvector</type> which
    lists the lexemes together with their positions in the document.
    The document is processed according to the specified or default
    text search configuration.
    Here is a simple example:

<screen>
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</screen>
   </para>

   <para>
    In the example above, the resulting <type>tsvector</type> does not
    contain the words <literal>a</literal>, <literal>on</literal>, or
    <literal>it</literal>; the word <literal>rats</literal> became
    <literal>rat</literal>; and the punctuation sign <literal>-</literal> was
    ignored.
   </para>
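
   <para>
    As a minimal sketch of how the configuration argument affects the
    result, the queries below run the same document through three
    configurations.  The first omits the optional
    <replaceable class="parameter">config</replaceable> argument, so the
    setting <varname>default_text_search_config</varname> applies; its
    result matches the example above only if that setting happens to be
    <literal>english</literal> in the session, which is an assumption here.
    The last query uses the predefined <literal>simple</literal>
    configuration, which neither stems words nor removes stop words.
   </para>

<programlisting>
-- Show which configuration is used when no config argument is given.
SHOW default_text_search_config;

-- Processed with the session default configuration.
SELECT to_tsvector('a fat  cat sat on a mat - it ate a fat rats');

-- Processed with an explicitly named configuration.
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');

-- The simple configuration keeps every word as written (lowercased).
SELECT to_tsvector('simple', 'a fat  cat sat on a mat - it ate a fat rats');
</programlisting>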

   <para>
    The <function>to_tsvector</function> function internally calls a parser
    which breaks the document text into tokens and assigns a type to
    each token.  For each token, a list of
    dictionaries (<xref linkend="textsearch-dictionaries"/>) is consulted,
    where the list can vary depending on the token type.  The first dictionary
    that <firstterm>recognizes</firstterm> the token emits one or more normalized
    <firstterm>lexemes</firstterm> to represent the token.  For example,
    <literal>rats</literal> became <literal>rat</literal> because one of the
    dictionaries recognized that the word <literal>rats</literal> is a plural
    form of <literal>rat</literal>.  Some words are recognized as
    <firstterm>stop words</firstterm> (<xref linkend="textsearch-stopwords"/>), which
    causes them to be ignored since they occur too frequently to be useful in
    searching.  In our example these are
    <literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
    If no dictionary in the list recognizes the token, then it is also ignored.
    In this example that happened to the punctuation sign <literal>-</literal>
    because there are in fact no dictionaries assigned for its token type
    (<literal>Space symbols</literal>), meaning space tokens will never be
    indexed. The choices of parser, dictionaries and which types of tokens to
    index are determined by the selected text search configuration (<xref
    linkend="textsearch-configuration"/>).  It is possible to have
    many different configurations in the same database, and predefined
    configurations are available for various languages. In our example
    we used the predefined configuration <literal>english</literal> for the
    English language.
   </para>
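
   <para>
    The parsing and dictionary steps described above can be inspected
    directly.  The following is a sketch rather than a definitive recipe:
    it assumes a stock installation in which the
    <literal>english</literal> configuration and the
    <literal>english_stem</literal> dictionary exist unmodified, and the
    sample string is chosen only for illustration.
    <function>ts_debug</function> reports, for each token, its type, the
    dictionaries consulted, and the lexemes produced;
    <function>ts_lexize</function> tests a single dictionary on a single
    token.
   </para>

<programlisting>
-- One row per token: its type, the dictionary list for that type,
-- the dictionary that recognized it (if any), and the resulting lexemes.
SELECT alias, description, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'a fat cat - rats');

-- Ask one dictionary how it normalizes one token.
SELECT ts_lexize('english_stem', 'rats');

-- A stop word is recognized but yields an empty lexeme array.
SELECT ts_lexize('english_stem', 'a');

-- List the text search configurations available in this database.
SELECT cfgname FROM pg_ts_config;
</programlisting>

   <para>
    In the <function>ts_debug</function> output, rows whose
    <structfield>dictionaries</structfield> column is empty correspond to
    token types that have no dictionaries assigned, such as
    <literal>Space symbols</literal>; those tokens are never indexed.
   </para>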

   <para>
    The function <function>setweight</function> can be used to label the
    entries of a <type>tsvector</type> with a given <firstterm>weight</firstterm>,
    where a weight is one of the letters <literal>A</literal>, <literal>B</literal>,
    <literal>C</literal>, or <literal>D</literal>.
    This is typically used to mark entries coming from
    different parts of a document, such as title versus body.  Later, this
    information can be used for ranking of search results.
   </para>
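
   <para>
    As a minimal sketch of the title-versus-body case, the query below
    labels the lexemes from one string with weight <literal>A</literal>
    and those from another with weight <literal>D</literal>, then
    concatenates the two vectors with the <literal>||</literal> operator.
    Both input strings are hypothetical stand-ins for a document's title
    and body.
   </para>

<programlisting>
-- Title lexemes get weight A, body lexemes get weight D;
-- || merges the two tsvectors into one.
SELECT setweight(to_tsvector('english', 'The Fat Rats'), 'A') ||
       setweight(to_tsvector('english', 'a fat cat sat on a mat'), 'D');
</programlisting>

   <para>
    Weight <literal>D</literal> is the default, so it is not shown in the
    output; entries labeled <literal>A</literal> through
    <literal>C</literal> carry the letter after their position.
   </para>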

   <para>
    Because <function>to_tsvector</function>(<literal>NULL</literal>) will
    return <literal>NULL</literal>, it is recommended to use
    <function>coalesce</function> whenever a field might

Title: Using to_tsvector for Document Parsing
Summary
PostgreSQL's `to_tsvector` function converts a text document into a `tsvector` data type. It parses the document into tokens, reduces them to lexemes, and lists them with their positions, based on the specified text search configuration. The process involves using a parser to break the document into tokens, consulting dictionaries for normalization, identifying and ignoring stop words, and determining which token types to index. The `setweight` function can label tsvector entries with weights (A, B, C, D) for ranking search results, and `coalesce` is recommended for handling potentially NULL fields.