id="textsearch-parsing-documents">
<title>Parsing Documents</title>
<para>
<productname>PostgreSQL</productname> provides the
function <function>to_tsvector</function> for converting a document to
the <type>tsvector</type> data type.
</para>
<indexterm>
<primary>to_tsvector</primary>
</indexterm>
<synopsis>
to_tsvector(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>) returns <type>tsvector</type>
</synopsis>
<para>
<function>to_tsvector</function> parses a textual document into tokens,
reduces the tokens to lexemes, and returns a <type>tsvector</type> which
lists the lexemes together with their positions in the document.
The document is processed according to the specified or default
text search configuration.
Here is a simple example:
<screen>
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
to_tsvector
-----------------------------------------------------
'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</screen>
</para>
<para>
In the example above we see that the resulting <type>tsvector</type> does not
contain the words <literal>a</literal>, <literal>on</literal>, or
<literal>it</literal>; the word <literal>rats</literal> became
<literal>rat</literal>; and the punctuation sign <literal>-</literal> was
ignored.
</para>
<para>
The <function>to_tsvector</function> function internally calls a parser
which breaks the document text into tokens and assigns a type to
each token. For each token, a list of
dictionaries (<xref linkend="textsearch-dictionaries"/>) is consulted,
where the list can vary depending on the token type. The first dictionary
that <firstterm>recognizes</firstterm> the token emits one or more normalized
<firstterm>lexemes</firstterm> to represent the token. For example,
<literal>rats</literal> became <literal>rat</literal> because one of the
dictionaries recognized that the word <literal>rats</literal> is a plural
form of <literal>rat</literal>. Some words are recognized as
<firstterm>stop words</firstterm> (<xref linkend="textsearch-stopwords"/>), which
causes them to be ignored since they occur too frequently to be useful in
searching. In our example these are
<literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
If no dictionary in the list recognizes the token then it is also ignored.
In this example that happened to the punctuation sign <literal>-</literal>
because there are in fact no dictionaries assigned for its token type
(<literal>Space symbols</literal>), meaning space tokens will never be
indexed. The choices of parser, dictionaries and which types of tokens to
index are determined by the selected text search configuration (<xref
linkend="textsearch-configuration"/>). It is possible to have
many different configurations in the same database, and predefined
configurations are available for various languages. In our example
we used the predefined configuration <literal>english</literal> for the
English language.
</para>
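<para>
One way to observe this process is the <function>ts_debug</function>
function, which reports, for each token found by the parser, its token type,
the dictionaries consulted, and the lexemes produced. The following query is
only a sketch that reuses the sample sentence from above; the rows returned
depend on the text search configuration in use:
<programlisting>
-- Show how each token is classified and which dictionary, if any, handled it.
-- Stop words appear with an empty lexeme list; tokens that no dictionary
-- recognizes appear with null dictionary and lexemes columns.
SELECT alias, description, token, dictionary, lexemes
FROM ts_debug('english', 'a fat cat sat on a mat - it ate a fat rats');
</programlisting>
</para>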
<para>
The function <function>setweight</function> can be used to label the
entries of a <type>tsvector</type> with a given <firstterm>weight</firstterm>,
where a weight is one of the letters <literal>A</literal>, <literal>B</literal>,
<literal>C</literal>, or <literal>D</literal>.
This is typically used to mark entries coming from
different parts of a document, such as the title versus the body. This
information can later be used to rank search results.
</para>
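<para>
As a minimal sketch, the following query labels the lexemes of a title
string with weight <literal>A</literal> and those of a body string with
weight <literal>B</literal>, then combines the two vectors with the
<literal>||</literal> concatenation operator. The two string literals are
illustrative only and stand in for real document fields:
<programlisting>
-- Title lexemes get weight A, body lexemes get weight B.
-- Positions in the second vector are offset by the concatenation.
SELECT setweight(to_tsvector('english', 'fat cats'), 'A') ||
       setweight(to_tsvector('english', 'a fat cat sat on a mat'), 'B');
</programlisting>
</para>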
<para>
Because <function>to_tsvector</function>(<literal>NULL</literal>) will
return <literal>NULL</literal>, it is recommended to use
<function>coalesce</function> whenever a field might