Manipulating tsvector Documents: Concatenation, Weighting, and Length

<sect2 id="textsearch-manipulate-tsvector">
 <title>Manipulating Documents</title>

 <para>
  <xref linkend="textsearch-parsing-documents"/> showed how raw textual
  documents can be converted into <type>tsvector</type> values.
  <productname>PostgreSQL</productname> also provides functions and
  operators that can be used to manipulate documents that are already
  in <type>tsvector</type> form.
 </para>

 <variablelist>

  <varlistentry>
   <term>
    <indexterm>
     <primary>tsvector concatenation</primary>
    </indexterm>
    <literal><type>tsvector</type> || <type>tsvector</type></literal>
   </term>

   <listitem>
    <para>
     The <type>tsvector</type> concatenation operator
     returns a vector which combines the lexemes and positional information
     of the two vectors given as arguments.  Positions and weight labels
     are retained during the concatenation.
     Positions appearing in the right-hand vector are offset by the largest
     position mentioned in the left-hand vector, so that the result is
     nearly equivalent to the result of performing <function>to_tsvector</function>
     on the concatenation of the two original document strings.  (The
     equivalence is not exact, because any stop-words removed from the
     end of the left-hand argument will not affect the result, whereas
     they would have affected the positions of the lexemes in the
     right-hand argument if textual concatenation were used.)
    </para>

    <para>
     One advantage of using concatenation in the vector form, rather than
     concatenating text before applying <function>to_tsvector</function>, is that
     you can use different configurations to parse different sections
     of the document.  Also, because the <function>setweight</function> function
     marks all lexemes of the given vector the same way, it is necessary
     to parse the text and do <function>setweight</function> before concatenating
     if you want to label different parts of the document with different
     weights.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>
    <indexterm>
     <primary>setweight</primary>
    </indexterm>
    <literal>setweight(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">weight</replaceable> <type>"char"</type>) returns <type>tsvector</type></literal>
   </term>

   <listitem>
    <para>
     <function>setweight</function> returns a copy of the input vector in which every
     position has been labeled with the given <replaceable>weight</replaceable>, either
     <literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or
     <literal>D</literal>.  (<literal>D</literal> is the default for new
     vectors and as such is not displayed on output.)  These labels are
     retained when vectors are concatenated, allowing words from different
     parts of a document to be weighted differently by ranking functions.
    </para>

    <para>
     Note that weight labels apply to <emphasis>positions</emphasis>, not
     <emphasis>lexemes</emphasis>.  If the input vector has been stripped of
     positions then <function>setweight</function> does nothing.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>
    <indexterm>
     <primary>length(tsvector)</primary>
    </indexterm>
    <literal>length(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>) returns <type>integer</type></literal>
   </term>

   <listitem>
    <para>
     Returns the number of lexemes stored in the vector.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>
    <indexterm>
     <primary>strip</primary>
    </indexterm>
    <literal>strip(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>) returns <type>tsvector</type></literal>

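PostgreSQL provides functions and operators for manipulating documents that are already in tsvector form. The concatenation operator (||) combines the lexemes and positional information of two tsvectors, offsetting each position in the right-hand vector by the largest position in the left-hand vector; weight labels are retained. Concatenating vectors rather than text allows different sections of a document to be parsed with different configurations. The setweight function labels every position in a tsvector with a weight (A, B, C, or D, where D is the default) that ranking functions can use. Because it labels all positions identically, sections that need different weights must be weighted before they are concatenated, and a vector stripped of positions is returned unchanged. The length function returns the number of lexemes stored in a tsvector.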
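
The behavior summarized above can be sketched with a few queries. This is a minimal illustration, not a definitive recipe: the tsvector literals are arbitrary sample values, the text passed to to_tsvector in the last query is hypothetical, and that query assumes the built-in english configuration is available. No result is shown for the last query because it depends on the configuration in use.

```sql
-- Concatenation: positions in the right-hand vector are shifted by 2,
-- the largest position mentioned in the left-hand vector.
SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector;
-- 'a':1 'b':2,5 'c':3 'd':4

-- setweight labels every position with the given weight,
-- replacing any label already present (here, the B on rat).
SELECT setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A');
-- 'cat':3A 'fat':2A,4A 'rat':5A

-- Weights attach to positions, so a vector without positions
-- comes back unchanged.
SELECT setweight('fat cat'::tsvector, 'A');
-- 'cat' 'fat'

-- length counts lexemes, not positions: fat occurs twice but counts once.
SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);
-- 3

-- Weight each section before concatenating (hypothetical title and body text).
SELECT setweight(to_tsvector('english', 'The Fat Rats'), 'A') ||
       setweight(to_tsvector('english', 'A story about rats'), 'D');
```

Weighting each part separately and then concatenating is what lets ranking functions treat, for example, title words differently from body words within one combined vector.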