Home Explore Blog CI



postgresql

51th chunk of `doc/src/sgml/datatype.sgml`
f62d9f577d1006718e26df90b4cc94fe0a83df44fdc498370000000100000fa6
 <firstterm>documents</firstterm>
    to locate those that best match a <firstterm>query</firstterm>.
    The <type>tsvector</type> type represents a document in a form optimized
    for text search; the <type>tsquery</type> type similarly represents
    a text query.
    <xref linkend="textsearch"/> provides a detailed explanation of this
    facility, and <xref linkend="functions-textsearch"/> summarizes the
    related functions and operators.
   </para>

   <sect2 id="datatype-tsvector">
    <title><type>tsvector</type></title>

    <indexterm>
     <primary>tsvector (data type)</primary>
    </indexterm>

    <para>
     A <type>tsvector</type> value is a sorted list of distinct
     <firstterm>lexemes</firstterm>, which are words that have been
     <firstterm>normalized</firstterm> to merge different variants of the same word
     (see <xref linkend="textsearch"/> for details).  Sorting and
     duplicate-elimination are done automatically during input, as shown in
     this example:

<programlisting>
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
</programlisting>

     To represent
     lexemes containing whitespace or punctuation, surround them with quotes:

<programlisting>
SELECT $$the lexeme '    ' contains spaces$$::tsvector;
                 tsvector
-------------------------------------------
 '    ' 'contains' 'lexeme' 'spaces' 'the'
</programlisting>

     (We use dollar-quoted string literals in this example and the next one
     to avoid the confusion of having to double quote marks within the
     literals.)  Embedded quotes and backslashes must be doubled:

<programlisting>
SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
                    tsvector
------------------------------------------------
 'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
</programlisting>

     Optionally, integer <firstterm>positions</firstterm>
     can be attached to lexemes:

<programlisting>
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
                                  tsvector
-------------------------------------------------------------------&zwsp;------------
 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
</programlisting>

     A position normally indicates the source word's location in the
     document.  Positional information can be used for
     <firstterm>proximity ranking</firstterm>.  Position values can
     range from 1 to 16383; larger numbers are silently set to 16383.
     Duplicate positions for the same lexeme are discarded.
    </para>

    <para>
     Lexemes that have positions can further be labeled with a
     <firstterm>weight</firstterm>, which can be <literal>A</literal>,
     <literal>B</literal>, <literal>C</literal>, or <literal>D</literal>.
     <literal>D</literal> is the default and hence is not shown on output:

<programlisting>
SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
          tsvector
----------------------------
 'a':1A 'cat':5 'fat':2B,4C
</programlisting>

     Weights are typically used to reflect document structure, for example
     by marking title words differently from body words.  Text search
     ranking functions can assign different priorities to the different
     weight markers.
    </para>

    <para>
     It is important to understand that the
     <type>tsvector</type> type itself does not perform any word
     normalization; it assumes the words it is given are normalized
     appropriately for the application.  For example,

<programlisting>
SELECT 'The Fat Rats'::tsvector;
      tsvector
--------------------
 'Fat' 'Rats' 'The'
</programlisting>

     For most English-text-searching applications the above words would
     be considered non-normalized, but <type>tsvector</type> doesn't care.
     Raw document text should usually be passed through

Title: TSVector Data Type
Summary
The tsvector data type represents a document in a form optimized for text search, consisting of a sorted list of distinct lexemes. Lexemes can be normalized words, and optionally, can include positions and weights to provide additional information for ranking and proximity search. The tsvector type does not perform word normalization itself, relying on the input to be appropriately normalized for the application.