Home Explore Blog CI



postgresql

3rd chunk of `doc/src/sgml/textsearch.sgml`
3789c3fb0bcab1fcf78ace40ae269be8e08a4c2ba74a9ca90000000100000fa5
 <type>tsvector</type> is provided for storing preprocessed
   documents, along with a type <type>tsquery</type> for representing processed
   queries (<xref linkend="datatype-textsearch"/>).  There are many
   functions and operators available for these data types
   (<xref linkend="functions-textsearch"/>), the most important of which is
   the match operator <literal>@@</literal>, which we introduce in
   <xref linkend="textsearch-matching"/>.  Full text searches can be accelerated
   using indexes (<xref linkend="textsearch-indexes"/>).
  </para>


  <sect2 id="textsearch-document">
   <title>What Is a Document?</title>

   <indexterm zone="textsearch-document">
    <primary>document</primary>
    <secondary>text search</secondary>
   </indexterm>

   <para>
    A <firstterm>document</firstterm> is the unit of searching in a full text search
    system; for example, a magazine article or email message.  The text search
    engine must be able to parse documents and store associations of lexemes
    (key words) with their parent document. Later, these associations are
    used to search for documents that contain query words.
   </para>

   <para>
    For searches within <productname>PostgreSQL</productname>,
    a document is normally a textual field within a row of a database table,
    or possibly a combination (concatenation) of such fields, perhaps stored
    in several tables or obtained dynamically. In other words, a document can
    be constructed from different parts for indexing and it might not be
    stored anywhere as a whole. For example:

<programlisting>
SELECT title || ' ' ||  author || ' ' ||  abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE m.mid = d.did AND m.mid = 12;
</programlisting>
   </para>

   <note>
    <para>
     Actually, in these example queries, <function>coalesce</function>
     should be used to prevent a single <literal>NULL</literal> attribute from
     causing a <literal>NULL</literal> result for the whole document.
    </para>
   </note>

   <para>
    Another possibility is to store the documents as simple text files in the
    file system. In this case, the database can be used to store the full text
    index and to execute searches, and some unique identifier can be used to
    retrieve the document from the file system.  However, retrieving files
    from outside the database requires superuser permissions or special
    function support, so this is usually less convenient than keeping all
    the data inside <productname>PostgreSQL</productname>.  Also, keeping
    everything inside the database allows easy access
    to document metadata to assist in indexing and display.
   </para>

   <para>
    For text search purposes, each document must be reduced to the
    preprocessed <type>tsvector</type> format.  Searching and ranking
    are performed entirely on the <type>tsvector</type> representation
    of a document &mdash; the original text need only be retrieved
    when the document has been selected for display to a user.
    We therefore often speak of the <type>tsvector</type> as being the
    document, but of course it is only a compact representation of
    the full document.
   </para>
  </sect2>

  <sect2 id="textsearch-matching">
   <title>Basic Text Matching</title>

   <para>
    Full text searching in <productname>PostgreSQL</productname> is based on
    the match operator <literal>@@</literal>, which returns
    <literal>true</literal> if a <type>tsvector</type>
    (document) matches a <type>tsquery</type> (query).
    It doesn't matter which data type is written first:

<programlisting>
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
 ?column?
----------
 t

SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f
</programlisting>
   </para>

Title: Documents in PostgreSQL Text Search and Basic Text Matching
Summary
PostgreSQL uses tsvector and tsquery data types for storing preprocessed documents and representing queries, respectively, along with functions and operators like the match operator @@. A document, the unit of searching, can be a textual field within a database table, a combination of fields, or a text file. The database can store the full-text index, using a unique identifier to retrieve the document from the file system. The tsvector format is a compact representation of the document used for searching and ranking. The match operator @@ returns true if a tsvector matches a tsquery.