Documents in PostgreSQL Text Search and Basic Text Matching

<type>tsvector</type> is provided for storing preprocessed documents, along with a type <type>tsquery</type> for representing processed queries (<xref linkend="datatype-textsearch"/>). There are many functions and operators available for these data types (<xref linkend="functions-textsearch"/>), the most important of which is the match operator <literal>@@</literal>, which we introduce in <xref linkend="textsearch-matching"/>. Full text searches can be accelerated using indexes (<xref linkend="textsearch-indexes"/>). </para> <sect2 id="textsearch-document"> <title>What Is a Document?</title> <indexterm zone="textsearch-document"> <primary>document</primary> <secondary>text search</secondary> </indexterm> <para> A <firstterm>document</firstterm> is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words. </para> <para> For searches within <productname>PostgreSQL</productname>, a document is normally a textual field within a row of a database table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing and it might not be stored anywhere as a whole. For example: <programlisting> SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document FROM messages WHERE mid = 12; SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document FROM messages m, docs d WHERE m.mid = d.did AND m.mid = 12; </programlisting> </para> <note> <para> Actually, in these example queries, <function>coalesce</function> should be used to prevent a single <literal>NULL</literal> attribute from causing a <literal>NULL</literal> result for the whole document. </para> </note> <para> Another possibility is to store the documents as simple text files in the file system. In this case, the database can be used to store the full text index and to execute searches, and some unique identifier can be used to retrieve the document from the file system. However, retrieving files from outside the database requires superuser permissions or special function support, so this is usually less convenient than keeping all the data inside <productname>PostgreSQL</productname>. Also, keeping everything inside the database allows easy access to document metadata to assist in indexing and display. </para> <para> For text search purposes, each document must be reduced to the preprocessed <type>tsvector</type> format. Searching and ranking are performed entirely on the <type>tsvector</type> representation of a document — the original text need only be retrieved when the document has been selected for display to a user. We therefore often speak of the <type>tsvector</type> as being the document, but of course it is only a compact representation of the full document. </para> </sect2> <sect2 id="textsearch-matching"> <title>Basic Text Matching</title> <para> Full text searching in <productname>PostgreSQL</productname> is based on the match operator <literal>@@</literal>, which returns <literal>true</literal> if a <type>tsvector</type> (document) matches a <type>tsquery</type> (query). It doesn't matter which data type is written first: <programlisting> SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery; ?column? ---------- t SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; ?column? ---------- f </programlisting> </para>

PostgreSQL uses tsvector and tsquery data types for storing preprocessed documents and representing queries, respectively, along with functions and operators like the match operator @@. A document, the unit of searching, can be a textual field within a database table, a combination of fields, or a text file. The database can store the full-text index, using a unique identifier to retrieve the document from the file system. The tsvector format is a compact representation of the document used for searching and ranking. The match operator @@ returns true if a tsvector matches a tsquery.