Creating Text Search Indexes in PostgreSQL

use the configuration set by <xref linkend="guc-default-text-search-config"/>. </para> <para> A more complex example is to select the ten most recent documents that contain <literal>create</literal> and <literal>table</literal> in the <structname>title</structname> or <structname>body</structname>: <programlisting> SELECT title FROM pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10; </programlisting> For clarity we omitted the <function>coalesce</function> function calls which would be needed to find rows that contain <literal>NULL</literal> in one of the two fields. </para> <para> Although these queries will work without an index, most applications will find this approach too slow, except perhaps for occasional ad-hoc searches. Practical use of text searching usually requires creating an index. </para> </sect2> <sect2 id="textsearch-tables-index"> <title>Creating Indexes</title> <para> We can create a <acronym>GIN</acronym> index (<xref linkend="textsearch-indexes"/>) to speed up text searches: <programlisting> CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body)); </programlisting> Notice that the 2-argument version of <function>to_tsvector</function> is used. Only text search functions that specify a configuration name can be used in expression indexes (<xref linkend="indexes-expressional"/>). This is because the index contents must be unaffected by <xref linkend="guc-default-text-search-config"/>. If they were affected, the index contents might be inconsistent because different entries could contain <type>tsvector</type>s that were created with different text search configurations, and there would be no way to guess which was which. It would be impossible to dump and restore such an index correctly. </para> <para> Because the two-argument version of <function>to_tsvector</function> was used in the index above, only a query reference that uses the 2-argument version of <function>to_tsvector</function> with the same configuration name will use that index. That is, <literal>WHERE to_tsvector('english', body) @@ 'a & b'</literal> can use the index, but <literal>WHERE to_tsvector(body) @@ 'a & b'</literal> cannot. This ensures that an index will be used only with the same configuration used to create the index entries. </para> <para> It is possible to set up more complex expression indexes wherein the configuration name is specified by another column, e.g.: <programlisting> CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector(config_name, body)); </programlisting> where <literal>config_name</literal> is a column in the <literal>pgweb</literal> table. This allows mixed configurations in the same index while recording which configuration was used for each index entry. This would be useful, for example, if the document collection contained documents in different languages. Again, queries that are meant to use the index must be phrased to match, e.g., <literal>WHERE to_tsvector(config_name, body) @@ 'a & b'</literal>. </para> <para> Indexes can even concatenate columns: <programlisting> CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', title || ' ' || body)); </programlisting> </para> <para> Another approach is to create a separate <type>tsvector</type> column to hold the output of <function>to_tsvector</function>. To keep this column automatically up to date with its source data, use a stored generated column. This example is a concatenation of <literal>title</literal> and <literal>body</literal>, using <function>coalesce</function> to ensure that one field will still be indexed when the other is <literal>NULL</literal>: <programlisting> ALTER TABLE pgweb

Creating GIN indexes speeds up text searches. Expression indexes must use the 2-argument version of to_tsvector, specifying the configuration. Indexes can be more complex, using a configuration name from another column or concatenating columns. Alternatively, a separate tsvector column can be maintained automatically using a stored generated column.