Custom Triggers, Configuration Importance, and Gathering Document Statistics

example, to weight title differently from body — it is necessary to write a custom trigger. Here is an example using <application>PL/pgSQL</application> as the trigger language: <programlisting> CREATE FUNCTION messages_trigger() RETURNS trigger AS $$ begin new.tsv := setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') || setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D'); return new; end $$ LANGUAGE plpgsql; CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON messages FOR EACH ROW EXECUTE FUNCTION messages_trigger(); </programlisting> </para> <para> Keep in mind that it is important to specify the configuration name explicitly when creating <type>tsvector</type> values inside triggers, so that the column's contents will not be affected by changes to <varname>default_text_search_config</varname>. Failure to do this is likely to lead to problems such as search results changing after a dump and restore. </para> </sect2> <sect2 id="textsearch-statistics"> <title>Gathering Document Statistics</title> <indexterm> <primary>ts_stat</primary> </indexterm> <para> The function <function>ts_stat</function> is useful for checking your configuration and for finding stop-word candidates. </para> <synopsis> ts_stat(<replaceable class="parameter">sqlquery</replaceable> <type>text</type>, <optional> <replaceable class="parameter">weights</replaceable> <type>text</type>, </optional> OUT <replaceable class="parameter">word</replaceable> <type>text</type>, OUT <replaceable class="parameter">ndoc</replaceable> <type>integer</type>, OUT <replaceable class="parameter">nentry</replaceable> <type>integer</type>) returns <type>setof record</type> </synopsis> <para> <replaceable>sqlquery</replaceable> is a text value containing an SQL query which must return a single <type>tsvector</type> column. <function>ts_stat</function> executes the query and returns statistics about each distinct lexeme (word) contained in the <type>tsvector</type> data. The columns returned are <itemizedlist spacing="compact" mark="bullet"> <listitem> <para> <replaceable>word</replaceable> <type>text</type> — the value of a lexeme </para> </listitem> <listitem> <para> <replaceable>ndoc</replaceable> <type>integer</type> — number of documents (<type>tsvector</type>s) the word occurred in </para> </listitem> <listitem> <para> <replaceable>nentry</replaceable> <type>integer</type> — total number of occurrences of the word </para> </listitem> </itemizedlist> If <replaceable>weights</replaceable> is supplied, only occurrences having one of those weights are counted. </para> <para> For example, to find the ten most frequent words in a document collection: <programlisting> SELECT * FROM ts_stat('SELECT vector FROM apod') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10; </programlisting> The same, but counting only word occurrences with weight <literal>A</literal> or <literal>B</literal>: <programlisting> SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10; </programlisting> </para> </sect2> </sect1> <sect1 id="textsearch-parsers"> <title>Parsers</title> <para> Text search parsers are responsible for splitting raw document text into <firstterm>tokens</firstterm> and identifying each token's type, where the set of possible types is defined by the parser itself. Note that a parser does not modify the text at all — it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries. At present <productname>PostgreSQL</productname> provides just

This section first demonstrates how to create custom triggers to weight title differently from body, using PL/pgSQL. Then it emphasizes the importance of explicitly specifying the configuration name when creating tsvector values inside triggers to avoid issues with search results changing after a dump and restore. Finally, it introduces the ts_stat function for checking text search configurations and finding stop-word candidates. It describes the function's parameters and the columns returned, providing examples of its usage to find the most frequent words in a document collection, optionally filtering by weight.