Text Search Configuration and Stop Words

predefined dictionary template is described below. If no existing
template is suitable, it is possible to create new ones; see the
<filename>contrib/</filename> area of the <productname>PostgreSQL</productname>
distribution for examples.
</para>

<para>
 A text search configuration binds a parser together with a set of
 dictionaries to process the parser's output tokens. For each token type
 that the parser can return, a separate list of dictionaries is specified
 by the configuration. When a token of that type is found by the parser,
 each dictionary in the list is consulted in turn, until some dictionary
 recognizes it as a known word. If the token is identified as a stop word,
 or if no dictionary recognizes it, the token is discarded and is neither
 indexed nor searched for.
 Normally, the first dictionary that returns a non-<literal>NULL</literal>
 output determines the result, and any remaining dictionaries are not
 consulted; but a filtering dictionary can replace the given word
 with a modified word, which is then passed to subsequent dictionaries.
</para>

<para>
 The general rule for configuring a list of dictionaries
 is to place the narrowest, most specific dictionary first, then the more
 general dictionaries, finishing with a very general dictionary, like
 a <application>Snowball</application> stemmer or <literal>simple</literal>, which
 recognizes everything. For example, for an astronomy-specific search
 (<literal>astro_en</literal> configuration) one could bind token type
 <type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical
 terms, a general English dictionary and a <application>Snowball</application> English
 stemmer:

<programlisting>
ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
</programlisting>
</para>

<para>
 A filtering dictionary can be placed anywhere in the list, except at the
 end, where it would be useless.
 Filtering dictionaries are useful to partially
 normalize words to simplify the task of later dictionaries. For example,
 a filtering dictionary could be used to remove accents from accented
 letters, as is done by the <xref linkend="unaccent"/> module.
</para>

<sect2 id="textsearch-stopwords">
 <title>Stop Words</title>

 <para>
  Stop words are words that are very common, appear in almost every
  document, and have no discrimination value. Therefore, they can be ignored
  in the context of full text searching. For example, every English text
  contains words like <literal>a</literal> and <literal>the</literal>, so it is
  useless to store them in an index. However, stop words do affect the
  positions in <type>tsvector</type>, which in turn affect ranking:

<screen>
SELECT to_tsvector('english', 'in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6
</screen>

  Positions 1, 2, and 4 are missing because they were occupied by the stop
  words <literal>in</literal>, <literal>the</literal>, and <literal>of</literal>.
  Ranks calculated for documents with and without stop words are quite different:

<screen>
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.1
</screen>

 </para>

 <para>
  It is up to the specific dictionary how it treats stop words. For example,
  <literal>ispell</literal> dictionaries first normalize words and then
  look at the list of stop words, while <literal>Snowball</literal> stemmers
  first check the list of stop words. The reason for the different
  behavior is an attempt to decrease noise.
 </para>

</sect2>

<sect2 id="textsearch-simple-dictionary">
 <title>Simple Dictionary</title>

 <para>
  The <literal>simple</literal> dictionary template operates by converting the
  input token

The text explains how text search configurations bind parsers and dictionaries to process tokens, using lists of dictionaries for each token type. It advises ordering dictionaries from specific to general, with Snowball stemmers or "simple" dictionaries at the end. Filtering dictionaries can be used to normalize words. Stop words, being common and lacking discrimination value, are ignored in indexing but affect tsvector positions and ranking. Dictionaries handle stop words differently, such as ispell normalizing before checking and Snowball stemmers checking first.
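
The dictionary lookup and stop-word handling summarized above can be inspected from SQL. The following is a minimal sketch, assuming a stock installation in which the built-in english configuration and english_stem dictionary are available; the sample strings are illustrative only, and results may differ if the configuration has been altered.

<programlisting>
-- Ask one dictionary how it treats a single token.
-- An empty array means the dictionary recognized the token as a stop word.
SELECT ts_lexize('english_stem', 'a');

-- A word that is not a stop word is returned as its stemmed lexeme.
SELECT ts_lexize('english_stem', 'stars');

-- For each token, show its type, the dictionaries listed for that type,
-- the dictionary that recognized it, and the resulting lexemes.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'in the list of stop words');
</programlisting>

In the ts_debug output, tokens that were discarded as stop words should appear with an empty lexemes array, while their positions are still counted in the resulting tsvector.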