Simple Dictionary for Text Search

positions 1,2,4 are because of stop words. Ranks calculated for documents
with and without stop words are quite different:

<screen>
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.1
</screen>
</para>

<para>
 It is up to the specific dictionary how it treats stop words. For example,
 <literal>ispell</literal> dictionaries first normalize words and then
 look at the list of stop words, while <literal>Snowball</literal> stemmers
 first check the list of stop words. The reason for the different
 behavior is an attempt to decrease noise.
</para>

</sect2>

<sect2 id="textsearch-simple-dictionary">
<title>Simple Dictionary</title>

<para>
 The <literal>simple</literal> dictionary template operates by converting the
 input token to lower case and checking it against a file of stop words.
 If it is found in the file then an empty array is returned, causing
 the token to be discarded. If not, the lower-cased form of the word
 is returned as the normalized lexeme. Alternatively, the dictionary
 can be configured to report non-stop-words as unrecognized, allowing
 them to be passed on to the next dictionary in the list.
</para>

<para>
 Here is an example of a dictionary definition using the <literal>simple</literal>
 template:

<programlisting>
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);
</programlisting>

 Here, <literal>english</literal> is the base name of a file of stop words.
 The file's full name will be
 <filename>$SHAREDIR/tsearch_data/english.stop</filename>,
 where <literal>$SHAREDIR</literal> means the
 <productname>PostgreSQL</productname> installation's shared-data directory,
 often <filename>/usr/local/share/postgresql</filename> (use <command>pg_config
 --sharedir</command> to determine it if you're not sure).
 The file format is simply a list
 of words, one per line. Blank lines and trailing spaces are ignored,
 and upper case is folded to lower case, but no other processing is done
 on the file contents.
</para>

<para>
 Now we can test our dictionary:

<screen>
SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}
</screen>
</para>

<para>
 We can also choose to return <literal>NULL</literal>, instead of the lower-cased
 word, if it is not found in the stop words file. This behavior is
 selected by setting the dictionary's <literal>Accept</literal> parameter to
 <literal>false</literal>. Continuing the example:

<screen>
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );

SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------


SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}
</screen>
</para>

<para>
 With the default setting of <literal>Accept</literal> = <literal>true</literal>,
 it is only useful to place a <literal>simple</literal> dictionary at the end
 of a list of dictionaries, since it will never pass on any token to
 a following dictionary. Conversely, <literal>Accept</literal> = <literal>false</literal>
 is only useful when there is at least one following dictionary.
</para>

<caution>
 <para>
  Most types of dictionaries rely on configuration files, such as files of
  stop words. These files <emphasis>must</emphasis> be stored in UTF-8
  encoding. They will be translated to the actual database encoding, if
  that is different, when they are read into the server.
 </para>
</caution>

<caution>
 <para>
  Normally, a database session will read a dictionary configuration file

The simple dictionary template converts each input token to lower case and checks it against a stop word file. If the token is found in the file, an empty array is returned and the token is discarded. Otherwise the lower-cased word is returned as the normalized lexeme, or NULL is returned when the Accept parameter is set to false. The example creates a simple dictionary, tests it with ts_lexize, and then alters it to return NULL for non-stop words. With the default Accept = true, a simple dictionary is only useful at the end of a dictionary list, because it never passes a token on to a following dictionary; Accept = false is only useful when at least one dictionary follows it. Configuration files, such as stop word files, must be stored in UTF-8 encoding; they are translated to the database encoding, if that differs, when they are read into the server.
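
The placement rule for Accept = false can be illustrated with a short sketch that puts the dictionary ahead of a stemmer in a text search configuration. This is only an illustration under stated assumptions: it assumes public.simple_dict already exists as defined in the example above, the configuration name public.simple_chain_demo is hypothetical, and the choice of the built-in english_stem dictionary as the follower and of the asciiword token type is made here for demonstration, not taken from the text.

```sql
-- Assumes public.simple_dict was created as in the section's example.
-- Report non-stop-words as unrecognized (NULL) so they reach the next dictionary.
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );

-- Hypothetical configuration name, copied from the built-in english configuration.
CREATE TEXT SEARCH CONFIGURATION public.simple_chain_demo (
    COPY = pg_catalog.english
);

-- For plain ASCII words, consult simple_dict first, then the Snowball stemmer.
-- Stop words end at simple_dict; every other word falls through to english_stem.
ALTER TEXT SEARCH CONFIGURATION public.simple_chain_demo
    ALTER MAPPING FOR asciiword
    WITH public.simple_dict, pg_catalog.english_stem;

-- Show, per token, which dictionary finally handled it and what it produced.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('public.simple_chain_demo', 'The list of stop words');
```

If Accept were left at its default of true, simple_dict would return a lexeme for every non-stop word, so english_stem would presumably never be consulted for asciiword tokens in this configuration.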