<synopsis>
ts_token_type(<replaceable class="parameter">parser_name</replaceable> <type>text</type>, OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>,
              OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, OUT <replaceable class="parameter">description</replaceable> <type>text</type>) returns <type>setof record</type>
ts_token_type(<replaceable class="parameter">parser_oid</replaceable> <type>oid</type>, OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>,
              OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, OUT <replaceable class="parameter">description</replaceable> <type>text</type>) returns <type>setof record</type>
</synopsis>

  <para>
   <function>ts_token_type</function> returns a table which describes each type of
   token the specified parser can recognize.  For each token type, the table
   gives the integer <varname>tokid</varname> that the parser uses to label a
   token of that type, the <varname>alias</varname> that names the token type
   in configuration commands, and a short <varname>description</varname>.  For
   example:

<screen>
SELECT * FROM ts_token_type('default');
 tokid |      alias      |               description
-------+-----------------+------------------------------------------
     1 | asciiword       | Word, all ASCII
     2 | word            | Word, all letters
     3 | numword         | Word, letters and digits
     4 | email           | Email address
     5 | url             | URL
     6 | host            | Host
     7 | sfloat          | Scientific notation
     8 | version         | Version number
     9 | hword_numpart   | Hyphenated word part, letters and digits
    10 | hword_part      | Hyphenated word part, all letters
    11 | hword_asciipart | Hyphenated word part, all ASCII
    12 | blank           | Space symbols
    13 | tag             | XML tag
    14 | protocol        | Protocol head
    15 | numhword        | Hyphenated word, letters and digits
    16 | asciihword      | Hyphenated word, all ASCII
    17 | hword           | Hyphenated word, all letters
    18 | url_path        | URL path
    19 | file            | File or path name
    20 | float           | Decimal notation
    21 | int             | Signed integer
    22 | uint            | Unsigned integer
    23 | entity          | XML entity
</screen>
   </para>
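
   <para>
    The token type names can be attached to the output of
    <function>ts_parse</function> (described earlier in this section) by
    joining on <varname>tokid</varname>.  As a sketch, reusing the
    <literal>123 - a number</literal> sample string from the
    <function>ts_parse</function> example:

<screen>
SELECT t.alias, p.token
FROM ts_parse('default', '123 - a number') AS p,
     ts_token_type('default') AS t
WHERE p.tokid = t.tokid;
   alias   | token
-----------+--------
 uint      | 123
 blank     |
 blank     | -
 asciiword | a
 blank     |
 asciiword | number
</screen>
   </para>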

  </sect2>

  <sect2 id="textsearch-dictionary-testing">
   <title>Dictionary Testing</title>

   <para>
    The <function>ts_lexize</function> function facilitates dictionary testing.
   </para>

   <indexterm>
    <primary>ts_lexize</primary>
   </indexterm>

<synopsis>
ts_lexize(<replaceable class="parameter">dict</replaceable> <type>regdictionary</type>, <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>text[]</type>
</synopsis>

   <para>
    <function>ts_lexize</function> returns an array of lexemes if the input
    <replaceable>token</replaceable> is known to the dictionary,
    or an empty array if the token
    is known to the dictionary but it is a stop word, or
    <literal>NULL</literal> if it is an unknown word.
   </para>

   <para>
    Examples:

<screen>
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
</screen>
   </para>
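
   <para>
    A Snowball dictionary such as <literal>english_stem</literal> recognizes
    everything, so it never produces the <literal>NULL</literal> result.  To
    see that case, a dictionary that can reject unknown words is required;
    for example, assuming an <application>Ispell</application> dictionary
    named <literal>english_ispell</literal> has been installed (the name
    here is illustrative):

<screen>
SELECT ts_lexize('english_ispell', 'qwerty') IS NULL;
 ?column?
----------
 t
</screen>
   </para>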

   <note>
    <para>
     The <function>ts_lexize</function> function expects a single
     <emphasis>token</emphasis>, not text. Here is a case
     where this can be confusing:

<screen>
SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null;
 ?column?
----------
 t
</screen>

     The thesaurus dictionary <literal>thesaurus_astro</literal> does know the
     phrase <literal>supernovae stars</literal>, but <function>ts_lexize</function>
     fails since it does not parse the input text but treats it as a single
     token. Use <function>plainto_tsquery</function> or <function>to_tsvector</function> to
     test thesaurus dictionaries, for example:

<screen>
SELECT plainto_tsquery('supernovae stars');
 plainto_tsquery
-----------------
 'sn'
</screen>
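
     The equivalent check with <function>to_tsvector</function> (assuming
     <literal>thesaurus_astro</literal> is part of the default text search
     configuration) is:

<screen>
SELECT to_tsvector('supernovae stars');
 to_tsvector
-------------
 'sn':1
</screen>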
    </para>
   </note>

  </sect2>

 </sect1>

 <sect1 id="textsearch-indexes">
  <title>Preferred Index Types for Text Search</title>

  <indexterm zone="textsearch-indexes">
   <primary>text search</primary>
   <secondary>indexes</secondary>
  </indexterm>

  <para>
   There are two kinds of indexes that can be used to speed up full text
   searches:
   <link linkend="gin"><acronym>GIN</acronym></link> and
   <link linkend="gist"><acronym>GiST</acronym></link>.
   Note that indexes are not mandatory for full text searching, but
