Overlapping Tokens and Dictionary Introduction

reported as a separate token type, since it is sometimes useful to distinguish them. In most European languages, token types <literal>word</literal> and <literal>asciiword</literal> should be treated alike.
   </para>

   <para>
    <literal>email</literal> does not support all valid email characters as defined by <ulink url="https://datatracker.ietf.org/doc/html/rfc5322">RFC 5322</ulink>. Specifically, the only non-alphanumeric characters supported for email user names are period, dash, and underscore.
   </para>

   <para>
    <literal>tag</literal> does not support all valid tag names as defined by <ulink url="https://www.w3.org/TR/xml/">W3C Recommendation, XML</ulink>. Specifically, the only tag names supported are those starting with an ASCII letter, underscore, or colon, and containing only letters, digits, hyphens, underscores, periods, and colons. <literal>tag</literal> also includes XML comments starting with <literal>&lt;!--</literal> and ending with <literal>--&gt;</literal>, and XML declarations (but note that this includes anything starting with <literal>&lt;?x</literal> and ending with <literal>&gt;</literal>).
   </para>
  </note>

  <para>
   It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component:

<screen>
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1
</screen>

   This behavior is desirable since it allows searches to work for both the whole compound word and for components.
   Here is another instructive example:

<screen>
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
</screen>
  </para>

 </sect1>

 <sect1 id="textsearch-dictionaries">
  <title>Dictionaries</title>

  <para>
   Dictionaries are used to eliminate words that should not be considered in a search (<firstterm>stop words</firstterm>), and to <firstterm>normalize</firstterm> words so that different derived forms of the same word will match. A successfully normalized word is called a <firstterm>lexeme</firstterm>. Aside from improving search quality, normalization and removal of stop words reduce the size of the <type>tsvector</type> representation of a document, thereby improving performance. Normalization does not always have linguistic meaning and usually depends on application semantics.
  </para>

  <para>
   Some examples of normalization:

   <itemizedlist spacing="compact" mark="bullet">
    <listitem>
     <para>
      Linguistic: Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings
     </para>
    </listitem>
    <listitem>
     <para>
      <acronym>URL</acronym> locations can be canonicalized to make equivalent URLs match:

      <itemizedlist spacing="compact" mark="bullet">
       <listitem>
        <para>
         http://www.pgsql.ru/db/mw/index.html
        </para>
       </listitem>
       <listitem>
        <para>
         http://www.pgsql.ru/db/mw/

The parser can produce overlapping tokens from the same text, such as for hyphenated words and URLs, allowing searches for both the whole compound and its components. The section then transitions to discussing dictionaries in PostgreSQL, which are used to eliminate stop words and normalize words into lexemes, improving search quality, reducing tsvector size, and enhancing performance. Normalization can be linguistic or application-specific, such as canonicalizing URLs.
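<para>
 As a rough illustration of the two ideas summarized above, the following statements could be tried in <application>psql</application>. They are a sketch rather than part of the original example set, and they assume a stock installation in which the built-in <literal>english</literal> configuration and the <literal>english_stem</literal> dictionary are available; the exact output depends on the configuration in use, so none is shown here.
<programlisting>
-- Overlapping tokens: with the assumed 'english' configuration, the
-- hyphenated compound and its parts should all appear in the tsvector.
SELECT to_tsvector('english', 'foo-bar-beta1');

-- Because the parts are indexed too, a search for one component
-- is expected to match the compound word.
SELECT to_tsvector('english', 'foo-bar-beta1') @@ to_tsquery('english', 'bar');

-- Normalization: ask a single dictionary how it would reduce a word
-- to a lexeme.
SELECT ts_lexize('english_stem', 'stars');

-- Stop words: a word the dictionary treats as a stop word is expected
-- to come back as an empty array rather than a lexeme.
SELECT ts_lexize('english_stem', 'a');
</programlisting>
 Calling <function>ts_lexize</function> on one dictionary at a time is a convenient way to see why a particular word was normalized or dropped, while <function>to_tsvector</function> shows the combined effect of the parser and all configured dictionaries.
</para>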