reported as a separate token type,
since it is sometimes useful to distinguish them. In most European
languages, token types <literal>word</literal> and <literal>asciiword</literal>
should be treated alike.
</para>
<para>
<literal>email</literal> does not support all valid email characters as
defined by <ulink url="https://datatracker.ietf.org/doc/html/rfc5322">RFC 5322</ulink>.
Specifically, the only non-alphanumeric characters supported for
email user names are period, dash, and underscore.
</para>
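<para>
 As a rough way to observe this restriction, one can compare how the parser
 tokenizes an address that uses only the supported characters with one that
 contains a plus sign. Both addresses below are made-up illustrations, and
 the token types reported can depend on the server version, so the output
 is not shown:
<programlisting>
-- User name uses only period and underscore, which the parser supports
SELECT alias, token FROM ts_debug('john.doe_smith@example.com');

-- User name contains a plus sign, which RFC 5322 allows but the parser does not
SELECT alias, token FROM ts_debug('john+news@example.com');
</programlisting>
 If the second query does not report a single <literal>email</literal> token
 covering the whole address, searches for that full address should not be
 expected to match it as a unit.
</para>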
<para>
<literal>tag</literal> does not support all valid tag names as defined by
<ulink url="https://www.w3.org/TR/xml/">W3C Recommendation, XML</ulink>.
Specifically, the only tag names supported are those starting with an
ASCII letter, underscore, or colon, and containing only letters, digits,
hyphens, underscores, periods, and colons. <literal>tag</literal> also
includes XML comments starting with <literal>&lt;!--</literal> and ending
with <literal>--&gt;</literal>, and XML declarations (but note that this
includes anything starting with <literal>&lt;?x</literal> and ending with
<literal>&gt;</literal>).
</para>
</note>
<para>
It is possible for the parser to produce overlapping tokens from the same
piece of text. As an example, a hyphenated word will be reported both
as the entire word and as each component:
<screen>
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1
</screen>
This behavior is desirable since it allows searches to work for both
the whole compound word and for components. Here is another
instructive example:
<screen>
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
</screen>
</para>
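<para>
 The practical effect of overlapping tokens can be sketched with a pair of
 match tests. The query below is only an illustration; it assumes the
 built-in <literal>simple</literal> configuration, chosen so that no
 stemming or stop-word removal interferes with the comparison:
<programlisting>
-- Build the document vector once, then test it against two queries
SELECT doc @@ to_tsquery('simple', 'foo-bar-beta1') AS whole_word,
       doc @@ to_tsquery('simple', 'bar')           AS component
FROM (SELECT to_tsvector('simple', 'foo-bar-beta1') AS doc) AS d;
</programlisting>
 Because the parser emits both the compound word and its parts, both
 comparisons should be able to find a match in the same document.
</para>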
</sect1>
<sect1 id="textsearch-dictionaries">
<title>Dictionaries</title>
<para>
Dictionaries are used to eliminate words that should not be considered in a
search (<firstterm>stop words</firstterm>), and to <firstterm>normalize</firstterm> words so
that different derived forms of the same word will match. A successfully
normalized word is called a <firstterm>lexeme</firstterm>. Aside from
improving search quality, normalization and removal of stop words reduce the
size of the <type>tsvector</type> representation of a document, thereby
improving performance. Normalization does not always have linguistic meaning
and usually depends on application semantics.
</para>
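<para>
 Both roles of a dictionary can be observed with
 <function>ts_lexize</function>, which runs a single dictionary on a single
 token. The sketch below assumes the built-in
 <literal>english_stem</literal> dictionary: a recognized word is reduced to
 a lexeme, while a stop word yields an empty array:
<screen>
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
</screen>
</para>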
<para>
Some examples of normalization:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Linguistic — Ispell dictionaries try to reduce input words to a
normalized form; stemmer dictionaries remove word endings
</para>
</listitem>
<listitem>
<para>
<acronym>URL</acronym> locations can be canonicalized to make
equivalent URLs match:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
http://www.pgsql.ru/db/mw/index.html
</para>
</listitem>
<listitem>
<para>
http://www.pgsql.ru/db/mw/