Normalization Examples and Dictionary Functionality

<firstterm>lexeme</firstterm>. Aside from improving search quality,
normalization and removal of stop words reduce the size of the
<type>tsvector</type> representation of a document, thereby improving
performance. Normalization does not always have linguistic meaning and
usually depends on application semantics.
</para>

<para>
 Some examples of normalization:

 <itemizedlist spacing="compact" mark="bullet">
  <listitem>
   <para>
    Linguistic: Ispell dictionaries try to reduce input words to a
    normalized form; stemmer dictionaries remove word endings
   </para>
  </listitem>
  <listitem>
   <para>
    <acronym>URL</acronym> locations can be canonicalized to make
    equivalent URLs match:

    <itemizedlist spacing="compact" mark="bullet">
     <listitem>
      <para>
       http://www.pgsql.ru/db/mw/index.html
      </para>
     </listitem>
     <listitem>
      <para>
       http://www.pgsql.ru/db/mw/
      </para>
     </listitem>
     <listitem>
      <para>
       http://www.pgsql.ru/db/../db/mw/index.html
      </para>
     </listitem>
    </itemizedlist>
   </para>
  </listitem>
  <listitem>
   <para>
    Color names can be replaced by their hexadecimal values, e.g.,
    <literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
   </para>
  </listitem>
  <listitem>
   <para>
    If indexing numbers, we can remove some fractional digits to reduce
    the range of possible numbers, so for example
    <emphasis>3.14</emphasis>159265359, <emphasis>3.14</emphasis>15926,
    <emphasis>3.14</emphasis> will be the same after normalization if only
    two digits are kept after the decimal point.
   </para>
  </listitem>
 </itemizedlist>
</para>

<para>
 A dictionary is a program that accepts a token as input and returns:

 <itemizedlist spacing="compact" mark="bullet">
  <listitem>
   <para>
    an array of lexemes if the input token is known to the dictionary
    (notice that one token can produce more than one lexeme)
   </para>
  </listitem>
  <listitem>
   <para>
    a single lexeme with the <literal>TSL_FILTER</literal> flag set, to
    replace the original token with a new token to be passed to subsequent
    dictionaries (a dictionary that does this is called a
    <firstterm>filtering dictionary</firstterm>)
   </para>
  </listitem>
  <listitem>
   <para>
    an empty array if the dictionary knows the token, but it is a stop word
   </para>
  </listitem>
  <listitem>
   <para>
    <literal>NULL</literal> if the dictionary does not recognize the input
    token
   </para>
  </listitem>
 </itemizedlist>
</para>

<para>
 <productname>PostgreSQL</productname> provides predefined dictionaries for
 many languages. There are also several predefined templates that can be
 used to create new dictionaries with custom parameters. Each predefined
 dictionary template is described below. If no existing template is
 suitable, it is possible to create new ones; see the
 <filename>contrib/</filename> area of the
 <productname>PostgreSQL</productname> distribution for examples.
</para>

<para>
 A text search configuration binds a parser together with a set of
 dictionaries to process the parser's output tokens. For each token type
 that the parser can return, a separate list of dictionaries is specified
 by the configuration. When a token of that type is found by the parser,
 each dictionary in the list is consulted in turn, until some dictionary
 recognizes it as a known word. If it is identified as a stop word, or if
 no dictionary recognizes the token, it will be discarded and not indexed
 or searched for.
Normally, the first dictionary that returns a non-<literal>NULL</literal> output determines the result, and any remaining dictionaries are not consulted; but a filtering dictionary

The text gives examples of normalization: linguistic normalization with Ispell and stemmer dictionaries, URL canonicalization, replacement of color names with hexadecimal values, and truncation of fractional digits. It explains that a dictionary takes a token as input and returns an array of lexemes, a single lexeme with the TSL_FILTER flag set, an empty array for a stop word, or NULL if it does not recognize the token. PostgreSQL offers predefined dictionaries and dictionary templates, and new templates can be created when none of the existing ones fits. A text search configuration binds a parser to a separate list of dictionaries for each token type; the dictionaries are consulted in turn, and a token is discarded if it is a stop word or if no dictionary recognizes it.
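
The four possible dictionary results and the in-turn lookup described above can be modeled in a few lines of Python. This is only a sketch under stated assumptions: real PostgreSQL dictionaries are C functions, and every name here (FILTER, run_chain, CHAIN, and the small example dictionaries) is hypothetical and chosen for illustration. The stop-word list is invented, while the color table and the two-digit truncation come from the normalization examples above.

```python
import re
from typing import Callable, List, Optional, Tuple, Union

# Illustrative model only: none of these names belong to PostgreSQL's API.
FILTER = "TSL_FILTER"

# A dictionary returns None (token not recognized), [] (stop word),
# a list of lexemes, or (FILTER, new_token) if it is a filtering dictionary.
Result = Optional[Union[List[str], Tuple[str, str]]]
Dictionary = Callable[[str], Result]

STOP_WORDS = {"a", "the", "of"}  # assumed sample list
COLORS = {"red": "FF0000", "green": "00FF00",
          "blue": "0000FF", "magenta": "FF00FF"}
NUMBER = re.compile(r"\d+(\.\d+)?")


def lowercase_filter(token: str) -> Result:
    """Filtering dictionary: replaces the token with its lowercase form."""
    return (FILTER, token.lower())


def stop_dict(token: str) -> Result:
    """Knows only stop words and returns an empty list for them."""
    return [] if token in STOP_WORDS else None


def color_dict(token: str) -> Result:
    """Maps a known color name to its hexadecimal value."""
    hex_value = COLORS.get(token)
    return [hex_value] if hex_value is not None else None


def number_dict(token: str) -> Result:
    """Keeps at most two fractional digits (truncates, does not round)."""
    if not NUMBER.fullmatch(token):
        return None
    whole, _, frac = token.partition(".")
    return [f"{whole}.{frac[:2]}"] if frac else [whole]


def run_chain(token: str, dictionaries: List[Dictionary]) -> List[str]:
    """Consult each dictionary in turn; return the lexemes to index."""
    for dictionary in dictionaries:
        result = dictionary(token)
        if result is None:
            continue  # not recognized: try the next dictionary
        if isinstance(result, tuple):
            token = result[1]  # filtering dictionary: pass the new token on
            continue
        return result  # lexemes, or [] for a stop word
    return []  # in this model, an unrecognized token is discarded


CHAIN = [lowercase_filter, stop_dict, color_dict, number_dict]
```

With this chain, run_chain("Magenta", CHAIN) yields ["FF00FF"], both run_chain("3.14159265359", CHAIN) and run_chain("3.1415926", CHAIN) yield ["3.14"], and run_chain("The", CHAIN) yields an empty list because the lowercased token is a stop word. Putting the filtering dictionary first is a design choice of the sketch: it lets the later dictionaries match case-insensitively without each one handling case itself.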