Tokenization, Lexemes, and Document Preprocessing

principle token classes depend on the specific application, but for most purposes it is adequate to use a predefined set of classes. <productname>PostgreSQL</productname> uses a <firstterm>parser</firstterm> to perform this step. A standard parser is provided, and custom parsers can be created for specific needs.
</para>
</listitem>
<listitem>
<para>
<emphasis>Converting tokens into <firstterm>lexemes</firstterm></emphasis>. A lexeme is a string, just like a token, but it has been <firstterm>normalized</firstterm> so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as <literal>s</literal> or <literal>es</literal> in English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates <firstterm>stop words</firstterm>, which are words that are so common that they are useless for searching. (In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) <productname>PostgreSQL</productname> uses <firstterm>dictionaries</firstterm> to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs.
</para>
</listitem>
<listitem>
<para>
<emphasis>Storing preprocessed documents optimized for searching</emphasis>. For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes it is often desirable to store positional information to use for <firstterm>proximity ranking</firstterm>, so that a document that contains a more <quote>dense</quote> region of query words is assigned a higher rank than one with scattered query words.
</para>
</listitem>
</itemizedlist>
<para>
Dictionaries allow fine-grained control over how tokens are normalized.
With appropriate dictionaries, you can:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Define stop words that should not be indexed.
</para>
</listitem>
<listitem>
<para>
Map synonyms to a single word using <application>Ispell</application>.
</para>
</listitem>
<listitem>
<para>
Map phrases to a single word using a thesaurus.
</para>
</listitem>
<listitem>
<para>
Map different variations of a word to a canonical form using an <application>Ispell</application> dictionary.
</para>
</listitem>
<listitem>
<para>
Map different variations of a word to a canonical form using <application>Snowball</application> stemmer rules.
</para>
</listitem>
</itemizedlist>
<para>
A data type <type>tsvector</type> is provided for storing preprocessed documents, along with a type <type>tsquery</type> for representing processed queries (<xref linkend="datatype-textsearch"/>). There are many functions and operators available for these data types (<xref linkend="functions-textsearch"/>), the most important of which is the match operator <literal>@@</literal>, which we introduce in <xref linkend="textsearch-matching"/>. Full text searches can be accelerated using indexes (<xref linkend="textsearch-indexes"/>).
</para>
<sect2 id="textsearch-document">
<title>What Is a Document?</title>
<indexterm zone="textsearch-document">
<primary>document</primary>
<secondary>text search</secondary>
</indexterm>
<para>
A <firstterm>document</firstterm> is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes

Full text search preprocesses documents by parsing text into tokens and then converting the tokens into normalized lexemes. PostgreSQL uses a parser for the first step and dictionaries for the second; a standard parser and various standard dictionaries are provided, and custom ones can be created. Dictionaries give fine-grained control over normalization, including stop-word removal, synonym and phrase mapping, and stemming. Preprocessed documents are stored in the tsvector data type, and processed queries are represented by the tsquery data type. The match operator @@ is the most important operator on these types, and indexes can be used to accelerate full text searches.
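
The steps summarized above can be tried out with a few queries. This is a minimal sketch, not a complete recipe: it assumes the built-in english text search configuration and the english_stem Snowball dictionary, both of which ship with a standard installation, and the sample sentence is purely illustrative.

```sql
-- Parsing and dictionary normalization in one call: to_tsvector returns
-- sorted lexemes with their positions. The stop words (a, on, it) are
-- dropped, and 'rats' is reduced to the lexeme 'rat'.
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
-- 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

-- A single dictionary can be exercised on its own with ts_lexize.
SELECT ts_lexize('english_stem', 'stars');  -- {star}
SELECT ts_lexize('english_stem', 'a');      -- {} (recognized as a stop word)

-- The match operator @@ tests a tsvector against a tsquery.
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats')
       @@ to_tsquery('english', 'cat & rat');
-- t
```

Note that the positions in the first result still count the discarded stop words, which is what makes the stored positional information usable for proximity ranking.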