Text Search Parsers in PostgreSQL

<programlisting>
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
</programlisting>

    The same, but counting only word occurrences with weight <literal>A</literal>
    or <literal>B</literal>:

<programlisting>
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
</programlisting>
   </para>

  </sect2>

 </sect1>

 <sect1 id="textsearch-parsers">
  <title>Parsers</title>

  <para>
   Text search parsers are responsible for splitting raw document text
   into <firstterm>tokens</firstterm> and identifying each token's type, where
   the set of possible types is defined by the parser itself.
   Note that a parser does not modify the text at all; it simply
   identifies plausible word boundaries. Because of this limited scope,
   there is less need for application-specific custom parsers than there is
   for custom dictionaries. At present <productname>PostgreSQL</productname>
   provides just one built-in parser, which has been found to be useful for a
   wide range of applications.
  </para>

  <para>
   The built-in parser is named <literal>pg_catalog.default</literal>.
   It recognizes 23 token types, shown in <xref linkend="textsearch-default-parser"/>.
  </para>

  <table id="textsearch-default-parser">
   <title>Default Parser's Token Types</title>
   <tgroup cols="3">
    <colspec colname="col1" colwidth="2*"/>
    <colspec colname="col2" colwidth="2*"/>
    <colspec colname="col3" colwidth="3*"/>
    <thead>
     <row>
      <entry>Alias</entry>
      <entry>Description</entry>
      <entry>Example</entry>
     </row>
    </thead>
    <tbody>
     <row>
      <entry><literal>asciiword</literal></entry>
      <entry>Word, all ASCII letters</entry>
      <entry><literal>elephant</literal></entry>
     </row>
     <row>
      <entry><literal>word</literal></entry>
      <entry>Word, all letters</entry>
      <entry><literal>mañana</literal></entry>
     </row>
     <row>
      <entry><literal>numword</literal></entry>
      <entry>Word, letters and digits</entry>
      <entry><literal>beta1</literal></entry>
     </row>
     <row>
      <entry><literal>asciihword</literal></entry>
      <entry>Hyphenated word, all ASCII</entry>
      <entry><literal>up-to-date</literal></entry>
     </row>
     <row>
      <entry><literal>hword</literal></entry>
      <entry>Hyphenated word, all letters</entry>
      <entry><literal>lógico-matemática</literal></entry>
     </row>
     <row>
      <entry><literal>numhword</literal></entry>
      <entry>Hyphenated word, letters and digits</entry>
      <entry><literal>postgresql-beta1</literal></entry>
     </row>
     <row>
      <entry><literal>hword_asciipart</literal></entry>
      <entry>Hyphenated word part, all ASCII</entry>
      <entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry>
     </row>
     <row>
      <entry><literal>hword_part</literal></entry>
      <entry>Hyphenated word part, all letters</entry>
      <entry><literal>lógico</literal> or <literal>matemática</literal>
       in the context <literal>lógico-matemática</literal></entry>
     </row>
     <row>
      <entry><literal>hword_numpart</literal></entry>
      <entry>Hyphenated word part, letters and digits</entry>
      <entry><literal>beta1</literal> in the context <literal>postgresql-beta1</literal></entry>
     </row>
     <row>
      <entry><literal>email</literal></entry>
      <entry>Email address</entry>
      <entry><literal>foo@example.com</literal></entry>
     </row>
     <row>
      <entry><literal>protocol</literal></entry>
      <entry>Protocol head</entry>
      <entry><literal>http://</literal></entry>
     </row>
     <row>
      <entry><literal>url</literal></entry>
      <entry>URL</entry>
      <entry><literal>example.com/stuff/index.html</literal></entry>
     </row>
     <row>
      <entry><literal>host</literal></entry>

This section discusses text search parsers, which split raw document text into tokens and identify each token's type. The built-in parser in PostgreSQL, 'pg_catalog.default', recognizes 23 token types, including asciiword, word, numword, hyphenated words, email, protocol, URL, host, and various numeric and special character types. A table provides descriptions and examples for each token type.
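
As a minimal sketch of how the parser's output can be inspected, the queries below use the standard functions ts_token_type and ts_parse with the built-in parser. The sample document string is an assumption chosen only to exercise a few token types from the table; the exact rows returned depend on the server version and locale, so none are shown here.

```sql
-- List the token types declared by the built-in parser.
-- According to the text above, this should return 23 rows.
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;

-- Split a sample string into tokens and label each one with its type alias.
-- The input text is illustrative only (an assumption, not from the source).
SELECT p.token, t.alias
FROM ts_parse('default', 'Mail foo@example.com about postgresql-beta1') AS p
JOIN ts_token_type('default') AS t USING (tokid);
```

The second query joins on tokid because ts_parse reports only the numeric token type; ts_token_type supplies the corresponding alias and description.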