PostgreSQL text search parser using ICU boundary analysis
The first stage of full text search is parsing: breaking document and query texts into individual tokens (distinct words, numbers, etc.). However, there are languages for which this task is not trivial. The most prominent examples are East Asian languages (such as Chinese, Japanese, and Korean), where words are not typically separated by whitespace or punctuation. Another example is Hebrew: while words are separated by whitespace and punctuation, some characters may or may not be considered punctuation marks depending on the context.
The default parser included with PostgreSQL's full text search subsystem provides rather unsatisfactory results in those cases. pg_icu_parser is an extension which provides a custom full text search parser implementation that uses ICU word boundary analysis routines to extract tokens out of source text. These routines implement the algorithms specified in Unicode Standard Annex #29 and can provide reasonable results across many languages.
Requirements:
- PostgreSQL 10+, compiled with ICU support
Currently pg_icu_parser has to be built from source code. Make sure you have the development support files (headers, etc.) for PostgreSQL available.
To build and install, run:
$ make install
This will build against and install into the PostgreSQL installation determined by the first instance of pg_config
found in the current PATH. To target a specific installation (or one not in PATH):
$ make install PG_CONFIG=/path/to/pg_config
The extension is also available on PGXN.
To load the extension into a database, execute the following SQL command as a user with suitable permissions:
CREATE EXTENSION pg_icu_parser;
This will register the custom text parser in the current schema. To make use of it, a text search configuration must be created as described in the PostgreSQL manual. For example:
CREATE TEXT SEARCH CONFIGURATION sample (parser = icu_parser);
ALTER TEXT SEARCH CONFIGURATION sample ADD MAPPING FOR word WITH english_stem;
ALTER TEXT SEARCH CONFIGURATION sample ADD MAPPING FOR number, ideo, kana WITH simple;
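Once such a configuration exists, it can be used like any other text search configuration. For example (illustrative only; the exact output depends on the dictionaries mapped above):
-- Inspect how the parser and dictionaries process a piece of text.
SELECT * FROM ts_debug('sample', 'PostgreSQL 全文検索');
-- Build a tsvector using the new configuration.
SELECT to_tsvector('sample', 'PostgreSQL 全文検索');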
The token types which can be emitted by the parser are word, number, kana, ideo, and blank; these align with the word break tags supported by ICU. There are no restrictions as to which dictionaries can be used.
pg_icu_parser exposes a single configuration parameter:
pg_icu_parser.locale - string, optional. The ICU locale to use with the boundary analysis routines. If not set, defaults to en. Typically there is no need to set this parameter, as there are no word boundary detection rules that are sensitive to the locale.
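Should it ever be necessary, the parameter can be set like any other configuration parameter, for instance per session (the locale value here is purely illustrative):
SET pg_icu_parser.locale = 'he';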
As described above, the main strength of pg_icu_parser versus the default parser already built into PostgreSQL is its better tokenization results across various languages. In addition, pg_icu_parser does not depend on the database's locale setting or the underlying operating system for determining what is a letter.
However, there are several trade-offs to be aware of which can affect whether pg_icu_parser is appropriate for a particular use case:
- The default parser can recognize a wide variety of patterns as tokens, including URLs, e-mail addresses, various distinct kinds of numeric values, etc. pg_icu_parser, on the other hand, has significantly fewer token types (and the few that are supported are much coarser), and can't detect any complex patterns.
- At the moment pg_icu_parser does not support the ts_headline function. Maybe in the future...
- pg_icu_parser is much slower than the default parser. However, if the server encoding of the database is UTF-8, a fast path is used which reduces the overhead. See the benchmark section below.
- As a third-party extension, pg_icu_parser can only be used where full control of the PostgreSQL installation is available; that is, it probably can't be used with managed PostgreSQL solutions.
The following non-scientific benchmark can give a sense of how much of a slowdown one might expect from using pg_icu_parser instead of the default parser. The scenario tested is a query that counts the number of non-blank tokens in each document in a corpus. The corpus used is the HaAretz Corpus, consisting of 27,139 short Hebrew-language articles. Values are the average execution time in milliseconds over 10 rounds, as reported by EXPLAIN ANALYZE:
| Server encoding | default (ms) | pg_icu_parser (ms) | Slowdown |
|---|---|---|---|
| UTF-8 | 2,566.9974 | 2,884.3696 | -12.3% |
| ISO-8859-8 | 3,529.6487 | 5,059.3616 | -43.3% |
Of course, different environments and different corpora may yield different results; YMMV.
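For reference, the token-counting scenario described above could be expressed with a query along the following lines. This is only a sketch; the articles table and body column are illustrative names rather than part of the benchmark setup:
-- Count non-blank tokens per document (illustrative table/column names).
SELECT a.id, count(*) AS token_count
FROM articles AS a,
     LATERAL ts_parse('icu_parser', a.body) AS p(tokid, token)
WHERE p.tokid <> (SELECT tokid FROM ts_token_type('icu_parser') WHERE alias = 'blank')
GROUP BY a.id;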
pg_icu_parser is licensed under the Mozilla Public License 2.0.