Update README to point to pg_icu_parser
IgKh committed Oct 9, 2022
1 parent 8f6b838 commit 40e4da6
Showing 2 changed files with 15 additions and 5 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -4,7 +4,7 @@ A simple and simplistic PostgreSQL extension providing a full text search dictio

## Overview

-The Hebrew language is traditionally considered to be difficult to perform documents retrieval tasks on. Its rich morphology means that words have a very large amount of inflections on one hand, and widespread presence of homographs leads to ambiguity. All of that means that full text search systems tend to suffer from poor recall out of the box when dealing with Hebrew texts.
+The Hebrew language is traditionally considered to be difficult to perform document retrieval tasks on. Its rich morphology means that words have a very large amount of inflections on one hand, and widespread presence of homographs leads to ambiguity. All of that means that full text search systems tend to suffer from poor recall out of the box when dealing with Hebrew texts.

`pg_hspell` is a PostgreSQL extension that tries to help with such tasks when using the database's built-in full text search subsystem. It uses the dictionary and linguistic information provided by the hspell project to provide a Postgres dictionary template which lemmatizes Hebrew words as part of a configuration pipeline.

@@ -43,13 +43,13 @@ $ make install PG_CONFIG=/path/to/pg_config

To load the extension into a database, execute the following SQL command as a suitably permissioned user:

-```
+```sql
CREATE EXTENSION pg_hspell;
```

This will place into the current schema a full text dictionary called `hspell` which is configured with a bundled list of common Hebrew stop words. To create a dictionary with a different stop word list (or none at all), do something like the following SQL command:

-```
+```sql
CREATE TEXT SEARCH DICTIONARY my_hspell_dict (
TEMPLATE = hspell,
[ STOPWORDS = my_stop_words_file ]
@@ -80,7 +80,7 @@ If processing dotted text is desired, Niqqud has to be stripped prior to passing

### A note about parsing

-Please note that the default text search parser included with PostgreSQL does not correctly handle corner cases specific to Hebrew where characters usually considered to be punctuation (i.e. apostrophe and quotation mark) do not act as such when embedded into a work. Such cases are common in Hebrew computer texts in acronyms and abbreviations, which may not be tokenized as expected.
+Please note that the default text search parser included with PostgreSQL does not correctly handle corner cases specific to Hebrew where characters usually considered to be punctuation (i.e. apostrophe and quotation mark) do not act as such when embedded into a word. Such cases are common in Hebrew computer texts in acronyms and abbreviations, which may not be tokenized as expected.

For example:
```
@@ -92,7 +92,7 @@ postgres=# select * from ts_parse('default', $$ נתב"ג $$);
2 | ג
```

-This is not something specific to `pg_hspell` or within its' scope to address. If there are specific instances that are particularly bothersome, they may be worked around with a [Thesaurus dictionary](https://www.postgresql.org/docs/current/textsearch-dictionaries.html#TEXTSEARCH-THESAURUS).
+This is not something specific to `pg_hspell` or within its' scope to address. If there are specific instances that are particularly bothersome, they may be worked around with a [Thesaurus dictionary](https://www.postgresql.org/docs/current/textsearch-dictionaries.html#TEXTSEARCH-THESAURUS). You may also consider the parser provided by the [pg_icu_parser extension](https://github.com/IgKh/pg_icu_parser), which handles this correctly.

## License

10 changes: 10 additions & 0 deletions dict_hspell.c
@@ -1,3 +1,13 @@
+/*
+ * dict_hspell.c
+ *
+ * Copyright (c) 2022 Igor Khanin
+ *
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, You can obtain one at https://mozilla.org/MPL/2.0/.
+ */
+
// PostgreSQL server API includes
#include <postgres.h>
