v1.4: Separator and non-separator tokens (#2553)

* separator and non-separator tokens: first draft
* improve wording
* address reviewer feedback
* fix conflict with parent branch
meilisearch · Sep 14, 2023 · 338f2c5
1 parent f0054ab
commit 338f2c5

Showing 4 changed files with 240 additions and 3 deletions.
diff --git a/.code-samples.meilisearch.yaml b/.code-samples.meilisearch.yaml
@@ -1120,6 +1120,28 @@ search_parameter_guide_attributes_to_search_on_1: |-
       "q": "adventure",
       "attributesToSearchOn": ["overview"]
     }'
+get_separator_tokens_1: |-
+  curl \
+    -X GET 'http://localhost:7700/indexes/articles/settings/separator-tokens'
+update_separator_tokens_1: |-
+  curl \
+    -X PUT 'http://localhost:7700/indexes/articles/settings/separator-tokens' \
+    -H 'Content-Type: application/json'  \
+    --data-binary '["|", "&hellip;"]'
+reset_separator_tokens_1: |-
+  curl \
+    -X DELETE 'http://localhost:7700/indexes/articles/settings/separator-tokens'
+get_non_separator_tokens_1: |-
+  curl \
+    -X GET 'http://localhost:7700/indexes/articles/settings/non-separator-tokens'
+update_non_separator_tokens_1: |-
+  curl \
+    -X PUT 'http://localhost:7700/indexes/articles/settings/non-separator-tokens' \
+    -H 'Content-Type: application/json'  \
+    --data-binary '["@", "#"]'
+reset_non_separator_tokens_1: |-
+  curl \
+    -X DELETE 'http://localhost:7700/indexes/articles/settings/non-separator-tokens'
 get_dictionary_1: |-
   curl \
     -X GET 'http://localhost:7700/indexes/books/settings/dictionary'
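
The curl samples added above can also be expressed as request objects. The following is a minimal Python sketch using only the standard library; the helper name `build_settings_request` is hypothetical, and the base URL and `articles` index are taken from the samples. It only builds the requests, so it runs without a Meilisearch instance.

```python
import json
import urllib.request

# Same local instance the curl samples point at.
BASE_URL = "http://localhost:7700"


def build_settings_request(index_uid, setting, method, tokens=None):
    """Build (but do not send) a request for a token-list settings route.

    Hypothetical helper: `setting` is the route suffix, for example
    "separator-tokens" or "non-separator-tokens".
    """
    url = f"{BASE_URL}/indexes/{index_uid}/settings/{setting}"
    headers = {}
    body = None
    if tokens is not None:
        # PUT routes expect a JSON array of strings as the body.
        body = json.dumps(tokens).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=body, headers=headers, method=method)


# Mirrors update_separator_tokens_1 and reset_non_separator_tokens_1.
update = build_settings_request("articles", "separator-tokens", "PUT", ["|", "&hellip;"])
reset = build_settings_request("articles", "non-separator-tokens", "DELETE")

# Sending requires a running instance, for example:
#   with urllib.request.urlopen(update) as response:
#       print(json.load(response))
```

Keeping request construction separate from sending makes it easy to inspect the method, URL, and body before anything reaches the server.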

diff --git a/learn/inner_workings/datatypes.mdx b/learn/inner_workings/datatypes.mdx
@@ -12,14 +12,31 @@ String tokenization is the process of **splitting a string into a list of indivi
 
 A string is passed to a tokenizer and is then broken into separate string tokens. A token is a **word**.
 
-- For Latin-based languages, the words are separated by **space**.
-- For Kanji characters, the words are separated by **character**.
+### Tokenization
 
-For Latin-based languages, there are two kinds of **space separators**:
+Tokenization relies on two main processes to identify words and separate them into tokens: separators and dictionaries.
+
+#### Separators
+
+Separators are characters that indicate where one word ends and another word begins. In languages using the Latin alphabet, for example, words are usually delimited by white space. In Japanese, word boundaries are more commonly indicated in other ways, such as appending particles like `に` and `で` to the end of a word.
+
+There are two kinds of separators in Meilisearch: soft and hard. Hard separators signal a significant context switch such as a new sentence or paragraph. Soft separators only delimit one word from another but do not imply a major change of subject.
+
+The list below presents some of the most common separators in languages using the Latin alphabet:
 
 - **Soft spaces** (distance: 1): whitespaces, quotes, `'-' | '_' | '\'' | ':' | '/' | '\\' | '@' | '"' | '+' | '~' | '=' | '^' | '*' | '#'`
 - **Hard spaces** (distance: 8): `'.' | ';' | ',' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}'| '|'`
 
+For more separators, including those used in other writing systems like Cyrillic and Thai, [consult this exhaustive list](https://docs.rs/charabia/0.8.3/src/charabia/separators.rs.html#16-62).
+
+#### Dictionaries
+
+In the tokenization process, a dictionary is a list of character groups that should each be treated as a single term. Dictionaries are particularly useful when identifying words in languages like Japanese, where words are not always marked by separator tokens.
+
+Meilisearch comes with a number of general-use dictionaries for its officially supported languages. When working with documents containing many domain-specific terms, such as legal documents or academic papers, providing a [custom dictionary](/reference/api/settings#dictionary) may improve search result relevancy.
+
+### Distance
+
 Distance plays an essential role in determining whether documents are relevant since [one of the ranking rules is the **proximity** rule](/learn/core_concepts/relevancy). The proximity rule sorts the results by increasing distance between matched query terms. Then, two words separated by a soft space are closer and thus considered **more relevant** than two words separated by a hard space.
 
 After the tokenizing process, each word is indexed and stored in the global dictionary of the corresponding index.
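
The relationship between separators and distance described in this file can be illustrated with a small Python sketch. This is a simplified model, not Meilisearch's or Charabia's actual tokenizer: the `tokenize` function is hypothetical, it uses only the Latin separators listed above, and it assumes that a run of mixed separators counts as hard if it contains any hard separator.

```python
# Separators copied from the soft and hard lists above (Latin alphabet only).
SOFT = set(" -_':/\\@\"+~=^*#")
HARD = set(".;,!?()[]{}|")
SOFT_DISTANCE, HARD_DISTANCE = 1, 8


def tokenize(text):
    """Return (word, position) pairs for a string.

    The position of each word grows by the distance of the separator run
    that precedes it: 1 for a soft run, 8 if the run contains a hard separator.
    """
    tokens = []
    word = ""
    position = 0
    gap = 0  # largest separator distance seen since the previous word
    for char in text:
        if char in HARD or char in SOFT:
            if word:
                tokens.append((word, position))
                word = ""
            gap = max(gap, HARD_DISTANCE if char in HARD else SOFT_DISTANCE)
        else:
            if not word:
                if tokens:
                    position += gap  # a new word starts after a separator run
                gap = 0
            word += char
    if word:
        tokens.append((word, position))
    return tokens


tokens = tokenize("Hello, world again")
```

In this model, `Hello` and `world` end up 8 positions apart because a comma separates them, while `world` and `again` are only 1 position apart, which is why the proximity rule would treat the second pair as closer.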

diff --git a/learn/what_is_meilisearch/telemetry.mdx b/learn/what_is_meilisearch/telemetry.mdx
@@ -197,6 +197,8 @@ This list is liable to change with every new version of Meilisearch. It's not be
 | `displayed_attributes.total`                       | Number of displayed attributes                                                              | 3
 | `displayed_attributes.with_wildcard`               | `true` if `*` is specified as a displayed attribute, otherwise `false`                      | false
 | `stop_words.total`                                 | Number of stop words                                                                        | 3
+| `separator_tokens.total`                           | Number of separator tokens                                                                  | 3
+| `non_separator_tokens.total`                       | Number of non-separator tokens                                                              | 3
 | `dictionary.total`                                 | Number of words in the dictionary                                                           | 3
 | `synonyms.total`                                   | Number of synonyms                                                                          | 3
 | `per_index_uid`                                    | `true` if the `uid` is used to fetch an index stat resource, otherwise `false`              | false

diff --git a/reference/api/settings.mdx b/reference/api/settings.mdx
@@ -32,6 +32,8 @@ By default, the settings object looks like this. All fields are modifiable.
     "exactness"
   ],
   "stopWords": [],
+  "nonSeparatorTokens": [],
+  "separatorTokens": [],
   "dictionary": [],
   "synonyms": {},
   "distinctAttribute": null,
@@ -95,6 +97,8 @@ Get the settings of an index.
     "exactness"
   ],
   "stopWords": [],
+  "nonSeparatorTokens": [],
+  "separatorTokens": [],
   "dictionary": [],
   "synonyms": {},
   "distinctAttribute": null,
@@ -146,6 +150,8 @@ If the provided index does not exist, it will be created.
 | **[`pagination`](#pagination)**                      | Object           | [Default object](#pagination-object)                                                             | Pagination settings                                                              |
 | **[`rankingRules`](#ranking-rules)**                 | Array of strings | `["words",`<br />`"typo",`<br />`"proximity",`<br />`"attribute",`<br />`"sort",`<br />`"exactness"]` | List of ranking rules in order of importance                                |
 | **[`searchableAttributes`](#searchable-attributes)** | Array of strings | All attributes: `["*"]`                                                                          | Fields in which to search for matching query words sorted by order of importance |
+| **[`separatorTokens`](#separator-tokens)**           | Array of strings | Empty                                                                                            | List of characters delimiting where one term begins and ends                     |
+| **[`nonSeparatorTokens`](#non-separator-tokens)**    | Array of strings | Empty                                                                                            | List of characters not delimiting where one term begins and ends                 |
 | **[`sortableAttributes`](#sortable-attributes)**     | Array of strings | Empty                                                                                            | Attributes to use when sorting search results                                    |
 | **[`stopWords`](#stop-words)**                       | Array of strings | Empty                                                                                            | List of words ignored by Meilisearch when present in search queries              |
 | **[`synonyms`](#synonyms)**                          | Object           | Empty                                                                                            | List of associated words treated similarly                                       |
@@ -1144,6 +1150,196 @@ Reset the searchable attributes of the index to the default value.
 
 You can use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
 
+## Separator tokens
+
+Configure strings as custom separator tokens indicating where a word ends and begins.
+
+Tokens in the `separatorTokens` list are added on top of [Meilisearch's default list of separators](/learn/advanced/datatypes#string). To remove separators from the default list, use [the `nonSeparatorTokens` setting](#non-separator-tokens).
+
+### Get separator tokens
+
+<RouteHighlighter method="GET" route="/indexes/{index_uid}/settings/separator-tokens" />
+
+Get an index's list of custom separator tokens.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Example
+
+<CodeSamples id="get_separator_tokens_1"/>
+
+##### Response: `200 Ok`
+
+```json
+[]
+```
+
+### Update separator tokens
+
+<RouteHighlighter method="PUT" route="/indexes/{index_uid}/settings/separator-tokens" />
+
+Update an index's list of custom separator tokens.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Body
+
+```
+["|", "&hellip;"]
+```
+
+An array of strings, with each string indicating a word separator.
+
+#### Example
+
+<CodeSamples id="update_separator_tokens_1"/>
+
+##### Response: `202 Accepted`
+
+```json
+{
+  "taskUid": 1,
+  "indexUid": "movies",
+  "status": "enqueued",
+  "type": "settingsUpdate",
+  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
+}
+```
+
+Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
+
+### Reset separator tokens
+
+<RouteHighlighter method="DELETE" route="/indexes/{index_uid}/settings/separator-tokens"/>
+
+Reset an index's list of custom separator tokens to its default value, `[]`.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Example
+
+<CodeSamples id="reset_separator_tokens_1"/>
+
+##### Response: `202 Accepted`
+
+```json
+{
+  "taskUid": 1,
+  "indexUid": "movies",
+  "status": "enqueued",
+  "type": "settingsUpdate",
+  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
+}
+```
+
+Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
+
+## Non-separator tokens
+
+Remove tokens from Meilisearch's default [list of word separators](/learn/advanced/datatypes#string).
+
+### Get non-separator tokens
+
+<RouteHighlighter method="GET" route="/indexes/{index_uid}/settings/non-separator-tokens" />
+
+Get an index's list of non-separator tokens.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Example
+
+<CodeSamples id="get_non_separator_tokens_1"/>
+
+##### Response: `200 Ok`
+
+```json
+[]
+```
+
+### Update non-separator tokens
+
+<RouteHighlighter method="PUT" route="/indexes/{index_uid}/settings/non-separator-tokens" />
+
+Update an index's list of non-separator tokens.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Body
+
+```
+["@", "#"]
+```
+
+An array of strings, with each string indicating a token present in the [list of word separators](/learn/advanced/datatypes#string).
+
+#### Example
+
+<CodeSamples id="update_non_separator_tokens_1"/>
+
+##### Response: `202 Accepted`
+
+```json
+{
+  "taskUid": 1,
+  "indexUid": "movies",
+  "status": "enqueued",
+  "type": "settingsUpdate",
+  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
+}
+```
+
+Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
+
+### Reset non-separator tokens
+
+<RouteHighlighter method="DELETE" route="/indexes/{index_uid}/settings/non-separator-tokens"/>
+
+Reset an index's list of non-separator tokens to its default value, `[]`.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Example
+
+<CodeSamples id="reset_non_separator_tokens_1"/>
+
+##### Response: `202 Accepted`
+
+```json
+{
+  "taskUid": 1,
+  "indexUid": "movies",
+  "status": "enqueued",
+  "type": "settingsUpdate",
+  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
+}
+```
+
+Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
+
 ## Sortable attributes
 
 Attributes that can be used when sorting search results using the [`sort` search parameter](/reference/api/search#sort).
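
Returning to the `separatorTokens` and `nonSeparatorTokens` settings documented in this diff, the way they combine with the default separator list can be sketched in Python. This is an illustrative model under stated assumptions, not Meilisearch's implementation: `DEFAULT_SEPARATORS` is a small hand-picked subset of the real list, both function names are hypothetical, and the sketch assumes a token present in both settings ends up as a non-separator.

```python
# Small illustrative subset of the default separators (assumption, not the full list).
DEFAULT_SEPARATORS = {" ", "-", "_", "@", "#", ".", ",", "!", "?"}


def effective_separators(separator_tokens, non_separator_tokens):
    """Add custom separators to the defaults, then remove non-separators."""
    return (DEFAULT_SEPARATORS | set(separator_tokens)) - set(non_separator_tokens)


def split_words(text, separators):
    """Split text on any separator string, trying longer separators first."""
    ordered = sorted((s for s in separators if s), key=len, reverse=True)
    words, current, i = [], "", 0
    while i < len(text):
        match = next((s for s in ordered if text.startswith(s, i)), None)
        if match:
            if current:
                words.append(current)
                current = ""
            i += len(match)  # skip the whole separator, e.g. "&hellip;"
        else:
            current += text[i]
            i += 1
    if current:
        words.append(current)
    return words


# Same values as the update examples: "|" and "&hellip;" become separators,
# while "@" and "#" stop being separators.
separators = effective_separators(["|", "&hellip;"], ["@", "#"])
words = split_words("ask @meili|about #search&hellip;now", separators)
```

With these settings, `@meili` and `#search` stay intact as single words, while `|` and `&hellip;` now mark word boundaries.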