[DOC] Character filters - Mapping #8556

Merged
`_analyzers/character-filters/html-character-filter.md` (66 additions, 12 deletions)

The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and `<a>`, from the input text and renders plain text.

## Example: HTML analyzer

The following request applies an `html_strip` character filter to the provided text:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
}
```
{% include copy-curl.html %}

The response contains a token in which the HTML character entities have been converted to their decoded values:

```json
{
"tokens": [
{
"token": """
Commonly used calculus symbols include α, β and θ
""",
"start_offset": 0,
"end_offset": 74,
"type": "word",
"position": 0
}
]
}
```

## Parameters

The `html_strip` character filter can be configured with the following parameter.

| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `escaped_tags` | Optional | Array of strings | An array of HTML element names, specified without the enclosing angle brackets (`< >`). The filter does not remove elements in this list when stripping HTML from the text. For example, setting the array to `["b", "i"]` will prevent the `<b>` and `<i>` elements from being stripped.|
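
You can also define the character filter inline in an `_analyze` request in order to try out `escaped_tags` without creating an index. The following is a minimal sketch; the sample text is illustrative:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip",
      "escaped_tags": [ "b" ]
    }
  ],
  "text": "<p>Keep the <b>bold</b> tag</p>"
}
```

In this sketch, the `<p>` tags are stripped, while the `<b>` tags are preserved in the resulting token.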

## Example: Custom analyzer with lowercase filter

The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase using the `html_strip` character filter and the `lowercase` token filter:

```json
PUT /html_strip_and_lowercase_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_strip_and_lowercase_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [ "html_strip" ],
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
GET /html_strip_and_lowercase_analyzer/_analyze
{
  "analyzer": "html_strip_and_lowercase_analyzer",
  "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
}
```
{% include copy-curl.html %}

In the response, the HTML tags have been removed and the plain text has been converted to lowercase:

```json
{
"tokens": [
{
"token": "welcome",
"start_offset": 4,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "to",
"start_offset": 12,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "opensearch",
"start_offset": 23,
"end_offset": 42,
"type": "<ALPHANUM>",
"position": 2
}
]
}
```

## Example: Custom analyzer that preserves HTML tags

The following example request creates a custom analyzer that strips HTML tags while preserving the `<b>` and `<i>` tags:

```json
PUT /html_strip_preserve_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_filter": {
          "type": "html_strip",
          "escaped_tags": [ "b", "i" ]
        }
      },
      "analyzer": {
        "html_strip_preserve_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": [ "html_filter" ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
GET /html_strip_preserve_analyzer/_analyze
{
  "analyzer": "html_strip_preserve_analyzer",
  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
}
```
{% include copy-curl.html %}

In the response, the `<b>` and `<i>` tags have been retained, as specified in the custom analyzer request:

```json
{
"tokens": [
{
"token": """
This is a <b>bold</b> and <i>italic</i> text.
""",
"start_offset": 0,
"end_offset": 52,
"type": "word",
"position": 0
}
]
}
```

`_analyzers/character-filters/mapping-character-filter.md` (new file, 124 additions)

---
layout: default
title: Mapping
parent: Character filters
nav_order: 120
---

# Mapping character filter

The `mapping` character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string of characters matching a key, it replaces them with the corresponding value. Replacement values can be empty strings.

The filter applies greedy matching, meaning that it replaces the longest matching pattern. For example, given the mappings `I => 1` and `III => 3`, the input `III` is replaced with `3`, not `111`.

The `mapping` character filter helps in scenarios where specific text replacements are required before tokenization.

## Example

The following request configures a `mapping` character filter that converts Roman numerals (such as I, II, or III) into their corresponding Arabic numerals (1, 2, and 3):

```json
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "mapping",
"mappings": [
"I => 1",
"II => 2",
"III => 3",
"IV => 4",
"V => 5"
]
}
],
"text": "I have III apples and IV oranges"
}
```
{% include copy-curl.html %}

The response contains a token where Roman numerals have been replaced with Arabic numerals:

```json
{
"tokens": [
{
"token": "1 have 3 apples and 4 oranges",
"start_offset": 0,
"end_offset": 32,
"type": "word",
"position": 0
}
]
}
```

## Parameters

Use one of the following parameters to configure the key-value map. You must specify either `mappings` or `mappings_path`.

| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `mappings` | Optional | Array | An array of key-value pairs in the format `key => value`. Each key found in the input text will be replaced with its corresponding value. |
| `mappings_path` | Optional | String | The path to a UTF-8 encoded file containing key-value mappings. Each mapping should appear on a new line in the format `key => value`. The path can be absolute or relative to the OpenSearch configuration directory. |
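
For example, the following sketch shows how you might reference a mappings file using `mappings_path`. The file name `analysis/abbreviations.txt` is hypothetical; the file must exist in the OpenSearch config directory and contain one `key => value` mapping per line:

```json
PUT /file-mappings-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "file_abbr_filter": {
          "type": "mapping",
          "mappings_path": "analysis/abbreviations.txt"
        }
      }
    }
  }
}
```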

### Using a custom mapping character filter

You can create a custom mapping character filter by defining your own set of mappings. The following request creates a custom character filter that replaces common abbreviations in text:

```json
PUT /test-index
{
"settings": {
"analysis": {
"analyzer": {
"custom_abbr_analyzer": {
"tokenizer": "standard",
"char_filter": [
"custom_abbr_filter"
]
}
},
"char_filter": {
"custom_abbr_filter": {
"type": "mapping",
"mappings": [
"BTW => By the way",
"IDK => I don't know",
"FYI => For your information"
]
}
}
}
}
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the custom character filter:

```json
GET /test-index/_analyze
{
"tokenizer": "keyword",
"char_filter": [ "custom_abbr_filter" ],
"text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
}
```
{% include copy-curl.html %}

The response shows that the abbreviations were replaced:

```json
{
"tokens": [
{
"token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.",
"start_offset": 0,
"end_offset": 153,
"type": "word",
"position": 0
}
]
}
```