[DOC] Character filters - Mapping #8556

Merged
4 changes: 2 additions & 2 deletions _analyzers/character-filters/html-character-filter.md
@@ -1,11 +1,11 @@
---
layout: default
-title: html_strip character filter
+title: HTML Strip Character Filter
parent: Character filters
nav_order: 100
---

-# `html_strip` character filter
+# HTML strip character filter

The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and `<a>`, from the input text and renders plain text. The filter can be configured to preserve certain tags or decode specific HTML entities, such as `&nbsp;`, into spaces.

99 changes: 99 additions & 0 deletions _analyzers/character-filters/mapping-character-filter.md
@@ -0,0 +1,99 @@
---
layout: default
title: Mapping Character Filter
parent: Character Filters
nav_order: 120
---

# Mapping character filter

The `mapping` character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string of characters matching a key, it replaces them with the corresponding value.

Matching is greedy: when more than one key matches at a given position, the longest match is used. A replacement value can be an empty string, which removes the matched characters.

The mapping character filter helps in scenarios where specific text replacements are required before tokenization.
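
The greedy behavior described above can be illustrated with a small Python sketch. This is only a simulation for building intuition, not the filter's actual implementation: the function `apply_mappings` and the sample keys are hypothetical names chosen for this example.

```python
def apply_mappings(text, mappings):
    """Replace keys with values, preferring the longest key at each position.

    Hypothetical helper that imitates greedy longest-match replacement.
    """
    # Try longer keys first so that "abc" wins over "ab" and "a".
    keys = sorted(mappings, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)
                break
        else:
            # No key matches at this position, so copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# The empty value for "-" removes every hyphen from the input.
mappings = {"a": "1", "ab": "2", "abc": "3", "-": ""}
print(apply_mappings("abc-ab-a", mappings))  # prints 321
```

Because the longest key is tried first, `abc` becomes `3` rather than `2c` or `1bc`.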

## Example of the mapping filter

The following example demonstrates a mapping filter that converts Roman numerals (I, II, III, IV, and V) into their corresponding Arabic numerals (1, 2, 3, 4, and 5).

```json
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "mapping",
"mappings": [
"I => 1",
"II => 2",
"III => 3",
"IV => 4",
"V => 5"
]
}
],
"text": "I have III apples and IV oranges"
}
```

Applying these mappings to the text "I have III apples and IV oranges" produces the following text:

```
1 have 3 apples and 4 oranges
```

## Configuring the mapping filter

You can configure the mappings in one of two ways:

1. `mappings`: Provide an array of key-value pairs in the form `key => value`. Each occurrence of a key in the input text is replaced with the corresponding value.
2. `mappings_path`: Specify the path to a UTF-8-encoded file containing key-value mappings. Each mapping must appear on its own line in the format `key => value`. The path can be absolute or relative to the OpenSearch configuration directory.
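
To make the `key => value` line format concrete, here is a minimal Python sketch that turns such lines into a dictionary. It is a simplified illustration under stated assumptions, not the parser OpenSearch uses: the function name `parse_mapping_lines` is hypothetical, blank lines are skipped as an assumption, and escape sequences are not handled.

```python
def parse_mapping_lines(lines):
    """Parse 'key => value' lines into a dict.

    Hypothetical, simplified parser: it skips blank lines (an assumption)
    and does not interpret escape sequences.
    """
    mappings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, arrow, value = line.partition("=>")
        if not arrow:
            raise ValueError(f"expected 'key => value', got: {line!r}")
        mappings[key.strip()] = value.strip()
    return mappings


lines = [
    "BTW => By the way",
    "IDK => I don't know",
    "",
    "FYI => For your information",
]
mappings = parse_mapping_lines(lines)
print(mappings["IDK"])  # prints I don't know
```

To read the lines from a real file, pass an open file object, for example one created with `open(path, encoding="utf-8")`, where `path` is whatever file you maintain.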

### Using a custom mapping character filter

You can create a custom mapping character filter by defining your own set of mappings. The following example creates a custom character filter that expands common abbreviations in text:

```json
PUT /text-index
{
"settings": {
"analysis": {
"analyzer": {
"custom_abbr_analyzer": {
"tokenizer": "standard",
"char_filter": [
"custom_abbr_filter"
]
}
},
"char_filter": {
"custom_abbr_filter": {
"type": "mapping",
"mappings": [
"BTW => By the way",
"IDK => I don't know",
"FYI => For your information"
]
}
}
}
}
}
```

The following request applies the `custom_abbr_filter` character filter, together with the `keyword` tokenizer, to analyze text containing these abbreviations:

```json
GET /text-index/_analyze
{
"tokenizer": "keyword",
"char_filter": [ "custom_abbr_filter" ],
"text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
}
```

The character filter replaces each abbreviation with its mapped value, so the response contains the following text:

```
For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.
```