Doc review
Signed-off-by: Fanit Kolchina <[email protected]>
kolchfa-aws committed Jan 3, 2025
1 parent 8ae9f5b commit 7252117
Showing 2 changed files with 111 additions and 32 deletions.
80 changes: 67 additions & 13 deletions _analyzers/character-filters/html-character-filter.md
---
layout: default
title: HTML strip
parent: Character filters
nav_order: 100
---
# HTML strip character filter

The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and `<a>`, from the input text. It also decodes HTML entities, such as `&amp;`, into their corresponding characters.

## Example: HTML analyzer

The following request applies an `html_strip` character filter to the provided text:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
}
```
{% include copy-curl.html %}

The response contains a single token in which the HTML entities have been converted to their decoded characters:

```json
{
  "tokens": [
    {
      "token": """
Commonly used calculus symbols include α, β and θ
""",
      "start_offset": 0,
      "end_offset": 74,
      "type": "word",
      "position": 0
    }
  ]
}
```

## Parameters

The `html_strip` character filter can be configured with the following parameter.

| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `escaped_tags` | Optional | Array of strings | An array of HTML element names, specified without the enclosing angle brackets (`< >`). The filter does not remove elements in this list when stripping HTML from the text. For example, setting the array to `["b", "i"]` will prevent the `<b>` and `<i>` elements from being stripped.|
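
You can also test `escaped_tags` directly in an `_analyze` request. The following minimal sketch (the sample text is illustrative) strips all HTML except the `<b>` tags:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip",
      "escaped_tags": ["b"]
    }
  ],
  "text": "<p>This is <b>bold</b> text</p>"
}
```
{% include copy-curl.html %}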

## Example: Custom analyzer with lowercase filter

The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` character filter and the `lowercase` token filter:

```json
PUT /html_strip_and_lowercase_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
GET /html_strip_and_lowercase_analyzer/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
}
```
{% include copy-curl.html %}

In the response, the HTML tags have been removed and the plain text has been converted to lowercase:

```json
{
  "tokens": [
    {
      "token": "welcome",
      "start_offset": 4,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "to",
      "start_offset": 12,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "opensearch",
      "start_offset": 23,
      "end_offset": 42,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
```

## Example: Custom analyzer that preserves HTML tags
The following example request creates a custom analyzer that strips HTML tags while preserving the `<b>` and `<i>` tags:

```json
PUT /html_strip_preserve_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_filter": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"]
        }
      },
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_filter"],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
GET /html_strip_preserve_analyzer/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
}
```
{% include copy-curl.html %}

In the response, the `<b>` and `<i>` tags have been retained, as specified in the custom analyzer request:

```json
{
  "tokens": [
    {
      "token": """
This is a <b>bold</b> and <i>italic</i> text.
""",
      "start_offset": 0,
      "end_offset": 52,
      "type": "word",
      "position": 0
    }
  ]
}
```
63 changes: 44 additions & 19 deletions _analyzers/character-filters/mapping-character-filter.md
---
layout: default
title: Mapping
parent: Character filters
nav_order: 120
---

# Mapping character filter

The `mapping` character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string of characters matching a key, it replaces them with the corresponding value. Replacement values can be empty strings.

The filter applies greedy matching: if multiple keys match at the same position in the input, the longest matching key is replaced.
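
For example, the following `_analyze` request (a minimal sketch; the mappings and sample text are illustrative) defines keys for both `I` and `II` and maps the hyphen to an empty string. Because matching is greedy, the input `II-I` is rewritten to `21`: the longer key `II` takes precedence over `I`, and the hyphen is removed:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "I => 1",
        "II => 2",
        "- => "
      ]
    }
  ],
  "text": "II-I"
}
```
{% include copy-curl.html %}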

The `mapping` character filter is useful when specific text replacements are required before tokenization.

## Example

The following request configures a `mapping` character filter that converts Roman numerals (such as I, II, or III) into their corresponding Arabic numerals (1, 2, and 3):

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "I => 1",
        "II => 2",
        "III => 3",
        "IV => 4",
        "V => 5"
      ]
    }
  ],
  "text": "I have III apples and IV oranges"
}
```
{% include copy-curl.html %}

The response contains a token in which the Roman numerals have been replaced with Arabic numerals:

```json
{
  "tokens": [
    {
      "token": "1 have 3 apples and 4 oranges",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}
```

## Parameters

You can use either of the following parameters to configure the key-value map.

| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `mappings` | Optional | Array | An array of key-value pairs in the format `key => value`. Each key found in the input text will be replaced with its corresponding value. |
| `mappings_path` | Optional | String | The path to a UTF-8 encoded file containing key-value mappings. Each mapping should appear on a new line in the format `key => value`. The path can be absolute or relative to the OpenSearch configuration directory. |
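
For example, a filter that loads its mappings from a file might look like the following sketch. The index name, filter name, and file name are illustrative; the file is assumed to exist at `analysis/abbreviations.txt` under the OpenSearch config directory, with one `key => value` mapping per line:

```json
PUT /file-mapping-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "file_mapping_filter": {
          "type": "mapping",
          "mappings_path": "analysis/abbreviations.txt"
        }
      }
    }
  }
}
```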

### Using a custom mapping character filter

You can create a custom mapping character filter by defining your own set of mappings. The following request creates a custom character filter that replaces common abbreviations in text:

```json
PUT /test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_abbreviation_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["custom_abbreviation_filter"]
        }
      },
      "char_filter": {
        "custom_abbreviation_filter": {
          "type": "mapping",
          "mappings": [
            "BTW => By the way",
            "IDK => I don't know",
            "FYI => For your information"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
GET /test-index/_analyze
{
  "analyzer": "custom_abbreviation_analyzer",
  "text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
}
```
{% include copy-curl.html %}

The response shows that the abbreviations have been replaced:

```json
{
  "tokens": [
    {
      "token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.",
      "start_offset": 0,
      "end_offset": 153,
      "type": "word",
      "position": 0
    }
  ]
}
```
