[BUG] [Analyzer] [Token Filters] pattern_capture loses diacritic sign from beginning and the end of the word #17111

Open
Roboteus opened this issue Jan 24, 2025 · 0 comments
Labels: bug, Indexing, untriaged

Comments

Roboteus commented Jan 24, 2025

Describe the bug

This appears to be specific to OpenSearch: I checked, and Elasticsearch does not seem to have this issue. Unfortunately, I compared against an old version...

PUT someindex
{
	"settings": {
		"analysis": {
			"char_filter": {
				"text_number_cleaner": {
					"type": "pattern_replace",
					"pattern": "[^\\p{L}0-9-]",
					"replacement": " "
				}
			},
			"filter": {
				"capture_words": {
					"type": "pattern_capture",
					"patterns": [
						"(\\b\\p{L}+\\b)"
					]
				}
			},
			"analyzer": {
				"my_analyzer": {
					"type": "custom",
					"char_filter": [
						"text_number_cleaner"
					],
					"tokenizer": "keyword",
					"filter": [
						"lowercase",
						"capture_words"
					]
				}
			}
		}
	}
}

Now test the analyzer:

GET someindex/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "źryj-jótro-kupę"
}

Related component

Indexing

To Reproduce

  1. Create index with custom analyzer described in this ticket
  2. Test the analyzer with words containing diacritic signs
  3. Check results

Actual result

Tokens created by pattern_capture are: [ ryj , jótro , kup ]
It looks like words that start or end with a diacritic character have that character trimmed off.
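
The trimmed tokens are what one would get if `\b` treated only ASCII letters, digits and underscore as word characters, so that `ź` and `ę` at the edges do not count as part of a word. This is only a guess and has not been confirmed as the cause in OpenSearch. A minimal standalone Java sketch, assuming the filter applies the pattern roughly the way `java.util.regex` does (the class name `WordBoundaryCheck` and the helper `capture` are made up for illustration), compares the default flags with `Pattern.UNICODE_CHARACTER_CLASS`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryCheck {
    // Hypothetical helper: collect group 1 of every match,
    // roughly what a capture-style token filter would emit.
    static List<String> capture(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "źryj-jótro-kupę", written with escapes so the letters are precomposed.
        String input = "\u017Aryj-j\u00F3tro-kup\u0119";
        String regex = "(\\b\\p{L}+\\b)";

        // Default flags: the result depends on how the running JDK defines \b.
        System.out.println(capture(Pattern.compile(regex), input));

        // UNICODE_CHARACTER_CLASS makes \b and \w follow Unicode definitions.
        Pattern unicode = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);
        System.out.println(capture(unicode, input));
    }
}
```

With `UNICODE_CHARACTER_CLASS` set, the second line printed is `[źryj, jótro, kupę]`. If the first line shows `[ryj, jótro, kup]` on the JDK that OpenSearch 2.18 runs on, that would point at the word-boundary handling rather than at `\p{L}`.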

Expected behavior

[ źryj , jótro , kupę ]

Additional Details

Host/Environment (please complete the following information):

  • Version [2.18]

Roboteus added the bug and untriaged labels Jan 24, 2025
github-actions bot added the Indexing label Jan 24, 2025
Roboteus changed the title from "[BUG] [Analyzer] [Token Filters] pattern_capture loses diacritic sign from beginning and end of the word" to "[BUG] [Analyzer] [Token Filters] pattern_capture loses diacritic sign from beginning and the end of the word" Jan 24, 2025