Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] Term Query documentation does not warn of performance impact of case_insensitive searches #9028

Open
1 of 4 tasks
dbwiddis opened this issue Jan 7, 2025 · 7 comments
Assignees
Labels
1 - Backlog - DEV Developer assigned to issue is responsible for creating PR.

Comments

@dbwiddis
Copy link
Member

dbwiddis commented Jan 7, 2025

What do you want to do?

  • Request a change to existing documentation
  • Add new documentation
  • Report a technical problem with the documentation
  • Other

Tell us about your request. Provide a summary of the request.

The documentation for term query includes the case_insensitive parameter.

Due to the implementation details of this type of search, every (alphabetic) character in such a query doubles the complexity of the search, consuming a lot of heap memory and potentially crashing nodes due to high CPU with GC thrashing. Even a relatively short search term (about 16 characters) could result in nearly 8 GB of heap.

This potential impact should be highlighted in the documentation as a warning. Additionally there are preferred strategies for doing case insensitive searches that should be presented as an alternative in the docs.

I'm happy to write such docs but would hope to get some assistance from @msfroh in validating them technically.

Version: List the OpenSearch version to which this issue applies, e.g. 2.14, 2.12--2.14, or all.

all

@msfroh
Copy link
Contributor

msfroh commented Jan 7, 2025

We may also want to treat this as a bug in OpenSearch.

Specifically, we should set a value for maxDeterminizedStates in the AutomatonQueries class, like here: https://github.com/opensearch-project/OpenSearch/blob/ad7ce4cd446523d52286530ccfbb1dabc2dc7f86/server/src/main/java/org/opensearch/common/lucene/search/AutomatonQueries.java#L88

Elsewhere -- like in RegexpQueryBuilder, we default to setting max_determinized_states to 10000 (by referencing the constant defined in Lucene as Operations.DEFAULT_DETERMINIZE_WORK_LIMIT).

IMO, we should:

  1. deprecate all of the existing caseInsensitive*Query methods in AutomatonQueries,
  2. replace them with calls that specify maxDeterminizedStates,
  3. add max_determinized_states as a query parameter to any query type that may generate an automaton query, and
  4. make the default max_determinized_states value for those query types Integer.MAX_VALUE on the 2.x branch (for dangerous backward compatibility) and Operations.DEFAULT_DETERMINIZE_WORK_LIMIT (i.e. 10000) on the main branch so we're safe by default on 3.0.

That way, folks using 2.19 can at least safeguard themselves by explicitly setting max_determinized_states to something reasonable, while 3.0 is safe by default (but we let users risk shooting themselves in the foot if they explicitly ask to).

@msfroh
Copy link
Contributor

msfroh commented Jan 7, 2025

I'm going to turn my previous message into an issue for https://github.com/opensearch-project/OpenSearch

@msfroh
Copy link
Contributor

msfroh commented Jan 7, 2025

@Naarcha-AWS Naarcha-AWS added 1 - Backlog - DEV Developer assigned to issue is responsible for creating PR. and removed untriaged labels Jan 10, 2025
@Naarcha-AWS
Copy link
Collaborator

@dbwiddis: Assigned this to you.

@dbwiddis
Copy link
Member Author

@msfroh do you think we'll address your bug before 2.19.0? That may change how I write this.

@msfroh
Copy link
Contributor

msfroh commented Jan 11, 2025

@msfroh do you think we'll address your bug before 2.19.0? That may change how I write this.

I'll try to get a fix in for next week. I'd really like to have the opt-in safety in 2.19 before making it opt-out in 3.0.

@dbwiddis
Copy link
Member Author

I'll track that change and do the doc PR for it here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - Backlog - DEV Developer assigned to issue is responsible for creating PR.
Projects
None yet
Development

No branches or pull requests

3 participants