[BUG] Fetching source uses automata even for simple matching #17114

msfroh · 2025-01-24T18:30:21Z

Describe the bug

Many years ago (2016, IIRC) the code to fetch individual source fields (according to the _source argument in a search request) was changed to always use Lucene's built-in automaton matching logic to pick which source fields to return. This possibly makes sense if there are dotted paths to object subfields or if there are wildcard patterns.

I don't think it makes sense when there's just a list of field names that someone wants to retrieve. In that case, we should probably just stick them all in a HashSet and evaluate a contains() predicate to decide which fields to include in a response.

In particular, if there are a large number of fields (and those fields have long names), we end up generating a big union between linear automata. The resulting graph can have many states and many transitions, so Lucene ends up throwing a TooComplexToDeterminizeException.

Related component

Search:Performance

To Reproduce

1. Create an index with a lot of fields (a few thousand), with long field names.
2. Run a search request that fetches a lot of those fields (a few thousand) in the _source parameter.
3. Get a TooComplexToDeterminizeException.
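
For anyone trying to reproduce this, here is a rough sketch of a generator for the field names and the search body. The class name `ManyFieldsRepro`, the `field_<i>_` naming scheme, and the counts and lengths you pass in are all assumptions for illustration. The issue only says "a few thousand" fields with long names, and I have not verified which exact sizes trigger the exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical repro helper; not part of OpenSearch.
final class ManyFieldsRepro {

    // Builds `count` distinct field names, each padded with 'x' up to `nameLength` characters.
    // The "field_<i>_" prefix keeps the names distinct even after padding.
    static List<String> fieldNames(int count, int nameLength) {
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder("field_" + i + "_");
            while (sb.length() < nameLength) {
                sb.append('x');
            }
            names.add(sb.toString());
        }
        return names;
    }

    // Builds a search request body that lists every name in the _source array.
    // The names above contain no characters that need JSON escaping.
    static String searchBody(List<String> names) {
        String quoted = names.stream()
            .map(n -> "\"" + n + "\"")
            .collect(Collectors.joining(","));
        return "{\"_source\":[" + quoted + "],\"query\":{\"match_all\":{}}}";
    }

    public static void main(String[] args) {
        // Assumed sizes: 3000 fields, 100-character names.
        System.out.println(searchBody(fieldNames(3000, 100)));
    }
}
```

The same list of names can be used to build the index mapping (or just index one document containing all of them) before sending the generated body to the `_search` endpoint.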

Expected behavior

We shouldn't get an exception in the simple case.

(I think I'm okay with getting an exception when there are a lot of object subfields being requested or a bunch of wildcard patterns.)

msfroh · 2025-01-24T20:47:30Z

To clarify, I think we should add a special case to this method:

OpenSearch/server/src/main/java/org/opensearch/common/xcontent/support/XContentMapValues.java

Line 218 in 7c46f8f

    public static Function<Map<String, ?>, Map<String, Object>> filter(String[] includes, String[] excludes) {

Essentially, if the includes/excludes have no * or . characters, we just stick them in HashSets and return the fields that are in the includes but not the excludes.
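
A minimal sketch of what that special case might look like, using only the JDK. This is not the actual OpenSearch code: the class name `SimpleSourceFilter` and the helper `isSimple` are hypothetical, and the sketch assumes that an empty includes array means "include everything" and that the keys in the source map do not themselves contain dots. In the real method, the non-simple case would fall through to the existing automaton-based path instead of throwing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration of the proposed fast path; not the OpenSearch implementation.
final class SimpleSourceFilter {

    // True when no entry contains '*' or '.', i.e. every entry is a plain top-level field name.
    static boolean isSimple(String[] fields) {
        for (String field : fields) {
            if (field.indexOf('*') >= 0 || field.indexOf('.') >= 0) {
                return false;
            }
        }
        return true;
    }

    static Function<Map<String, ?>, Map<String, Object>> filter(String[] includes, String[] excludes) {
        if (!isSimple(includes) || !isSimple(excludes)) {
            // Placeholder: the real method would use the existing automaton logic here.
            throw new UnsupportedOperationException("wildcards or dotted paths need the automaton path");
        }
        Set<String> included = new HashSet<>(Arrays.asList(includes));
        Set<String> excluded = new HashSet<>(Arrays.asList(excludes));
        return source -> {
            Map<String, Object> result = new LinkedHashMap<>();
            for (Map.Entry<String, ?> entry : source.entrySet()) {
                String key = entry.getKey();
                // Assumption: an empty include list matches every field.
                boolean wanted = included.isEmpty() || included.contains(key);
                if (wanted && !excluded.contains(key)) {
                    result.put(key, entry.getValue());
                }
            }
            return result;
        };
    }
}
```

Each top-level field costs two hash lookups, so the work grows linearly with the number of fields in the document and no automaton is built at all. Object values are copied whole, so including or excluding a plain name applies to the entire subtree under it.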

hye-on · 2025-01-25T04:36:32Z

Hi @msfroh Can I take this issue? :)

msfroh added the bug and untriaged labels Jan 24, 2025

github-actions bot added the Search:Performance label Jan 24, 2025

github-project-automation bot added this to Search Project Board Jan 24, 2025

github-project-automation bot moved this to 🆕 New in Search Project Board Jan 24, 2025

msfroh added the good first issue label and removed the Search:Performance label Jan 24, 2025

msfroh assigned hye-on Jan 25, 2025

hye-on linked a pull request Jan 28, 2025 that will close this issue

Add HashSet based filtering optimization to XContentMapValues #17160

Open

3 tasks

sandeshkr419 removed the untriaged label Jan 29, 2025
