Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Derived Fields feature causes search query performance regression on indices with many field mappings #16603

Open
jicohen opened this issue Nov 10, 2024 · 7 comments
Labels
bug Something isn't working Search:Performance

Comments

@jicohen
Copy link

jicohen commented Nov 10, 2024

Describe the bug

When upgrading to 2.15.0 we experienced a significant search query performance regression in our test suite. After comparing profiling output from the same test in 2.14.0, it turned out to be the result of the new 'derived fields' feature. Specifically, it appears that for every field referenced in a query it iterates over every field defined in the index mappings. Until a few days ago each iteration performed an expensive string comparison.

    @Override
    public Set<String> resolvePattern(String pattern) {
        Set<String> derivedFields = new HashSet<>();
        if (queryShardContext != null && queryShardContext.getMapperService() != null) {
            for (MappedFieldType fieldType : queryShardContext.getMapperService().fieldTypes()) {
                if (Regex.simpleMatch(pattern, fieldType.name()) && fieldType instanceof DerivedFieldType) {
                    derivedFields.add(fieldType.name());
                }
            }
        }
        for (String fieldName : derivedFieldTypeMap.keySet()) {
            if (Regex.simpleMatch(pattern, fieldName)) {
                derivedFields.add(fieldName);
            }
        }
        return derivedFields;
    }

public Set<String> resolvePattern(String pattern) {

Our use case is a little unusual in that we have a large number (~10,000) of fields defined in our mappings and we issue large queries (~150 clauses). This issue added over a second of overhead to each query we performed (~7 ms per clause) in our testing. This slowdown was present despite not using the derived field type feature at all - each referenced field in the query would cause an iteration through the entire set of mapped fields. This issue was partially addressed on the main branch a few days ago as mentioned above - by short circuiting the expensive string comparison operation if the field mapping is not a derived field type. However, the iteration itself is still present. It should be possible to precompute the set of derived fields outside of the context of a particular query (e.g. add a method MapperService.derivedFieldTypes()) and iterate over only those and not every mapped field in the index.

As a workaround, for now we have disabled the 'derived fields' feature which replaces the DefaultDerivedFieldResolver with the NoOpDerivedFieldResolver.

Related component

Search:Performance

To Reproduce

  1. On a 2.14.0 and 2.15.0 cluster do the following
  2. Create an index with a large (e.g. ~10,000) number of field mappings
  3. Issue a large boolean query with match clauses on many fields (e.g. 200)
  4. Observe the slowness of the query on 2.15.0 vs 2.14.0
  5. Run the OpenSearch clusters in a profiler to observe where the extra overhead is coming from

Expected behavior

I would expect search query performance for the scenario described above (large number of defined field mappings and large number of query clauses in the search query) to perform comparably in the newer releases of OpenSearch as it did in 2.14.0.

Additional Details

Plugins

Host/Environment (please complete the following information):

  • OS: Ubuntu
  • Version 2.15.0
@dbwiddis
Copy link
Member

dbwiddis commented Nov 11, 2024

Relevant PR: #13720
CC: @rishabhmaurya @msfroh

@msfroh
Copy link
Collaborator

msfroh commented Nov 11, 2024

@rishabhmaurya -- maybe we could define an implicit index setting that uses the NoOpDerivedFieldResolver unless at least one derived field is in the mapping? (That is, we set a Boolean flag at the index level that flips to true if/when a derived field is added to the mapping. Alternatively, if the iteration can only iterate over derived fields, it should be a no-op if no derived fields exist.)

What do you think?

@sandeshkr419
Copy link
Contributor

@rishabh6788 Do we have something in OSB to capture/monitor this?

@rishabhmaurya
Copy link
Contributor

@jicohen thanks for reporting.
@msfroh sounds like a plan

@rishabh6788
Copy link
Contributor

@rishabh6788 Do we have something in OSB to capture/monitor this?

At this point I don't think there is any workload that supports this use-case.

@rishabhmaurya
Copy link
Contributor

related to #16564

@msfroh
Copy link
Collaborator

msfroh commented Dec 6, 2024

@rishabhmaurya -- what do you think of changing this line:

if (derivedFieldAllowed) {
return new DefaultDerivedFieldResolver(queryShardContext, derivedFieldsObject, derivedFields);
} else {
return new NoOpDerivedFieldResolver();
}

to:

        if (derivedFieldAllowed && derivedFieldsPresent) {
            return new DefaultDerivedFieldResolver(queryShardContext, derivedFieldsObject, derivedFields);
        } else {
            return new NoOpDerivedFieldResolver();
        }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Search:Performance
Projects
Status: 🆕 New
Development

No branches or pull requests

6 participants