Add thread to periodically perform pending cache maintenance #2308

Open · wants to merge 4 commits into base: main

Conversation

@owenhalpert owenhalpert commented Dec 3, 2024

Description

Adds a ScheduledExecutor class that, for each cache, takes in a Runnable with a call to cleanUp and executes it periodically, at an interval determined by KNN_CACHE_ITEM_EXPIRY_TIME_MINUTES or QUANTIZATION_STATE_CACHE_EXPIRY_TIME_MINUTES respectively. This performs any pending maintenance (such as evicting expired entries) that was previously only performed when the cache was accessed. The maintenance thread is created whenever a NativeMemoryCache or QuantizationStateCache is instantiated or rebuilt, and can be shut down with either class's close method. Relevant cleanup logic was also added to some testing base classes.
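
For illustration, a minimal sketch of how a cache class might wire this up, assuming the ScheduledExecutor(Runnable task, long scheduleMillis) constructor that appears later in this PR; the class, field, and cache types below are illustrative only, not the PR's actual code:

import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

class ExampleCache implements Closeable {
    private final Cache<String, byte[]> cache;
    private final ScheduledExecutor cacheMaintainer;

    ExampleCache(long expiryMinutes) {
        this.cache = CacheBuilder.newBuilder()
            .expireAfterAccess(expiryMinutes, TimeUnit.MINUTES)
            .build();
        // Periodically run Guava's pending maintenance (e.g. evicting expired entries)
        // instead of relying on reads/writes to the cache to trigger it.
        this.cacheMaintainer = new ScheduledExecutor(cache::cleanUp, TimeUnit.MINUTES.toMillis(expiryMinutes));
    }

    @Override
    public void close() throws IOException {
        // Closing the cache also stops its maintenance thread.
        cacheMaintainer.close();
    }
}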

Related Issues

Resolves #2239

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kotwanikunal
Member

@owenhalpert The changes look good in general. What I'd be interested in is testing under load on a resource-constrained system - can you verify whether this adds latency or impacts performance in any way?

We did implement a force evict before writes with #2015. Can you also enable this feature flag and run the above tests to ensure it behaves well?

 * for more details. Thus, to perform any pending maintenance, the cleanUp method will be called periodically from a CacheMaintainer instance.
 */
public class CacheMaintainer<K, V> implements Closeable {
    private final Cache<K, V> cache;
Member

You can also avoid maintaining the Cache object reference here by using a functional interface. That would also get rid of the generification of this class.

Simply pass and store the runnable reference instead of the cache, as new CacheMaintainer(() -> cache.cleanUp());

Possibly also move this class to the util package and call it a ScheduledExecutor with Runnable ref and interval as parameters.

public class ScheduledExecutor implements Closeable {

...

  public ScheduledExecutor(Runnable reference, long scheduleMillis) {
  ...
  }

  ...
}
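
A filled-in version of that skeleton could look roughly like the following; this is a sketch only, assuming a fixed-rate schedule and a plain shutdown on close, and may differ from the code ultimately merged in this PR:

import java.io.Closeable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScheduledExecutor implements Closeable {
    private final ScheduledExecutorService executor;

    public ScheduledExecutor(Runnable task, long scheduleMillis) {
        this.executor = Executors.newSingleThreadScheduledExecutor();
        // Repeatedly run the supplied task (e.g. cache::cleanUp). An uncaught exception
        // thrown by the task suppresses further executions, so the task itself should
        // catch and log failures.
        executor.scheduleAtFixedRate(task, scheduleMillis, scheduleMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        executor.shutdown();
    }
}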

Author

@kotwanikunal what do you think about creating the executor and calling scheduleAtFixedRate within each cache class instead of creating a new ScheduledExecutor class?
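
For reference, the inlined alternative being floated here might look roughly like this; the class and field names are illustrative only, not code from the PR:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

class InlineMaintenanceCache implements AutoCloseable {
    private final Cache<String, byte[]> cache = CacheBuilder.newBuilder()
        .expireAfterAccess(60, TimeUnit.MINUTES)
        .build();
    // The cache class owns its scheduler directly; no separate ScheduledExecutor class.
    private final ScheduledExecutorService maintenanceExecutor = Executors.newSingleThreadScheduledExecutor();

    InlineMaintenanceCache(long intervalMinutes) {
        maintenanceExecutor.scheduleAtFixedRate(cache::cleanUp, intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
    }

    @Override
    public void close() {
        maintenanceExecutor.shutdown();
    }
}

The tradeoff is duplicating the scheduling boilerplate in each cache class versus one extra utility class; the PR as described keeps the separate ScheduledExecutor.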

@owenhalpert
Author

@kotwanikunal I've completed the benchmarking on a single-node cluster limited to 3 GB of memory.

Before benchmarking my code, I validated that the CacheMaintainer was actually running by inspecting the OpenSearch process status in the Docker container: after about a minute, the RssAnon value would decrease by about 0.5 GB without any interaction on my part, signaling that the maintainer successfully cleaned up the expired entries. This is aligned with what I saw from the standard maintenance on the clean code (which I triggered manually by accessing the cache).

Results below:

Performance Summary for Resource-Constrained Testing (3GB Memory Limit)

Run 1: Clean 2.18 code

  • Indexing:
    • p50 latency: 16.96 ms
    • p90 latency: 32.47 ms
  • Search:
    • p50 latency: 330.48 ms
    • p90 latency: 437.78 ms

Run 2: Clean 2.18 code, Force evict ON

  • Indexing:
    • p50 latency: 17.27 ms
    • p90 latency: 33.82 ms
  • Search:
    • p50 latency: 337.43 ms
    • p90 latency: 428.70 ms

Run 3: PR changes added, Force evict ON

  • Indexing:
    • p50 latency: 16.89 ms
    • p90 latency: 32.24 ms
  • Search:
    • p50 latency: 341.38 ms
    • p90 latency: 414.06 ms

Run 4: PR changes added, Force evict OFF

  • Indexing:
    • p50 latency: 17.43 ms
    • p90 latency: 34.63 ms
  • Search:
    • p50 latency: 346.83 ms
    • p90 latency: 430.04 ms

This suggests there is no significant impact on latency with my code changes. I've included the full results of these test runs here:

https://gist.github.com/owenhalpert/05ad4f5ae9577f717f2c59f2039d52e4

@owenhalpert owenhalpert changed the title Add CacheMaintainer class to perform pending cache maintenance every minute Add thread to perform pending cache maintenance every minute Dec 19, 2024
@owenhalpert owenhalpert force-pushed the cache-maintenance branch 2 times, most recently from dc61f6e to 1ca8dff on December 19, 2024 19:20
@owenhalpert owenhalpert marked this pull request as ready for review December 19, 2024 20:09
@owenhalpert owenhalpert changed the title Add thread to perform pending cache maintenance every minute Add thread to periodically perform pending cache maintenance Dec 21, 2024
Member

@jmazanec15 jmazanec15 left a comment


Minor comments, overall looks good

try {
    cache.cleanUp();
} catch (Exception e) {
    logger.error("Error cleaning up cache", e);
Member

Why swallow exception here?

Author

@owenhalpert owenhalpert Dec 23, 2024

Any exceptions from Guava cache operations would otherwise halt our scheduled executor. If an exception occurs here, it would be from Guava's internals rather than our logic, so, per Dooyong's suggestion above, we can log it and continue scheduling cleanup tasks.

This way we can ensure the cache maintenance keeps running even if individual cleanup attempts fail, and we can monitor Guava errors in the logs.

 */
public ScheduledExecutor(Runnable task, long scheduleMillis) {
    this.task = task;
    this.executor = Executors.newSingleThreadScheduledExecutor();
Collaborator

@shatejas shatejas Dec 24, 2024

This seems to be creating a new thread for each instance; I wanted to understand the thought process behind this.

I understand that this is used for the cache, so the threads won't grow exponentially, but since it's in the util package and the name is generic, any misuse of this can create an unbounded number of threads with each new instance. Can we be a bit more defensive here?

The suggestion here is to refactor and make sure there is a fixed number of threads for cache cleanup across caches.
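
One possible shape for that suggestion is a single shared scheduler that every cache registers its cleanup task with; the class and method names below are illustrative only and not part of this PR:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the "fixed number of threads" idea: one shared single-threaded scheduler
// for all cache-maintenance tasks, with each cache holding only a handle to its task.
public final class SharedCacheMaintenanceScheduler {
    private static final ScheduledExecutorService SCHEDULER = Executors.newSingleThreadScheduledExecutor();

    private SharedCacheMaintenanceScheduler() {}

    // Register a periodic cleanup task; the returned future can be cancelled when the
    // owning cache is closed or rebuilt, without shutting down the shared scheduler.
    public static ScheduledFuture<?> schedule(Runnable cleanupTask, long intervalMillis) {
        return SCHEDULER.scheduleAtFixedRate(cleanupTask, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }
}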
