fix: handle race condition on cache retention #569
base: main
Conversation
Force-pushed from c306da3 to f716f6b.
Sorry, I'm failing to understand what kind of race condition you are talking about. Could you please clarify?
@AnatolyPopov ofc, sorry it wasn't explained properly. I have added more details to the description. This PR at least tries to fix one of the known (now) causes of flaky failing tests:
I wonder why this can happen at all. This basically means that the listener is running for a specific (key, value) pair multiple times, if I understand correctly. Or is it a test-only thing and the test itself cleans the file?
@AnatolyPopov I have refactored the test to use time-based eviction and get more consistent results (before, it tested whether either value 1 or 2 was deleted; now it tests whether 1, 2, or both are deleted). I have separated the exception handling for the missing file, as it's nice to have but it doesn't fix the flakiness completely. The refactoring of the test is what attempts to fix the flakiness. These are two separate commits now. PTAL
@@ -89,7 +89,8 @@ void metrics() throws IOException, JMException, StorageBackendException {
     final DiskChunkCache diskChunkCache = new DiskChunkCache(chunkManager, time);
     diskChunkCache.configure(Map.of(
         "size", size1, // enough to put the first, but not both
-        "path", baseCachePath.toString()
+        "path", baseCachePath.toString(),
+        "retention.ms", String.valueOf(Duration.ofSeconds(10).toMillis())
I believe the idea of the test was different. It was intended to test that only 1 chunk is deleted because the cache size reached the limit, and because of Window TinyLfu it is not possible to say whether the first or the second chunk will be deleted. By specifying retention.ms, I believe we are missing the case where the metrics are reported correctly for a single chunk, since there is a high chance that both will be deleted.
We have seen failing test cases (unfortunately too old to reference) that are not dependent on time, meaning that even increasing the timeout beyond 30s would not help: the eviction had already failed (for some reason; it could be the file-not-found exception), and the metrics would never tick.
This test is not meant to check whether the deletion itself works properly (I think that's out of the scope of this test), but to check that the deletion metric is reported.
So expanding the cache configuration to trigger deletion in a more consistent way seems like a fair trade to remove flakiness (otherwise we wait for the flaky test to trigger and hope to get enough logs to troubleshoot -- that hasn't been the case: CI logs only show "test failed" without including logs; maybe something else to fix?).
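To make the intended mechanism concrete, here is a minimal, hypothetical sketch of time-based eviction with Caffeine; the mapping of the builder options to the chunk cache's "size" and "retention.ms" settings is an assumption for illustration, not the project's actual wiring. A size bound alone leaves it to the Window TinyLfu policy which entry is evicted, while a write expiry plus a scheduler guarantees both entries are eventually removed, so the removal listener fires within a bounded time.

import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;
import com.github.benmanes.caffeine.cache.Scheduler;

public class TimeBasedEvictionSketch {
    public static void main(final String[] args) throws InterruptedException {
        final Cache<String, byte[]> cache = Caffeine.newBuilder()
            .maximumWeight(10)                               // plays the role of the "size" setting (assumed)
            .weigher((String key, byte[] value) -> value.length)
            .expireAfterWrite(Duration.ofSeconds(2))         // plays the role of "retention.ms" (assumed)
            .scheduler(Scheduler.systemScheduler())          // expire on time even without further cache activity
            .removalListener((String key, byte[] value, RemovalCause cause) ->
                System.out.printf("removed %s because %s%n", key, cause))
            .build();

        // Two entries of weight 8 exceed the weight limit of 10, so one is evicted (cause SIZE),
        // but Window TinyLfu decides which one; the write expiry then removes the survivor (cause EXPIRED).
        cache.put("first", new byte[8]);
        cache.put("second", new byte[8]);
        Thread.sleep(3_000);   // give the scheduler time to expire the remaining entry
    }
}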
metrics.chunkDeleted(fileSize);
log.trace("Deleted cached file for key {} with path {} from cache directory."
    + " The reason of the deletion is {}", key, path, cause);
if (Files.exists(path)) {
I'm still failing to understand how it is possible that path will be null or the file will not exist. I think it should not be the case, or otherwise something is quite wrong IMO.
I agree it's a weird case.
The RemovalListener documentation states: "An instance may be called concurrently by multiple threads to process different entries."
This confirms that it could be a race condition between separate threads calling the removal. I also find it weird that, if there is a winning thread, the metric is not increased.
Also: "Implementations of this interface should avoid performing blocking calls or synchronizing on shared resources."
We already handle the case where path/value is null by logging an error. This could be seen as an extension of that handling, covering the scenario where the referenced file is not found.
We are still logging this so we can troubleshoot if/when it happens.
Finally, some additional evidence that this test is flaky:
While testing the flakiness of the disk-based cache delete metrics, it was found that eviction may happen for different reasons and the path may already be deleted. To avoid a runtime exception (which is swallowed by the listener execution anyway), this PR introduces some validation before checking the size.
To reduce the flakiness where deletion is not executed consistently based on size, a time-based eviction configuration is added so that either 1 or 2 deletions happen while the test is running, and the results are validated accordingly. Before, the test only checked whether the first or the second value was deleted; now it checks whether 1, 2, or both are deleted. To validate the fix for flakiness, @RepeatedTest(100000) was used, and it now passes consistently.
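For illustration only, here is a hedged sketch of what the relaxed check amounts to; the class, method, and metric accessor below are hypothetical stand-ins (the real test reads the metric via JMX), not the project's actual test code.

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.RepeatedTest;

class DeletionMetricAssertionSketch {

    // Hypothetical stand-in for reading the chunk-deletion metric (the real test reads it via JMX).
    private long deletedChunks() {
        return 1L;
    }

    // With time-based retention, one or both chunks may already be evicted by the time the
    // metric is read, so the test accepts either outcome instead of asserting which chunk went away.
    @RepeatedTest(100)
    void deletionMetricIsReported() {
        final long deleted = deletedChunks();
        assertTrue(deleted == 1 || deleted == 2,
            "expected the deletion metric to report one or both chunks, got " + deleted);
    }
}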
Also, the same test but for the memory-based cache is failing on main: https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/actions/runs/10283516010/job/28457507975 Managed to reproduce locally with:
Where the condition is:
await()
    .atMost(Duration.ofSeconds(30))
    .pollDelay(Duration.ofSeconds(2))
    .pollInterval(Duration.ofMillis(10))
    .until(() -> !mockingDetails(removalListener).getInvocations().isEmpty());
Force-pushed from d3d29a0 to 615a0af.
Similar to the disk-based test: as deletion does not happen consistently, leading to flakiness, this commit adds time-based retention to force deletion and validate the metrics.
Force-pushed from 4ecdfb7 to 01deb75.
if (Files.exists(path)) {
    final long fileSize = Files.size(path);
    Files.delete(path);
Considering the racy nature of things here, a file may be deleted between the Files.exists check and Files.delete. I think we should instead try to delete without a check and catch the "file not found" exception. "It is better to ask forgiveness than permission" :)
Cache removal listener-related tests (DiskChunkCacheMetricsTest and MemorySegmentIndexesCacheTest) are flaky. Recent evidence:
To reproduce this locally, @RepeatedTest(10000) has been used. The failure is caused by the timeout condition when waiting for a cache entry to be removed:
Waiting for the RemovalListener to be called right after inserting a couple of entries does not seem to be deterministic, and a retention.ms time boundary is needed to get the removal called within the time frame of the test (the default retention.ms is 10 minutes).
As a separate finding, while running this locally, I spotted a file-not-found exception after a few thousand runs:
There seem to be multiple calls to this listener happening concurrently, causing this behavior (the first caller wins, and the next one does not find the file), so additional handling has been added. At runtime this exception is swallowed by the listener execution, so this is mostly to have better logging when it happens.
This seems to be expected looking at the Caffeine docs.
The RemovalListener documentation states: "An instance may be called concurrently by multiple threads to process different entries."
Also: "Implementations of this interface should avoid performing blocking calls or synchronizing on shared resources."
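To make the concurrent-invocation scenario concrete, here is a hypothetical sketch of a file-backed RemovalListener that tolerates the file having already been removed; the key type, logging, and metric counter are illustrative assumptions, not the project's actual code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

import com.github.benmanes.caffeine.cache.RemovalCause;
import com.github.benmanes.caffeine.cache.RemovalListener;

public class TolerantFileRemovalListener implements RemovalListener<String, Path> {

    // stand-in for the real metrics object; only the invocation that actually deletes the file updates it
    private final AtomicLong deletedBytes = new AtomicLong();

    @Override
    public void onRemoval(final String key, final Path path, final RemovalCause cause) {
        if (path == null) {
            // the value may be null; this case is already handled today by logging an error
            System.err.printf("No path for key %s (cause %s)%n", key, cause);
            return;
        }
        try {
            final long fileSize = Files.size(path);
            Files.delete(path);
            deletedBytes.addAndGet(fileSize);
            System.out.printf("Deleted %s (%d bytes), cause %s%n", path, fileSize, cause);
        } catch (final NoSuchFileException e) {
            // a concurrent invocation won the race and removed the file first; log for troubleshooting
            System.err.printf("File %s already deleted (cause %s)%n", path, cause);
        } catch (final IOException e) {
            System.err.printf("Failed to delete %s: %s%n", path, e);
        }
    }
}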