-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: handle race condition on cache retention #569
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -98,11 +98,16 @@ public RemovalListener<ChunkKey, Path> removalListener() { | |
return (key, path, cause) -> { | ||
try { | ||
if (path != null) { | ||
final long fileSize = Files.size(path); | ||
Files.delete(path); | ||
metrics.chunkDeleted(fileSize); | ||
log.trace("Deleted cached file for key {} with path {} from cache directory." | ||
+ " The reason of the deletion is {}", key, path, cause); | ||
if (Files.exists(path)) { | ||
final long fileSize = Files.size(path); | ||
Files.delete(path); | ||
Comment on lines
+101
to
+103
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Considering the racy nature of things here, a file may be deleted between the |
||
metrics.chunkDeleted(fileSize); | ||
log.trace("Deleted cached file for key {} with path {} from cache directory." | ||
+ " The reason of the deletion is {}", key, path, cause); | ||
} else { | ||
log.debug("Deletion of cached file for key {} with " | ||
+ "path {} is requested by file is not found", key, path); | ||
} | ||
} else { | ||
log.warn("Path not present when trying to delete cached file for key {} from cache directory." | ||
+ " The reason of the deletion is {}", key, cause); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -89,7 +89,8 @@ void metrics() throws IOException, JMException, StorageBackendException { | |
final DiskChunkCache diskChunkCache = new DiskChunkCache(chunkManager, time); | ||
diskChunkCache.configure(Map.of( | ||
"size", size1, // enough to put the first, but not both | ||
"path", baseCachePath.toString() | ||
"path", baseCachePath.toString(), | ||
"retention.ms", String.valueOf(Duration.ofSeconds(10).toMillis()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe the idea of the test was different. It was intended to test that only 1 chunk is deleted because the cache size reached the limit and because of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We have seen failing test cases (unfortunately too old to reference) that are not dependent on time, meaning that even if we increase the timeout to more than 30s it will not help: the eviction already failed (for some reason; could be the file not exist exception), and the metrics will not tick. |
||
)); | ||
|
||
diskChunkCache.getChunk(OBJECT_KEY_PATH, SEGMENT_MANIFEST, 0); | ||
|
@@ -118,28 +119,40 @@ void metrics() throws IOException, JMException, StorageBackendException { | |
assertThat(MBEAN_SERVER.getAttribute(objectName, "write-bytes-rate")) | ||
.isEqualTo(((double) (size1 + size2)) / METRIC_TIME_WINDOW_SEC); | ||
|
||
// because of the retention ms, it may be deleting cached values 1, 2 or both. | ||
await("Deletion happens") | ||
.atMost(Duration.ofSeconds(30)) // increase to reduce chance of flakiness | ||
.atMost(Duration.ofSeconds(30)) | ||
.pollDelay(Duration.ofMillis(100)) | ||
.pollInterval(Duration.ofMillis(100)) | ||
.until(() -> (double) MBEAN_SERVER.getAttribute(objectName, "delete-total") > 0); | ||
|
||
assertThat(MBEAN_SERVER.getAttribute(objectName, "delete-total")) | ||
.isEqualTo(1.0); | ||
.asInstanceOf(DOUBLE) | ||
.satisfiesAnyOf( | ||
deleteTotal -> assertThat(deleteTotal).isEqualTo(1), | ||
deleteTotal -> assertThat(deleteTotal).isEqualTo(2) | ||
); | ||
assertThat(MBEAN_SERVER.getAttribute(objectName, "delete-rate")) | ||
.isEqualTo(1.0 / METRIC_TIME_WINDOW_SEC); | ||
.satisfiesAnyOf( | ||
deleteTotalRate -> assertThat(deleteTotalRate).isEqualTo(1.0 / METRIC_TIME_WINDOW_SEC), | ||
deleteTotalRate -> assertThat(deleteTotalRate).isEqualTo(2.0 / METRIC_TIME_WINDOW_SEC) | ||
); | ||
|
||
assertThat(MBEAN_SERVER.getAttribute(objectName, "delete-bytes-total")) | ||
.asInstanceOf(DOUBLE) | ||
.satisfiesAnyOf( | ||
deleteBytesTotal -> assertThat(deleteBytesTotal).asInstanceOf(DOUBLE).isEqualTo(size1), | ||
deleteBytesTotal -> assertThat(deleteBytesTotal).asInstanceOf(DOUBLE).isEqualTo(size2) | ||
deleteBytesTotal -> assertThat(deleteBytesTotal).isEqualTo(size1), | ||
deleteBytesTotal -> assertThat(deleteBytesTotal).isEqualTo(size2), | ||
deleteBytesTotal -> assertThat(deleteBytesTotal).isEqualTo(size1 + size2) | ||
); | ||
assertThat(MBEAN_SERVER.getAttribute(objectName, "delete-bytes-rate")) | ||
.satisfiesAnyOf( | ||
deleteBytesRate -> assertThat(deleteBytesRate) | ||
.isEqualTo((double) size1 / METRIC_TIME_WINDOW_SEC), | ||
deleteBytesRate -> assertThat(deleteBytesRate) | ||
.isEqualTo((double) size2 / METRIC_TIME_WINDOW_SEC) | ||
.isEqualTo((double) size2 / METRIC_TIME_WINDOW_SEC), | ||
deleteBytesRate -> assertThat(deleteBytesRate) | ||
.isEqualTo((double) (size1 + size2) / METRIC_TIME_WINDOW_SEC) | ||
); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still failing to understand how it is possible that
path
will be null or file will not exist. I think it should not be the case or otherwise something is quite wrong IMO.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's a weird case.
The RemovalListener states:
This confirms that it could be a race condition by separate threads calling the removal. I also find weird that if there's a wining thread, the metric is not increased.
Also
We currently already handle the case where
path
/value is null by logging an error. This could be seen as an extension of handling the scenario where the referenced file is not found.We are still logging this to troubleshoot if/when this happens.