-
Notifications
You must be signed in to change notification settings - Fork 361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CELEBORN-1760] OOM causes disk buffer unable to be released #2975
Conversation
|
@RexXiong @AngersZhuuuu PTAL. |
@leixm Can you provide some memory metrics with this PR after a worker is OOM? |
|
It seems that the diskBufferCount has decreased, but the total off-heap memory still cannot be released. |
The catch Exception here will cause fileWriter.decrementPendingWrites() to not be called, but I am not sure whether it will indirectly cause the off-heap memory to not be released. I will investigate further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The off-heap memory is not released if the disk buffer counter is decreased. This PR failed to complete its purpose. There is more to investigate to find out what happened before merging into main. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More information is needed.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/PartitionDataWriter.java
Show resolved
Hide resolved
There are three problems in total
|
@@ -742,7 +742,12 @@ private[celeborn] class Worker( | |||
synchronized { | |||
val expiredApplicationIds = new JHashSet[String]() | |||
expiredShuffleKeys.asScala.foreach { shuffleKey => | |||
partitionLocationInfo.removeShuffle(shuffleKey) | |||
partitionLocationInfo.removeShuffle(shuffleKey).foreach { loc => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StorageManager will remove expired shuffle keys which will destroy the writerd and return buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@leixm would you mind updating the PR description to summarize the conclusion? |
@@ -416,7 +416,13 @@ public void write(ByteBuf data) throws IOException { | |||
} | |||
|
|||
data.retain(); | |||
flushBuffer.addComponent(true, data); | |||
try { | |||
flushBuffer.addComponent(true, data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the non-memory shuffle file storage type requires a call to releaseDiskBuffer?
Done, |
ping @FMX |
### What changes were proposed in this pull request? When OOM occurs in flushBuffer.addComponent, there are two problems. 1. decrementPendingWrites is not called, causing close PartitionDataWriter to be stuck for a period of time during commit. 2. After OOM occurs, ByteBuf is not released, causing memory leaks. ### Why are the changes needed? Fix disk buffer unable to be released ### Does this PR introduce _any_ user-facing change? Yes, it fixes a memory leak issue in some corner cases. ### How was this patch tested? I did some tests. 2 of the nodes did not have this PR, and the memory of these two nodes could not be released. 1 node had this PR, and the memory could be released. It was obviously much lower than the previous 2 nodes. ![image](https://github.com/user-attachments/assets/3fe846ec-6ee8-432a-be7a-a7efb7c102d0) Closes #2975 from leixm/CELEBORN-1760. Authored-by: Xianming Lei <[email protected]> Signed-off-by: Shuang <[email protected]> (cherry picked from commit 372ef79) Signed-off-by: Shuang <[email protected]>
Thanks merge to main(v0.6.0) and branch-0.5(v0.5.3) |
What changes were proposed in this pull request?
When OOM occurs in flushBuffer.addComponent, there are two problems.
Why are the changes needed?
Fix disk buffer unable to be released
Does this PR introduce any user-facing change?
Yes, it fixes a memory leak issue in some corner cases.
How was this patch tested?
I did some tests. 2 of the nodes did not have this PR, and the memory of these two nodes could not be released. 1 node had this PR, and the memory could be released. It was obviously much lower than the previous 2 nodes.