-
Notifications
You must be signed in to change notification settings - Fork 355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CELEBORN-888][WORKER]Tweak the logic and add unit tests for the MemoryManager#currentServingState method #1811
Conversation
memoryPressureListener -> | ||
memoryPressureListener.onPause(TransportModuleConstants.REPLICATE_MODULE)); | ||
trimAllListeners(); | ||
// disabled trim thread in test case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is hacky, and why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During the unit test, we want to trigger the threshold for each state, in order to avoid this thread, reset the current service state.
For example, we set sortMemory to trigger PAUSE_PUSH state, we don't want this thread trim memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes much sense. Instead, memoryPressureListeners
may be a good joint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the test scenario, memoryPressureListeners shall be empty. The trim won't do anything.
Any details? What problem do you find in the current implementation? |
The main thing is to add unit tests to ensure that this piece of logic is changed by mistake. The modification of the method is to reduce judgments and avoid making unnecessary comparisons before the method returns. |
As sbt unit test fail, WIP @cfmcgrady |
It occurred randomly, and I'm still investigating. Let's temporarily disregard it. |
Fixed in #1818. cc @zwangsheng |
Thanks, i'll rebase and retest. |
…ryManager#currentServingState method
57cad86
to
8dd8a30
Compare
@@ -42,4 +41,42 @@ class MemoryManagerSuite extends AnyFunSuite with BeforeAndAfterEach { | |||
caught.getMessage == s"Invalid config, ${WORKER_DIRECT_MEMORY_RATIO_PAUSE_REPLICATE.key}(0.85) " + | |||
s"should be greater than ${WORKER_DIRECT_MEMORY_RATIO_PAUSE_RECEIVE.key}(0.95)") | |||
} | |||
|
|||
test("[CELEBORN-888] Test MemoryManager#currentServingState trigger case") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case looks good to me.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java
Show resolved
Hide resolved
Codecov Report
@@ Coverage Diff @@
## main #1811 +/- ##
==========================================
+ Coverage 46.43% 46.45% +0.02%
==========================================
Files 163 163
Lines 10135 10135
Branches 934 934
==========================================
+ Hits 4705 4707 +2
+ Misses 5118 5117 -1
+ Partials 312 311 -1 see 3 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
memoryPressureListener.onPause(TransportModuleConstants.REPLICATE_MODULE)); | ||
trimAllListeners(); | ||
// disabled trim thread in test case | ||
if (!Boolean.valueOf(conf.get("celeborn.testing", "false"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that in your test scenario the memory pressure listeners are empty so maybe this check is unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never thought this! Thanks for reminding, i will make changes.
} | ||
if (pausePushData && pauseReplicate) { | ||
// trigger pause both push and replicate | ||
if (memoryUsage > pauseReplicateThreshold) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's assume that a newbie comes and try to understand the method. Why should this method return PUSH_AND_REPLICATE_PAUSED when memory usage is above pauseReplicateThreshold? Will renaming pauseReplicateThreshold to pausePushAndReplicateThreshold be a better name for understanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have same idea, maybe we should call conf celeborn.worker.directMemoryRatioToPauseReplicate
=> celeborn.worker.directMemoryRatioToPauseAll
?
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we can add comment describing the logic so that we don't need to change the config name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! Merging to main/0.3
…oryManager#currentServingState method Tweak the logic of `MemoryManager#currentServingState` Add Unit Test for this function ```mermaid graph TB A(Check Used Memory) --> B{Reach Pause Replicate Threshold} B --> | N | C{Reach Pause Push Threshold} B --> | Y | Z(Trigger Pause Push and Replicate) C --> | N | D{Reach Resume Threshold} C --> | Y | Y(Trigger Pause Push but Resume Replicate) D --> | N | E{In Pause Mode} D --> | Y | X(Trigger Resume Push and Replicate) E --> | N | U(Do Nothing) E --> | Y | Y ``` Make this method logical, and add unit test to ensure logic won't be accidental modification No Add Unit Test Closes #1811 from zwangsheng/CELEBORN-888. Authored-by: zwangsheng <[email protected]> Signed-off-by: zky.zhoukeyong <[email protected]>
@zwangsheng Could you please help by revising the state diagram |
Sure, i'd like to raise a pr to make those diagram clearly. |
…se logic is reconstructed ### What changes were proposed in this pull request? Add a new `backpressure.svg` to replace the out-date one. ### Why are the changes needed? After #1811, we refactor celeborn worker back-pressure logic, we should add new flowchart for user to understand. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? ![backpressure](https://github.com/apache/incubator-celeborn/assets/52876270/34f3f4b8-28cf-4cce-88a4-e6fee1886d94) Closes #1829 from zwangsheng/CELEBORN-905. Authored-by: zwangsheng <[email protected]> Signed-off-by: zky.zhoukeyong <[email protected]>
…se logic is reconstructed Add a new `backpressure.svg` to replace the out-date one. After #1811, we refactor celeborn worker back-pressure logic, we should add new flowchart for user to understand. Yes ![backpressure](https://github.com/apache/incubator-celeborn/assets/52876270/34f3f4b8-28cf-4cce-88a4-e6fee1886d94) Closes #1829 from zwangsheng/CELEBORN-905. Authored-by: zwangsheng <[email protected]> Signed-off-by: zky.zhoukeyong <[email protected]> (cherry picked from commit 2ffd6d7) Signed-off-by: zky.zhoukeyong <[email protected]>
…se logic is reconstructed Add a new `backpressure.svg` to replace the out-date one. After apache#1811, we refactor celeborn worker back-pressure logic, we should add new flowchart for user to understand. Yes ![backpressure](https://github.com/apache/incubator-celeborn/assets/52876270/34f3f4b8-28cf-4cce-88a4-e6fee1886d94) Closes apache#1829 from zwangsheng/CELEBORN-905. Authored-by: zwangsheng <[email protected]> Signed-off-by: zky.zhoukeyong <[email protected]>
What changes were proposed in this pull request?
Tweak the logic of
MemoryManager#currentServingState
Add Unit Test for this function
Why are the changes needed?
Make this method logical, and add unit test to ensure logic won't be accidental modification
Does this PR introduce any user-facing change?
No
How was this patch tested?
Add Unit Test