Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-888][WORKER]Tweak the logic and add unit tests for the MemoryManager#currentServingState method #1811

Closed
wants to merge 7 commits into from

Conversation

zwangsheng
Copy link
Contributor

@zwangsheng zwangsheng commented Aug 14, 2023

What changes were proposed in this pull request?

Tweak the logic of MemoryManager#currentServingState

Add Unit Test for this function

graph TB

A(Check Used Memory) --> B{Reach Pause Replicate Threshold}
B --> | N | C{Reach Pause Push Threshold}
B --> | Y | Z(Trigger Pause Push and Replicate)
C --> | N | D{Reach Resume Threshold}
C --> | Y | Y(Trigger Pause Push but Resume Replicate)
D --> | N | E{In Pause Mode}
D --> | Y | X(Trigger Resume Push and Replicate)
E --> | N | U(Do Nothing)
E --> | Y | Y
Loading

Why are the changes needed?

Make this method logical, and add unit test to ensure logic won't be accidental modification

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add Unit Test

@zwangsheng zwangsheng requested review from pan3793 and FMX and removed request for pan3793 August 14, 2023 07:08
memoryPressureListener ->
memoryPressureListener.onPause(TransportModuleConstants.REPLICATE_MODULE));
trimAllListeners();
// disabled trim thread in test case
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is hacky, and why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the unit test, we want to trigger the threshold for each state, in order to avoid this thread, reset the current service state.

For example, we set sortMemory to trigger PAUSE_PUSH state, we don't want this thread trim memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes much sense. Instead, memoryPressureListeners may be a good joint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the test scenario, memoryPressureListeners shall be empty. The trim won't do anything.

@pan3793
Copy link
Member

pan3793 commented Aug 14, 2023

Make this method logical ...

Any details? What problem do you find in the current implementation?

@zwangsheng
Copy link
Contributor Author

Make this method logical ...

Any details? What problem do you find in the current implementation?

The main thing is to add unit tests to ensure that this piece of logic is changed by mistake.

The modification of the method is to reduce judgments and avoid making unnecessary comparisons before the method returns.

@zwangsheng
Copy link
Contributor Author

As sbt unit test fail, WIP @cfmcgrady

@cfmcgrady
Copy link
Contributor

As sbt unit test fail, WIP @cfmcgrady

It occurred randomly, and I'm still investigating. Let's temporarily disregard it.

@cfmcgrady
Copy link
Contributor

As sbt unit test fail, WIP @cfmcgrady

Fixed in #1818. cc @zwangsheng

@zwangsheng
Copy link
Contributor Author

As sbt unit test fail, WIP @cfmcgrady

Fixed in #1818. cc @zwangsheng

Thanks, i'll rebase and retest.

@@ -42,4 +41,42 @@ class MemoryManagerSuite extends AnyFunSuite with BeforeAndAfterEach {
caught.getMessage == s"Invalid config, ${WORKER_DIRECT_MEMORY_RATIO_PAUSE_REPLICATE.key}(0.85) " +
s"should be greater than ${WORKER_DIRECT_MEMORY_RATIO_PAUSE_RECEIVE.key}(0.95)")
}

test("[CELEBORN-888] Test MemoryManager#currentServingState trigger case") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case looks good to me.

@zwangsheng zwangsheng requested a review from FMX August 22, 2023 06:29
@codecov
Copy link

codecov bot commented Aug 22, 2023

Codecov Report

Merging #1811 (7a517ce) into main (4f47d0a) will increase coverage by 0.02%.
Report is 5 commits behind head on main.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #1811      +/-   ##
==========================================
+ Coverage   46.43%   46.45%   +0.02%     
==========================================
  Files         163      163              
  Lines       10135    10135              
  Branches      934      934              
==========================================
+ Hits         4705     4707       +2     
+ Misses       5118     5117       -1     
+ Partials      312      311       -1     

see 3 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

memoryPressureListener.onPause(TransportModuleConstants.REPLICATE_MODULE));
trimAllListeners();
// disabled trim thread in test case
if (!Boolean.valueOf(conf.get("celeborn.testing", "false"))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that in your test scenario the memory pressure listeners are empty so maybe this check is unnecessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never thought this! Thanks for reminding, i will make changes.

}
if (pausePushData && pauseReplicate) {
// trigger pause both push and replicate
if (memoryUsage > pauseReplicateThreshold) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume that a newbie comes and try to understand the method. Why should this method return PUSH_AND_REPLICATE_PAUSED when memory usage is above pauseReplicateThreshold? Will renaming pauseReplicateThreshold to pausePushAndReplicateThreshold be a better name for understanding?

Copy link
Contributor Author

@zwangsheng zwangsheng Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have same idea, maybe we should call conf celeborn.worker.directMemoryRatioToPauseReplicate => celeborn.worker.directMemoryRatioToPauseAll?

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we can add comment describing the logic so that we don't need to change the config name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

Copy link
Contributor

@waitinfuture waitinfuture left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! Merging to main/0.3

waitinfuture pushed a commit that referenced this pull request Aug 23, 2023
…oryManager#currentServingState method

Tweak the logic of `MemoryManager#currentServingState`

Add Unit Test for this function

```mermaid
graph TB

A(Check Used Memory) --> B{Reach Pause Replicate Threshold}
B --> | N | C{Reach Pause Push Threshold}
B --> | Y | Z(Trigger Pause Push and Replicate)
C --> | N | D{Reach Resume Threshold}
C --> | Y | Y(Trigger Pause Push but Resume Replicate)
D --> | N | E{In Pause Mode}
D --> | Y | X(Trigger Resume Push and Replicate)
E --> | N | U(Do Nothing)
E --> | Y | Y
```
Make this method logical, and add unit test to ensure logic won't be accidental modification

No

Add Unit Test

Closes #1811 from zwangsheng/CELEBORN-888.

Authored-by: zwangsheng <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
@cfmcgrady
Copy link
Contributor

@zwangsheng Could you please help by revising the state diagram docs/assets/img/backpressure.svg in line with this refactor? The current state diagram docs/assets/img/backpressure.svg is unclear.

@zwangsheng
Copy link
Contributor Author

Could you please help by revising the state diagram docs/assets/img/backpressure.svg in line with this refactor? The current state diagram docs/assets/img/backpressure.svg is unclear.

Sure, i'd like to raise a pr to make those diagram clearly.

waitinfuture pushed a commit that referenced this pull request Aug 24, 2023
…se logic is reconstructed

### What changes were proposed in this pull request?
Add a new `backpressure.svg` to replace the out-date one.

### Why are the changes needed?
After #1811, we refactor celeborn worker back-pressure logic, we should add new flowchart for user to understand.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?

![backpressure](https://github.com/apache/incubator-celeborn/assets/52876270/34f3f4b8-28cf-4cce-88a4-e6fee1886d94)

Closes #1829 from zwangsheng/CELEBORN-905.

Authored-by: zwangsheng <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
waitinfuture pushed a commit that referenced this pull request Aug 24, 2023
…se logic is reconstructed

Add a new `backpressure.svg` to replace the out-date one.

After #1811, we refactor celeborn worker back-pressure logic, we should add new flowchart for user to understand.

Yes

![backpressure](https://github.com/apache/incubator-celeborn/assets/52876270/34f3f4b8-28cf-4cce-88a4-e6fee1886d94)

Closes #1829 from zwangsheng/CELEBORN-905.

Authored-by: zwangsheng <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
(cherry picked from commit 2ffd6d7)
Signed-off-by: zky.zhoukeyong <[email protected]>
zwangsheng added a commit to zwangsheng/RemoteShuffleService that referenced this pull request Aug 28, 2023
…se logic is reconstructed

Add a new `backpressure.svg` to replace the out-date one.

After apache#1811, we refactor celeborn worker back-pressure logic, we should add new flowchart for user to understand.

Yes

![backpressure](https://github.com/apache/incubator-celeborn/assets/52876270/34f3f4b8-28cf-4cce-88a4-e6fee1886d94)

Closes apache#1829 from zwangsheng/CELEBORN-905.

Authored-by: zwangsheng <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants