Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-905] Redraw the flowchart backpressure.svg after worker pause logic is reconstructed #1829

Closed
wants to merge 3 commits into from

Conversation

zwangsheng
Copy link
Contributor

@zwangsheng zwangsheng commented Aug 23, 2023

What changes were proposed in this pull request?

Add a new backpressure.svg to replace the out-date one.

Why are the changes needed?

After #1811, we refactor celeborn worker back-pressure logic, we should add new flowchart for user to understand.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

backpressure

@codecov
Copy link

codecov bot commented Aug 23, 2023

Codecov Report

Merging #1829 (8890db2) into main (ad890e9) will decrease coverage by 0.15%.
Report is 1 commits behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1829      +/-   ##
==========================================
- Coverage   46.59%   46.43%   -0.15%     
==========================================
  Files         163      163              
  Lines       10135    10135              
  Branches      934      934              
==========================================
- Hits         4721     4705      -16     
- Misses       5105     5118      +13     
- Partials      309      312       +3     
Files Changed Coverage Δ
...cala/org/apache/celeborn/common/CelebornConf.scala 87.48% <100.00%> (ø)

... and 2 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@waitinfuture
Copy link
Contributor

LGTM. BTW, I think default 0.5 for celeborn.worker.directMemoryRatioToResume is too small, shall we change to 0.7?

@cfmcgrady cfmcgrady changed the title [CELEBORN-905]Redraw the flowchart backpressure.svg after worker pause logic is reconstructed [CELEBORN-905] Redraw the flowchart backpressure.svg after worker pause logic is reconstructed Aug 23, 2023
@zwangsheng
Copy link
Contributor Author

LGTM. BTW, I think default 0.5 for celeborn.worker.directMemoryRatioToResume is too small, shall we change to 0.7?

+1 for setting default resume ratio to 0.7, I tried set 0.65 on a production-scale cluster with good results. Sounds 0.7 is reasonable.

@waitinfuture
Copy link
Contributor

LGTM. BTW, I think default 0.5 for celeborn.worker.directMemoryRatioToResume is too small, shall we change to 0.7?

+1 for setting default resume ratio to 0.7, I tried set 0.65 on a production-scale cluster with good results. Sounds 0.7 is reasonable.

@zwangsheng Nice, so maybe we can change the default value in this PR and the svg can be changed at the same time.

@zwangsheng
Copy link
Contributor Author

zwangsheng commented Aug 24, 2023

Nice, so maybe we can change the default value in this PR and the svg can be changed at the same time.

Updated. @waitinfuture

@@ -39,7 +39,7 @@ license: |
| celeborn.worker.directMemoryRatioForReadBuffer | 0.1 | Max ratio of direct memory for read buffer | 0.2.0 |
| celeborn.worker.directMemoryRatioToPauseReceive | 0.85 | If direct memory usage reaches this limit, the worker will stop to receive data from Celeborn shuffle clients. | 0.2.0 |
| celeborn.worker.directMemoryRatioToPauseReplicate | 0.95 | If direct memory usage reaches this limit, the worker will stop to receive replication data from other workers. This value should be higher than celeborn.worker.directMemoryRatioToPauseReceive. | 0.2.0 |
| celeborn.worker.directMemoryRatioToResume | 0.5 | If direct memory usage is less than this limit, worker will resume. | 0.2.0 |
| celeborn.worker.directMemoryRatioToResume | 0.7 | If direct memory usage is less than this limit, worker will resume. | 0.2.0 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's mention this change in the migration.md then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I will add this change to Upgrading from 0.3 to 0.4 part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zwangsheng Let's make it happen in 0.3.1

waitinfuture pushed a commit that referenced this pull request Aug 24, 2023
…se logic is reconstructed

Add a new `backpressure.svg` to replace the out-date one.

After #1811, we refactor celeborn worker back-pressure logic, we should add new flowchart for user to understand.

Yes

![backpressure](https://github.com/apache/incubator-celeborn/assets/52876270/34f3f4b8-28cf-4cce-88a4-e6fee1886d94)

Closes #1829 from zwangsheng/CELEBORN-905.

Authored-by: zwangsheng <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
(cherry picked from commit 2ffd6d7)
Signed-off-by: zky.zhoukeyong <[email protected]>
@waitinfuture
Copy link
Contributor

Thanks! Merged to main/0.3 using #1832

waitinfuture added a commit that referenced this pull request Aug 24, 2023
…e` default value changed in main/0.4

### What changes were proposed in this pull request?
As title

### Why are the changes needed?
After #1829 we set `celeborn.worker.directMemoryRatioToResume` default value from `0.5` to `0.7`.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
No

Closes #1836 from zwangsheng/CELEBORN-909.

Lead-authored-by: zwangsheng <[email protected]>
Co-authored-by: Keyong Zhou <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
zwangsheng added a commit to zwangsheng/RemoteShuffleService that referenced this pull request Aug 28, 2023
…se logic is reconstructed

Add a new `backpressure.svg` to replace the out-date one.

After apache#1811, we refactor celeborn worker back-pressure logic, we should add new flowchart for user to understand.

Yes

![backpressure](https://github.com/apache/incubator-celeborn/assets/52876270/34f3f4b8-28cf-4cce-88a4-e6fee1886d94)

Closes apache#1829 from zwangsheng/CELEBORN-905.

Authored-by: zwangsheng <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
zhouyifan279 pushed a commit to zhouyifan279/incubator-celeborn that referenced this pull request Aug 28, 2023
…e` default value changed in main/0.4

### What changes were proposed in this pull request?
As title

### Why are the changes needed?
After apache#1829 we set `celeborn.worker.directMemoryRatioToResume` default value from `0.5` to `0.7`.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
No

Closes apache#1836 from zwangsheng/CELEBORN-909.

Lead-authored-by: zwangsheng <[email protected]>
Co-authored-by: Keyong Zhou <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants