[CELEBORN-882][WORKER][METRICS] Add Pause Push Data Time Count Metrics & Dashboard Panel #1800

Closed
wants to merge 6 commits into from

zwangsheng
Contributor

@zwangsheng zwangsheng commented Aug 8, 2023

What changes were proposed in this pull request?

Add PausePushDataTime Metrics

Why are the changes needed?

Track the total time each Celeborn worker spends with push data paused.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Cluster Test

@zwangsheng zwangsheng requested review from pan3793 and FMX August 8, 2023 10:07
@zwangsheng
Contributor Author

[Screenshot 2023-08-08 18 08 16]

@zwangsheng zwangsheng self-assigned this Aug 8, 2023
@codecov

codecov bot commented Aug 8, 2023

Codecov Report

Merging #1800 (7286c42) into main (d7e900f) will increase coverage by 0.03%.
Report is 1 commits behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1800      +/-   ##
==========================================
+ Coverage   46.49%   46.51%   +0.03%     
==========================================
  Files         164      164              
  Lines       10222    10228       +6     
  Branches      936      936              
==========================================
+ Hits         4752     4757       +5     
- Misses       5156     5157       +1     
  Partials      314      314              
Files Changed Coverage Δ
...cala/org/apache/celeborn/common/CelebornConf.scala 87.49% <100.00%> (+0.03%) ⬆️

... and 2 files with indirect coverage changes


@@ -65,7 +65,9 @@ public class MemoryManager {
private final AtomicLong diskBufferCounter = new AtomicLong(0);
private final LongAdder pausePushDataCounter = new LongAdder();
private final LongAdder pausePushDataAndReplicateCounter = new LongAdder();
private final AtomicLong pausePushDataTime = new AtomicLong(0);
Member

LongAdder

Contributor Author

Fixed.

@@ -163,6 +166,7 @@ private MemoryManager(CelebornConf conf) {
logger.info("Trigger action: RESUME PUSH and REPLICATE");
memoryPressureListeners.forEach(
    memoryPressureListener -> memoryPressureListener.onResume("all"));
pausePushDataTime.addAndGet(System.currentTimeMillis() - lastPauseTime);
Member

The state machine does not look correct to me.

Contributor Author

RESUME is always triggered after PAUSE_PUSH.

Member

What if the worker goes into PUSH_AND_REPLICATE_PAUSED?

@zwangsheng zwangsheng marked this pull request as draft August 9, 2023 06:14
@zwangsheng
Contributor Author

Marked as draft; some logic needs to be optimized before this can proceed.

@@ -140,6 +142,7 @@ private MemoryManager(CelebornConf conf) {
if (lastState != servingState) {
  logger.info("Serving state transformed from {} to {}", lastState, servingState);
  if (servingState == ServingState.PUSH_PAUSED) {
    lastPauseTime = System.currentTimeMillis();
Contributor

PUSH_AND_REPLICATE_PAUSED also pauses push data, and a Celeborn worker might change to PUSH_AND_REPLICATE_PAUSED directly if memory pressure is high.
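
One way to address this concern is to time the pause from the first transition into any paused state until the worker returns to the non-paused state, so that entering PUSH_AND_REPLICATE_PAUSED directly is still counted. The following is a minimal sketch under that assumption, not the code that was merged: `PauseTimeTracker`, `onState`, and the `NONE_PAUSED` state name are hypothetical, and the timestamp is passed in so the behavior is easy to check.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch; names are illustrative and not taken from Celeborn's MemoryManager. */
class PauseTimeTracker {
    // PUSH_PAUSED and PUSH_AND_REPLICATE_PAUSED come from this thread; NONE_PAUSED is an assumed name.
    enum ServingState { NONE_PAUSED, PUSH_PAUSED, PUSH_AND_REPLICATE_PAUSED }

    private final LongAdder pausePushDataTime = new LongAdder();
    private ServingState lastState = ServingState.NONE_PAUSED;
    private long pauseStartTime = 0L;

    private static boolean isPaused(ServingState state) {
        // Both paused states stop push data, so both count toward the pause time.
        return state != ServingState.NONE_PAUSED;
    }

    /** Expected to be called from a single thread on every state check. */
    void onState(ServingState current, long nowMillis) {
        boolean wasPaused = isPaused(lastState);
        boolean nowPaused = isPaused(current);
        if (!wasPaused && nowPaused) {
            // Entering a paused state from the non-paused state: start the clock.
            pauseStartTime = nowMillis;
        } else if (wasPaused && !nowPaused) {
            // Leaving the paused states: add the elapsed time to the total.
            pausePushDataTime.add(nowMillis - pauseStartTime);
        }
        // Moving between the two paused states neither restarts nor stops the clock.
        lastState = current;
    }

    long totalPausedMillis() {
        return pausePushDataTime.sum();
    }
}
```

With this shape, a worker that goes straight to PUSH_AND_REPLICATE_PAUSED, drops to PUSH_PAUSED, and then resumes contributes one continuous interval to the total.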

Contributor Author

Got it. Once the logic here is clarified in #1835, I will continue with this PR.

@zwangsheng zwangsheng requested review from FMX and pan3793 August 28, 2023 02:39
@zwangsheng zwangsheng marked this pull request as ready for review August 28, 2023 02:39
@zwangsheng
Contributor Author

The CI failure is related to #1844; I will rebase and test again.

long pauseSpendMills = System.currentTimeMillis() - pauseStartTime;
logger.info(
    "Trigger action: RESUME PUSH and REPLICATE, pause push spent: " + pauseSpendMills);
pausePushDataTime.add(pauseSpendMills);
Contributor

Consider the corner case where a worker enters the pause state but never leaves it: pausePushDataTime will then never increase. Maybe we can increase pausePushDataTime every N successive pause states, as well as when the state changes from pause to non_pause.

Also, I think we should consider a worker to be under high load only when its pause time exceeds some threshold, rather than whenever it enters the pause state, as #1840 does. cc @pan3793

Contributor Author

Consider the corner case where a worker enters the pause state but never leaves it: pausePushDataTime will then never increase. Maybe we can increase pausePushDataTime every N successive pause states, as well as when the state changes from pause to non_pause.

This is a good idea.
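
The suggestion above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the merged change: `PeriodicPauseAccumulator`, `onCheck`, and `forceAppendThreshold` are hypothetical names, the check is assumed to run on a single thread, and the caller supplies the timestamp.

```java
/** Hypothetical sketch; class, method, and field names are illustrative only. */
class PeriodicPauseAccumulator {
    private final int forceAppendThreshold; // the "N" from the suggestion (assumed name)
    private long pausePushDataTime = 0L;    // accumulated pause time in milliseconds
    private long pauseStartTime = 0L;       // start of the interval not yet appended
    private boolean inPause = false;
    private int pausedChecks = 0;           // successive paused checks since the last append

    PeriodicPauseAccumulator(int forceAppendThreshold) {
        this.forceAppendThreshold = forceAppendThreshold;
    }

    /** Called once per periodic state check, from a single thread. */
    void onCheck(boolean paused, long nowMillis) {
        if (paused && !inPause) {
            // First paused check: start timing.
            inPause = true;
            pauseStartTime = nowMillis;
            pausedChecks = 0;
        } else if (paused) {
            // Still paused: after N successive checks, append the partial interval
            // so the metric keeps moving even if the worker never resumes.
            if (++pausedChecks >= forceAppendThreshold) {
                append(nowMillis);
            }
        } else if (inPause) {
            // Pause to non-pause transition: append the remainder.
            append(nowMillis);
            inPause = false;
        }
    }

    private void append(long nowMillis) {
        pausePushDataTime += nowMillis - pauseStartTime;
        pauseStartTime = nowMillis; // the next interval starts where this one ended
        pausedChecks = 0;
    }

    long totalPausedMillis() {
        return pausePushDataTime;
    }
}
```

Resetting `pauseStartTime` on every append keeps intervals from being double-counted when the worker eventually resumes.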

@waitinfuture
Contributor

[Screenshot 2023-08-08 18 08 16]

According to the screenshot, the metric will increase monotonically. Would it be better to define the metric as the pause time in the last xxx minutes?

@zwangsheng
Contributor Author

@waitinfuture
Sorry for the late reply. I have now added logic to force-append the pause time spent after N TRIM actions.

Regarding your suggestion to show the pause time in the last xxx minutes: IMO, users may also care about how long the cluster has been in a back-pressure state overall. The pause time in the last xxx minutes tells us the state within a specific time frame, but it does not give a macroscopic view of pausing over the whole duration.

The current PR implementation supports both of the views mentioned above.

If you have any other ideas, please let me know.
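
To illustrate why a monotonically increasing total can serve both views, here is a small sketch that derives "pause time in the last xxx minutes" by differencing samples of the cumulative value. `PauseTimeWindow`, `record`, and `pauseInWindow` are hypothetical names, not part of this PR; in practice the differencing would more likely happen in the monitoring or dashboard layer than in the worker.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: derives a windowed value from samples of a cumulative counter. */
class PauseTimeWindow {
    private static final class Sample {
        final long timestampMillis;
        final long cumulativePauseMillis;

        Sample(long timestampMillis, long cumulativePauseMillis) {
            this.timestampMillis = timestampMillis;
            this.cumulativePauseMillis = cumulativePauseMillis;
        }
    }

    private final Deque<Sample> samples = new ArrayDeque<>();
    private final long windowMillis;

    PauseTimeWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Records one scrape of the cumulative metric; timestamps must be non-decreasing. */
    void record(long timestampMillis, long cumulativePauseMillis) {
        samples.addLast(new Sample(timestampMillis, cumulativePauseMillis));
        // Drop samples that fell out of the window. The sample just added is never
        // older than the cutoff, so the deque cannot become empty here.
        long cutoff = timestampMillis - windowMillis;
        while (samples.peekFirst().timestampMillis < cutoff) {
            samples.removeFirst();
        }
    }

    /** Pause time accumulated between the oldest and newest samples still in the window. */
    long pauseInWindow() {
        if (samples.isEmpty()) {
            return 0L;
        }
        return samples.peekLast().cumulativePauseMillis - samples.peekFirst().cumulativePauseMillis;
    }
}
```

The cumulative value answers "how long has this worker been paused in total", and the difference between two samples answers "how long was it paused recently", which is the point made in the reply above.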

@@ -676,6 +676,8 @@ class CelebornConf(loadDefaults: Boolean) extends Cloneable with Logging with Se
def metricsAppTopDiskUsageCount: Int = get(METRICS_APP_TOP_DISK_USAGE_COUNT)
def metricsAppTopDiskUsageWindowSize: Int = get(METRICS_APP_TOP_DISK_USAGE_WINDOW_SIZE)
def metricsAppTopDiskUsageInterval: Long = get(METRICS_APP_TOP_DISK_USAGE_INTERVAL)
def metricsWorkerForceAppendPauseSpentTimeThreshold: Int =
  get(METRICS_WORKER_PAUSE_SPENT_TINE_FORCE_APPEND_THRESHOLD)
Contributor

Typo: this should be METRICS_WORKER_PAUSE_SPENT_TIME_FORCE_APPEND_THRESHOLD.

if (trimCounter.incrementAndGet() >= forceAppendPauseSpentTimeThreshold) {
  logger.debug(
      "Trigger action: TRIM for {} times, force to append pause spent time.",
      trimCounter.incrementAndGet());
Contributor

get
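
The one-word comment presumably refers to the second `incrementAndGet()` in the log call: Java evaluates method arguments before the call, so the counter is bumped a second time whether or not debug logging is enabled. The sketch below shows the difference with a standalone `AtomicInteger`; `TrimCounterDemo` and its methods are hypothetical helpers written only for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo of reading a counter for logging with incrementAndGet() versus get(). */
class TrimCounterDemo {
    /** Mimics the reviewed pattern: one increment for the check, another while building the log argument. */
    static int loggedWithIncrementAndGet(AtomicInteger counter) {
        counter.incrementAndGet();        // stands in for the threshold check
        return counter.incrementAndGet(); // log argument: increments again as a side effect
    }

    /** Mimics the suggested fix: increment once, then read without modifying. */
    static int loggedWithGet(AtomicInteger counter) {
        counter.incrementAndGet();        // stands in for the threshold check
        return counter.get();             // log argument: plain read
    }
}
```

Starting from zero, the first variant leaves the counter at 2 after a single TRIM, while the second leaves it at 1, which is the value that matches the number of TRIM actions.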

@@ -65,8 +67,11 @@ public class MemoryManager {
private final AtomicLong diskBufferCounter = new AtomicLong(0);
private final LongAdder pausePushDataCounter = new LongAdder();
private final LongAdder pausePushDataAndReplicateCounter = new LongAdder();
private final LongAdder pausePushDataTime = new LongAdder();
Contributor

Since switchServingState will be called sequentially in a single thread, I think it's safe to use a plain long/int for pausePushDataTime and trimCounter.

@zwangsheng
Contributor Author

@waitinfuture Thanks for your review. I have made the corrections, PTAL.

private volatile boolean isPaused = false;
private final AtomicInteger trimCounter = new AtomicInteger(0);
Contributor

Does trimCounter need to be atomic?

Contributor Author

Addressed.

Contributor

@waitinfuture waitinfuture left a comment

LGTM, thanks! Merging to main/0.3

waitinfuture pushed a commit that referenced this pull request Sep 12, 2023
…ics & Dashboard Panel

### What changes were proposed in this pull request?
Add `PausePushDataTime` Metrics

### Why are the changes needed?
Count each celeborn worker pause time.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster Test

Closes #1800 from zwangsheng/CELEBORN-882.

Lead-authored-by: zwangsheng <[email protected]>
Co-authored-by: zwangsheng <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
(cherry picked from commit 03a3981)
Signed-off-by: zky.zhoukeyong <[email protected]>