[Bug] Long tail tasks in the Write Stage retry phase results in data loss. #2300

yl09099 · 2024-12-19T02:33:14Z

Code of Conduct

I agree to follow this project's Code of Conduct

Search before asking

I have searched in the issues and found no similar issues.

Describe the bug

During the Write Stage retry phase, the MapOutputTrackerMaster clears the MapStatus entries for the corresponding shuffleId on the Driver side. However, when the shuffle has a large number of partitions, the MapStatus entries may not be completely cleared. When the Stage is then retried, fewer tasks are rerun than expected, which results in data loss. So far I have hit this with a job of 40000 partitions, and it lost data.
Below is a screenshot of my problem:

Affects Version(s)

0.10.0

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

yl09099 · 2024-12-19T02:34:40Z

@rickyma @jerqi @maobaolong @zuston @advancedxy Please help take a look at this problem.

yl09099 · 2024-12-19T08:42:42Z

The main cause is long-tail tasks left over from before the Stage retry: they are what leads to the data loss.
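To make the suspected sequence concrete, here is a toy model of it. This is only a sketch under assumptions: `ToyMapOutputTracker`, `simulate`, and the `reject_stale` flag are hypothetical names invented for illustration, not Spark or Uniffle APIs, and the model assumes that output written before the retry is no longer usable afterwards. The guard shown is one possible idea, not necessarily what the linked pull request does.

```python
class ToyMapOutputTracker:
    """Hypothetical stand-in for the Driver-side tracker (not Spark's real API)."""

    def __init__(self, num_partitions, reject_stale=False):
        self.num_partitions = num_partitions
        self.reject_stale = reject_stale
        self.current_attempt = 0
        self.statuses = {}  # map partition id -> stage attempt that registered it

    def register(self, map_id, stage_attempt):
        # Optional guard: ignore registrations from an older stage attempt.
        if self.reject_stale and stage_attempt < self.current_attempt:
            return
        self.statuses[map_id] = stage_attempt

    def clear_for_retry(self):
        # Models the Driver clearing all statuses when the stage is retried.
        self.statuses.clear()
        self.current_attempt += 1

    def missing_partitions(self):
        return [p for p in range(self.num_partitions) if p not in self.statuses]


def simulate(num_partitions=8, long_tail=(5,), reject_stale=False):
    tracker = ToyMapOutputTracker(num_partitions, reject_stale)

    # Attempt 0: every task except the long-tail ones finishes and registers.
    for p in range(num_partitions):
        if p not in long_tail:
            tracker.register(p, stage_attempt=0)

    # The stage retry starts and the statuses are cleared.
    tracker.clear_for_retry()

    # Long-tail tasks from attempt 0 finish late and register after the clear.
    for p in long_tail:
        tracker.register(p, stage_attempt=0)

    # The retry only reruns partitions that look missing.
    rerun = tracker.missing_partitions()
    for p in rerun:
        tracker.register(p, stage_attempt=1)

    # Partitions still pointing at attempt 0 were never rewritten by the retry.
    stale = sorted(p for p, a in tracker.statuses.items() if a == 0)
    return rerun, stale


print(simulate())                   # ([0, 1, 2, 3, 4, 6, 7], [5])
print(simulate(reject_stale=True))  # ([0, 1, 2, 3, 4, 5, 6, 7], [])
```

Without the guard, partition 5 is skipped by the retry because the late registration makes it look complete, which matches the "fewer tasks on retry" symptom described above. With the guard, the late registration is dropped and all partitions are rerun.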

yl09099 changed the title ~~[Bug] Incomplete clearing of Mapstatus in the Write Stage retry phase results in data loss.~~ [Bug] Long tail tasks in the Write Stage retry phase results in data loss. Dec 19, 2024

yl09099 linked a pull request Dec 19, 2024 that will close this issue

[#2300][Bug] Long tail tasks in the Write Stage retry phase results in data loss. #2301

Open
