[Bug] Long tail tasks in the Write Stage retry phase results in data loss. #2300

Open
3 tasks done
yl09099 opened this issue Dec 19, 2024 · 2 comments · May be fixed by #2301

Comments

@yl09099
Contributor

yl09099 commented Dec 19, 2024

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

During the Write Stage retry phase, the MapOutputTrackerMaster clears the MapStatus entries for the shuffleId on the Driver side. However, with a large number of partitions, the MapStatus entries may not be cleared completely. When the Stage is then retried, fewer tasks are launched than expected, which results in data loss. I hit this data loss on a job with 40000 partitions.
Below is a screenshot of my problem:
image

Affects Version(s)

0.10.0

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@yl09099
Contributor Author

yl09099 commented Dec 19, 2024

@rickyma @jerqi @maobaolong @zuston @advancedxy Please help take a look at this problem.

@yl09099 yl09099 changed the title [Bug] Incomplete clearing of Mapstatus in the Write Stage retry phase results in data loss. [Bug] Long tail tasks in the Write Stage retry phase results in data loss. Dec 19, 2024
@yl09099
Contributor Author

yl09099 commented Dec 19, 2024

This is mainly caused by a long-tail task left over from before the Stage retry, which results in data loss.
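To make the suspected sequence easier to discuss, here is a minimal toy simulation in Python. It is only a sketch of the scenario described above, not Spark's or Uniffle's actual code: `ToyMapOutputTracker`, `simulate`, and the attempt numbering are all hypothetical names, and it assumes that the retry schedules only the partitions that have no registered MapStatus, and that output from the earlier attempt is no longer valid after the retry.

```python
class ToyMapOutputTracker:
    """Hypothetical stand-in for a driver-side MapStatus registry (not Spark's API)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.statuses = {}  # map_id -> stage attempt that registered the status

    def register(self, map_id, stage_attempt):
        self.statuses[map_id] = stage_attempt

    def clear_all(self):
        self.statuses.clear()

    def missing_partitions(self):
        # Partitions without a registered status are the ones a retry would run.
        return [m for m in range(self.num_partitions) if m not in self.statuses]


def simulate(num_partitions, long_tail_map_id):
    tracker = ToyMapOutputTracker(num_partitions)

    # Attempt 0: every task except the long-tail one finishes and registers.
    for map_id in range(num_partitions):
        if map_id != long_tail_map_id:
            tracker.register(map_id, 0)

    # The Stage retry starts: the driver clears all statuses for the shuffle.
    tracker.clear_all()

    # The long-tail task from attempt 0 finishes late and registers after the clear.
    tracker.register(long_tail_map_id, 0)

    # Attempt 1 (assumption): only partitions with no status are scheduled.
    to_run = tracker.missing_partitions()
    for map_id in to_run:
        tracker.register(map_id, 1)

    # Partitions still pointing at attempt 0 were never rewritten by the retry.
    stale = [m for m, attempt in tracker.statuses.items() if attempt != 1]
    return to_run, stale


to_run, stale = simulate(num_partitions=8, long_tail_map_id=5)
print(len(to_run), stale)
```

In this toy run the retry launches 7 tasks instead of 8, and partition 5 keeps the status registered by the old attempt. If the real behavior matches these assumptions, that would explain why the retried Stage has fewer tasks and why the data for the long-tail partition is lost.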
