[BUG] Investigate GpuRand usage outside of ProjectExec #11649

Open
revans2 opened this issue Oct 23, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@revans2
Collaborator

revans2 commented Oct 23, 2024

Describe the bug
#11646 was filed because GpuRand did not work correctly in all cases. #11647 fixed that issue, but it exposed another problem: we do not do a checkpoint/restore retry in every location where GpuRand can run. That means that if a GpuRand runs outside of a regular project exec, we might produce incorrect numbers on a retry.

#11647 (review)
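
To make the failure mode concrete, here is a minimal Python sketch of why a retry without a checkpoint/restore changes the output of a seeded random expression. It is not the plugin's code: `SeededRand`, `RetryOOM`, and `run_with_retry` are hypothetical names invented for illustration, and Python's `random.Random` stands in for the GPU generator.

```python
import random


class RetryOOM(Exception):
    """Hypothetical stand-in for a retryable out-of-memory signal."""


class SeededRand:
    """Toy stateful random expression (an assumption, not the real GpuRand)."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._saved = None

    def checkpoint(self):
        # Remember the generator state before the batch is evaluated.
        self._saved = self._rng.getstate()

    def restore(self):
        # Rewind the generator to the last checkpoint.
        self._rng.setstate(self._saved)

    def eval_batch(self, num_rows):
        return [self._rng.random() for _ in range(num_rows)]


def run_with_retry(expr, num_rows, use_checkpoint):
    """Evaluate one batch, simulating a single failure after the first attempt."""
    if use_checkpoint:
        expr.checkpoint()
    attempt = 0
    while True:
        try:
            values = expr.eval_batch(num_rows)
            if attempt == 0:
                # The work was done (and the RNG advanced) before the failure.
                raise RetryOOM()
            return values
        except RetryOOM:
            attempt += 1
            if use_checkpoint:
                expr.restore()


# What a run with no failures would have produced.
expected = SeededRand(42).eval_batch(4)
with_restore = run_with_retry(SeededRand(42), 4, use_checkpoint=True)
without_restore = run_with_retry(SeededRand(42), 4, use_checkpoint=False)

print(with_restore == expected)
print(without_restore == expected)
```

With the checkpoint in place the retried batch matches the failure-free run. Without it, the retry consumes the next values in the stream, so the result depends on whether a retry happened.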

So we need to do a few things for a complete solution:

  1. We need to go through the Spark code and identify every place where a non-deterministic expression could be run. We can do this by looking at all of the places that call initialize on non-deterministic expressions.
  2. We need code changes so that if a retry happens on a non-deterministic expression that is outside of a checkpoint/restore, then we fail instead of retrying.
  3. We also want a way to detect a non-deterministic expression being run outside of a checkpoint/restore retry block and throw an error from the plan, so that tests can validate that we have this covered.
  4. We need a lot more tests to verify that we are doing the right thing with GpuRand.
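
Items 2 and 3 could look roughly like the following Python sketch. It only illustrates the idea and assumes nothing about the plugin's real API: `GuardedRand`, `checkpoint_restore_scope`, `with_retry`, the `strict` flag, and both exception types are hypothetical.

```python
import random
from contextlib import contextmanager


class RetryOOM(Exception):
    """Hypothetical stand-in for a retryable out-of-memory signal."""


class UnprotectedNondeterministicError(RuntimeError):
    """Hypothetical error for non-deterministic work outside checkpoint/restore."""


class GuardedRand:
    """Toy random expression that knows whether it is inside a protected scope."""

    def __init__(self, seed, strict=False):
        self._rng = random.Random(seed)
        self._saved = None
        self._protected = False
        self._strict = strict  # item 3: test-only mode that fails fast

    @property
    def protected(self):
        return self._protected

    @contextmanager
    def checkpoint_restore_scope(self):
        # Checkpoint on entry; retries inside the scope may call restore().
        self._saved = self._rng.getstate()
        self._protected = True
        try:
            yield self
        finally:
            self._protected = False

    def restore(self):
        self._rng.setstate(self._saved)

    def eval_batch(self, num_rows):
        if self._strict and not self._protected:
            raise UnprotectedNondeterministicError(
                "non-deterministic expression evaluated outside checkpoint/restore")
        return [self._rng.random() for _ in range(num_rows)]


def with_retry(expr, body, max_attempts=3):
    """Retry body() only when expr can be restored; otherwise fail (item 2)."""
    for _ in range(max_attempts):
        try:
            return body()
        except RetryOOM as oom:
            if not expr.protected:
                raise UnprotectedNondeterministicError(
                    "retry requested outside checkpoint/restore") from oom
            expr.restore()
    raise RetryOOM("gave up after %d attempts" % max_attempts)


def make_flaky(expr, num_rows):
    """Build a body that evaluates the expression and fails on its first call."""
    calls = {"count": 0}

    def body():
        values = expr.eval_batch(num_rows)
        calls["count"] += 1
        if calls["count"] == 1:
            raise RetryOOM()
        return values

    return body


# Protected: the retry restores the state and returns the failure-free values.
protected_expr = GuardedRand(7)
with protected_expr.checkpoint_restore_scope():
    protected_result = with_retry(protected_expr, make_flaky(protected_expr, 3))
```

Calling `with_retry` on the same flaky body without entering `checkpoint_restore_scope` raises `UnprotectedNondeterministicError` rather than silently returning shifted values. A test suite could turn on something like the `strict` flag to catch unprotected call sites even when no retry happens.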
revans2 added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels on Oct 23, 2024
@mattahrens
Collaborator

The initial scope is to build a repro so we can understand the path that causes the issue.

mattahrens removed the "? - Needs Triage" (Need team to review and classify) label on Nov 6, 2024
4 participants