[Regression][SPIKE] Understand DataProc Batch Jobs Failure Scenarios #1327

Closed
colin-rogers-dbt opened this issue Aug 28, 2024 · 0 comments
Labels: bug (Something isn't working), regression

colin-rogers-dbt commented Aug 28, 2024

Current Behavior

Users have reported (see #1157) inconsistent performance in production with Python models on Dataproc.

We need to map out when and why jobs on Dataproc fail, and how dbt-bigquery should handle each scenario (e.g. retry, raise a warning, etc.).

Expected/Previous Behavior

Python model execution should be stable.

Environment

- OS:
- Python:
- dbt-core (working version):
- dbt-bigquery (working version):
- dbt-core (regression version):
- dbt-bigquery (regression version):

Additional Context

Some thoughts on what to try or add following the retry refactoring:

  • it looks like we might be hanging on the model upload to GCS, since some folks say they never see the job get created
    • we take the defaults here (60s timeout)
    • we could be running into retention policy or permissions issues
    • we do not catch any errors here
  • folks reported issues specifically with serverless batch jobs hanging; we used to retry this with a custom polling method, but we now use the built-in .result to wait for the operation to finish while supplying the same timeout config from the user, so this might already be resolved
  • we do not catch any errors for serverless batch jobs or cluster jobs
    • when we run SQL models, we use an error handler on the connections class that routes certain errors to dbt errors
    • when we run SQL models, we use retry strategies that retry particular errors we have identified as transient; Google does warn against overriding the defaults for Dataproc here
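
To make the last two points concrete, here is a rough sketch of what mirroring the SQL-model approach could look like for Python models. It is only an illustration, not the dbt-bigquery implementation: every name in it (`TransientJobError`, `PermanentJobError`, `PythonModelError`, `exception_handler`, `retry_transient`, `run_python_model`) is hypothetical, and the `upload` and `submit` callables stand in for the real GCS upload and Dataproc submission. It assumes only the upload is safe to retry; job submission is wrapped in the error handler but not retried, given Google's warning about overriding the Dataproc defaults.

```python
import time
from contextlib import contextmanager


class TransientJobError(Exception):
    """Hypothetical stand-in for an error worth retrying (e.g. service unavailable)."""


class PermanentJobError(Exception):
    """Hypothetical stand-in for an error a retry will not fix (e.g. permission denied)."""


class PythonModelError(Exception):
    """Hypothetical stand-in for the dbt error surfaced to the user."""


@contextmanager
def exception_handler(step):
    """Route any low-level failure in `step` to one user-facing error type."""
    try:
        yield
    except PythonModelError:
        raise
    except Exception as exc:
        # Keep the original exception as the cause so it is not lost in logs.
        raise PythonModelError(f"{step} failed: {exc}") from exc


def retry_transient(func, max_attempts=3, initial_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying only TransientJobError with exponential backoff."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientJobError:
            if attempt == max_attempts:
                raise  # out of attempts; let the handler convert it
            sleep(delay)
            delay *= 2


def run_python_model(upload, submit, max_attempts=3, sleep=time.sleep):
    """Upload the model (with retries), then submit the job (no retries)."""
    with exception_handler("model upload"):
        retry_transient(upload, max_attempts, sleep=sleep)
    with exception_handler("job submission"):
        return submit()
```

With this shape, a flaky upload is retried a bounded number of times, and anything that still fails (in either step) reaches the user as a single error type instead of an uncaught exception. Which real errors count as transient is exactly what this spike needs to determine.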
colin-rogers-dbt added the bug, triage, and regression labels and removed the triage label on Aug 28, 2024
colin-rogers-dbt changed the title from "[Regression] Understand DataProc Batch Jobs Failure Scenarios" to "[Regression][SPIKE] Understand DataProc Batch Jobs Failure Scenarios" on Aug 28, 2024