Retry dataproc transient errors #1275
Conversation
Dataproc serverless batch jobs take at least 60 seconds to start up, which leads to a lot of meaningless polling and increases the chance of getting a transient error. In the absolute worst case this change adds 9 seconds to a model's runtime, but I would be very surprised if people are using Python models for less than 10 seconds of processing, so I doubt there will be any real-world impact from this change.
I am finding it hard to reproduce the issue in a test environment where I'm just running a handful of Python models, but the production environment hits the problem daily (10s/100s of concurrent Python models). I'll be applying this patch to our prod environment for our next daily run, and if all goes well, I'll check off the relevant testing checklist items.
I've been running this in our production environment for a couple of weeks and have not seen any further cases of Python models being incorrectly marked as failed.
Thanks for the PR, all the research, and context @OSalama! I wound up running into this same issue when addressing retries in general.
resolves #1271
docs dbt-labs/docs.getdbt.com/# N/A
Problem
Python models configured to use Dataproc serverless jobs fail when the polling API call returns a transient 5xx error. The actual Dataproc batch job still runs to completion and the data becomes available in the table, but all downstream models are skipped because dbt thinks the model failed.
Solution
This PR solves the problem by adding retries with exponential backoff to the API polling calls, so that all transient API errors are retried.
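The retry behavior described above can be sketched in plain Python. This is an illustration, not the PR's actual implementation (which wires retries into the Dataproc client); `poll_fn` and `TransientServerError` are hypothetical stand-ins for the polling call and a 5xx response.

```python
import random
import time


class TransientServerError(Exception):
    """Stand-in for a transient 5xx error from the polling API."""


def poll_with_retry(poll_fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call poll_fn, retrying transient failures with exponential backoff.

    Delay doubles each attempt (capped at max_delay), with a little jitter
    so many concurrent models don't retry in lockstep. The last failure is
    re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return poll_fn()
        except TransientServerError:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

Jitter matters here because the reported failures happen under 10s/100s of concurrent Python models; without it, retries from many jobs would hit the API at the same instants.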
It also increases the time between polling calls while the dataproc job is still in a "pending" state. This is because Dataproc serverless batch jobs take at least 60 seconds to start up, which leads to a lot of additional polling, and increases the chance of getting a transient error.
This increase in time does mean that any model that normally completes within 10 seconds of the Dataproc job transitioning from "pending" to "running" will see increased runtime. In practice I would be very surprised if this caused any issues, as it only affects jobs submitted via Dataproc serverless, which are expected to run longer than jobs on interactive/dedicated clusters.
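The state-aware polling cadence can be sketched as a simple loop: poll slowly while the batch is still starting up (since Dataproc serverless takes 60+ seconds to leave "pending"), then more frequently once it is running. `get_state` is a hypothetical callable standing in for the batch-status API call; the interval values are illustrative, not the PR's exact numbers.

```python
import time


def wait_for_batch(get_state, pending_interval=10.0, running_interval=2.0,
                   timeout=3600.0):
    """Poll a batch job until it reaches a terminal state.

    Uses a longer sleep while the job is PENDING (startup takes a minute
    or more, so frequent polling is wasted work and extra 5xx exposure)
    and a shorter sleep once it is RUNNING.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state in ("SUCCEEDED", "FAILED"):
            return state
        time.sleep(pending_interval if state == "PENDING" else running_interval)
    raise TimeoutError("batch did not reach a terminal state in time")
```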
I have run this in our production environment for 2 weeks with no further issues.
I generated a patch file by checking out 9efeb859d1ac21cab1bb6441acc06ec1328e4888 (release 1.8.1), modifying batch.py with my changes, then running `git diff batch.py > batch.py.patch`.
I then used the below Dockerfile to apply the patch on top of Python 3.9.8 and dbt-bigquery 1.8.1.
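The author's actual Dockerfile was not captured in this page. A minimal sketch of how such a patch could be applied is below; the site-packages path and patch file name are assumptions, not taken from the PR.

```dockerfile
FROM python:3.9.8-slim

# `patch` is not in the slim image; assumption: Debian base
RUN apt-get update && apt-get install -y --no-install-recommends patch \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir dbt-bigquery==1.8.1

# Apply the local patch over the installed package's batch.py.
# The module path below is an assumption and may differ by version.
COPY batch.py.patch /tmp/batch.py.patch
RUN cd /usr/local/lib/python3.9/site-packages/dbt/adapters/bigquery \
    && patch batch.py /tmp/batch.py.patch
```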
Checklist