Retry dataproc transient errors #1275
Conversation
Dataproc serverless batch jobs take at least 60 seconds to start up, which leads to a lot of meaningless polling and increases the chance of getting a transient error. In the absolute worst case this change adds 9 seconds to a model's runtime, but I would be very surprised if people are using Python models for less than 10 seconds of processing, so I doubt there will be any real-world impact from this change.
I am finding it hard to reproduce the issue in a test environment where I'm just running a handful of Python models, but the production environment hits the problem daily (10s/100s of concurrent Python models). I'll be applying this patch to our prod environment for our next daily run, and if all goes well, I'll check off the relevant testing checklist items.
I've been running this in our production environment for a couple of weeks and have not seen any further cases of Python models being incorrectly marked as failed.
Thanks for the PR, all the research, and context @OSalama! I wound up running into this same issue when addressing retries in general.
resolves #1271
docs dbt-labs/docs.getdbt.com/# N/A
Problem
Python models configured to use Dataproc serverless jobs fail when the polling API call returns a transient 5xx error. The actual Dataproc batch job still runs to completion and the data becomes available in the table, but all downstream models are skipped because dbt thinks the model failed.
Solution
This PR solves the problem by adding retries with exponential backoff to the API polling calls, so that all transient API errors are retried.
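The retry behavior described above can be sketched in plain Python. This is an illustration, not the PR's actual implementation (which wires retries into the Dataproc client); `poll_fn` and `TransientServerError` are hypothetical stand-ins for the polling call and a 5xx response.

```python
import random
import time


class TransientServerError(Exception):
    """Stand-in for a transient 5xx error from the polling API."""


def poll_with_retry(poll_fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call poll_fn, retrying transient failures with exponential backoff.

    Delay doubles each attempt (capped at max_delay), with a little jitter
    so many concurrent models don't retry in lockstep. The last failure is
    re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return poll_fn()
        except TransientServerError:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

Jitter matters here because the reported failures happen under 10s/100s of concurrent Python models; without it, retries from many jobs would hit the API at the same instants.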
It also increases the time between polling calls while the dataproc job is still in a "pending" state. This is because Dataproc serverless batch jobs take at least 60 seconds to start up, which leads to a lot of additional polling, and increases the chance of getting a transient error.
This increase in time does mean that any model that normally completes within 10 seconds of the Dataproc job transitioning from "pending" to "running" will see increased runtime. In practice I would be very surprised if this caused any issues, as it only affects jobs submitted via Dataproc serverless, which are expected to run longer than jobs on interactive/dedicated clusters.
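The state-aware polling cadence can be sketched as a simple loop: poll slowly while the batch is still starting up (since Dataproc serverless takes 60+ seconds to leave "pending"), then more frequently once it is running. `get_state` is a hypothetical callable standing in for the batch-status API call; the interval values are illustrative, not the PR's exact numbers.

```python
import time


def wait_for_batch(get_state, pending_interval=10.0, running_interval=2.0,
                   timeout=3600.0):
    """Poll a batch job until it reaches a terminal state.

    Uses a longer sleep while the job is PENDING (startup takes a minute
    or more, so frequent polling is wasted work and extra 5xx exposure)
    and a shorter sleep once it is RUNNING.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state in ("SUCCEEDED", "FAILED"):
            return state
        time.sleep(pending_interval if state == "PENDING" else running_interval)
    raise TimeoutError("batch did not reach a terminal state in time")
```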
I have run this in our production environment for 2 weeks with no further issues.
I generated a patch file by checking out 9efeb859d1ac21cab1bb6441acc06ec1328e4888 (release 1.8.1), modifying batch.py with my changes, then running `git diff batch.py > batch.py.patch`.
I then used the below Dockerfile to apply the patch on top of Python 3.9.8 and dbt-bigquery 1.8.1.
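The author's actual Dockerfile was not captured in this page. A minimal sketch of how such a patch could be applied is below; the site-packages path and patch file name are assumptions, not taken from the PR.

```dockerfile
FROM python:3.9.8-slim

# `patch` is not in the slim image; assumption: Debian base
RUN apt-get update && apt-get install -y --no-install-recommends patch \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir dbt-bigquery==1.8.1

# Apply the local patch over the installed package's batch.py.
# The module path below is an assumption and may differ by version.
COPY batch.py.patch /tmp/batch.py.patch
RUN cd /usr/local/lib/python3.9/site-packages/dbt/adapters/bigquery \
    && patch batch.py /tmp/batch.py.patch
```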
Checklist