Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry dataproc transient errors #1275

Closed

Conversation

OSalama
Copy link

@OSalama OSalama commented Jul 8, 2024

resolves #1271
docs dbt-labs/docs.getdbt.com/# N/A

Problem

Python models that are configured to use dataproc serverless jobs will fail when the polling API call returns a transient 5xx error. The actual dataproc batch job will still run to completion successfully, and data becomes available in the table, but all downstream models will get skipped because dbt thinks it failed.

Solution

This PR solves the problem by adding exponential retry to the API polling calls, which will retry all transient API errors.

It also increases the time between polling calls while the dataproc job is still in a "pending" state. This is because Dataproc serverless batch jobs take at least 60 seconds to start up, which leads to a lot of additional polling, and increases the chance of getting a transient error.
This increase in time does mean that any models that normally complete within 10 seconds of the Dataproc job transitioning from "pending" to "running" will see increased runtime. In practice I would be very surprised if this were to cause any issues, as this is only impacting jobs that are submitted via dataproc serverless, which are expected to run longer than interactive/dedicated cluster methods.

I have run this in our production environment for 2 weeks with no further issues.
I generated a patch file by checking out 9efeb859d1ac21cab1bb6441acc06ec1328e4888 (release 1.8.1), modifying batch.py with my changes, then git diff batch.py > batch.py.patch
I then use the below Dockerfile to apply the patch on top of Python 3.9.8 & dbt-bigquery 1.8.1.

FROM python:3.9.8-slim-buster
RUN apt-get update -q && apt-get install -q -y patch
RUN pip install dbt-bigquery==1.8.1
COPY docker/batch.py.patch /tmp/batch.py.patch
RUN patch /usr/local/lib/python3.9/site-packages/dbt/adapters/bigquery/dataproc/batch.py < /tmp/batch.py.patch

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@OSalama OSalama requested a review from a team as a code owner July 8, 2024 17:45
Copy link

cla-bot bot commented Jul 8, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Omar Salama.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Dataproc serverless batch jobs take at least 60 seconds to
start up, which leads to a lot of meaningless polling, and
increases the chance of getting a transient error.
In the absolute worst case this will add 9 seconds to
a model's runtime, but I would be very surprised if
people are using Python models to do less than
10 seconds of processing, and doubt there will be
any real world impact by this change
@OSalama OSalama force-pushed the retry-dataproc-transient-errors branch from bf8d404 to 869ce02 Compare July 8, 2024 17:50
@cla-bot cla-bot bot added the cla:yes label Jul 8, 2024
@OSalama
Copy link
Author

OSalama commented Jul 8, 2024

I am finding it hard to reproduce the issue in a test environment where I'm just running a handful of Python models, but the production environment hits the problem daily (10s/100s of concurrent Python models). I'll be applying this patch to our prod environment for our next daily run, and if all goes well, I'll check off the relevant testing checklist items.

@colin-rogers-dbt colin-rogers-dbt added the community This PR is from a community member label Jul 11, 2024
@OSalama
Copy link
Author

OSalama commented Jul 29, 2024

I've been running this in our production environment for a couple weeks and have not seen any more cases of Python models incorrectly marked as failed.

I have run this in our production environment for 2 weeks with no further issues.
I generated a patch file by checking out 9efeb859d1ac21cab1bb6441acc06ec1328e4888 (release 1.8.1), modifying batch.py with my changes, then git diff batch.py > batch.py.patch
I then use the below Dockerfile to apply the patch on top of Python 3.9.8 & dbt-bigquery 1.8.1.

FROM python:3.9.8-slim-buster
RUN apt-get update -q && apt-get install -q -y patch
RUN pip install dbt-bigquery==1.8.1
COPY docker/batch.py.patch /tmp/batch.py.patch
RUN patch /usr/local/lib/python3.9/site-packages/dbt/adapters/bigquery/dataproc/batch.py < /tmp/batch.py.patch

@mikealfare
Copy link
Contributor

Thanks for the PR, all the research, and context @OSalama! I wound up running into this same issue when addressing retries in general in dbt-biguqery. I'll add you to the changelog on that PR (#1395) since you identified an issue and a fix, and because we really appreciate the time and effort you spend on this. In the meantime, feel free to comment on that PR if you feel it's not achieving the goal you set out for with this change. Fair warning, we're looking to merge the other PR soon. If there's still a gap, we can always submit another change. I'm going to close this one for now though.

@mikealfare mikealfare closed this Nov 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes community This PR is from a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Dataproc (Python models) do not retry polling on 5xx errors
3 participants