Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize DbtVirtualenvBaseOperator: Single virtualenv per task execution #1200

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

kesompochy
Copy link

Description

This PR optimizes the DbtVirtualenvBaseOperator by implementing virtualenv reuse within a single task execution. It reduces the overhead of creating new virtualenvs for each dbt command.

The DbtVirtualenvBaseOperator in [email protected] creates a temporary directory and prepares a virtualenv twice when virtualenv_dir is None and is_virtualenv_dir_temporary is True. This PR modifies it to create a directory and a virtualenv only once at the beginning of the run_command method, avoiding this overhead.

Additionally, I have added tests to ensure the directory for virtualenv will be deleted after the task execution. This is related to the issue reported in #958.

The changes include:

  • Reusing virtualenv in a single task execution
  • Improving temporary directory management
  • Adding tests to ensure proper handling of virtualenv directories (deletion or persistence)

Related Issue(s)

#958

Breaking Change?

I believe this is not a breaking change.

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

- Reuse virtualenv in single task execution to reduce creation overhead
- Improve temporary directory management to use TemporaryDirectory when virtualenv_dir is set to None
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Sep 9, 2024
Copy link

netlify bot commented Sep 9, 2024

Deploy Preview for sunny-pastelito-5ecb04 canceled.

Name Link
🔨 Latest commit 3b97cb5
🔍 Latest deploy log https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/66f4fc138560530008fe2601

@dosubot dosubot bot added area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc area:performance Related to performance, like memory usage, CPU usage, speed, etc execution:virtualenv Related to Virtualenv execution environment labels Sep 9, 2024
Copy link

codecov bot commented Sep 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.75%. Comparing base (243acfd) to head (3b97cb5).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1200      +/-   ##
==========================================
+ Coverage   95.72%   95.75%   +0.03%     
==========================================
  Files          64       64              
  Lines        3672     3675       +3     
==========================================
+ Hits         3515     3519       +4     
+ Misses        157      156       -1     
Flag Coverage Δ
95.75% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Sep 24, 2024
cmd: list[str],
env: dict[str, str | bytes | os.PathLike[Any]],
context: Context,
) -> FullOutputSubprocessResult | dbtRunnerResult:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this fail when dbt is not installed in the Airflow worker node?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your feedback!

This PR intends to install dbt within the virtual environment by running pip install. I have also added tests to verify that the pip install command is executed within the prepared virtual environment, so that dbt is available even if it's not pre-installed on the Airflow worker node.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc area:performance Related to performance, like memory usage, CPU usage, speed, etc execution:virtualenv Related to Virtualenv execution environment size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants