---
slug: dlt-dbt-runner-on-cloud-functions
title: "Comparison running dbt-core and dlt-dbt runner on Google Cloud Functions"
image: https://storage.googleapis.com/dlt-blog-images/dlt-dbt-runner-on-cloud-functions.png
authors:
  name: Aman Gupta
  title: Junior Data Engineer
  url: https://github.com/dat-a-man
  image_url: https://dlt-static.s3.eu-central-1.amazonaws.com/images/aman.png
tags: [dbt, dlt-dbt-runner, cloud functions, ETL, data modeling]
---

:::info
TL;DR: This article compares deploying dbt-core standalone with using the dlt-dbt runner on Google Cloud Functions. The comparison covers various aspects, along with a step-by-step deployment guide.
:::

dbt or “data build tool” has become a standard for transforming data in analytical environments.
Most data pipelines nowadays start with ingestion and finish with running a dbt package.

dlt or “data load tool” is an open-source Python library for easily creating data ingestion
pipelines. And of course, after ingesting the data, we want to transform it into an analytical
model. For this reason, dlt offers a dbt runner that can run a dbt package on top of the data
dlt just loaded, without any additional setup such as dbt credentials.

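Here is the gist of that pattern as a minimal sketch; the pipeline settings and package path are placeholders, and the full cloud function version appears later in this post:

```python
import dlt

# A pipeline that has already loaded data into the destination
pipeline = dlt.pipeline(
    pipeline_name="user_data_pipeline",
    destination="bigquery",
    dataset_name="dlt_dbt_test",
)

# Reuse the pipeline's destination credentials to run a dbt package
dbt = dlt.dbt.package(pipeline, "./dbt_transform", venv=dlt.dbt.get_venv(pipeline))
for model in dbt.run_all():
    print(model.model_name, model.status)
```
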
### Using dbt in Google Cloud Functions

To use dbt in cloud functions, we employed two methods:

1. `dbt-core` on GCP cloud functions.
1. `dlt-dbt runner` on GCP cloud functions.

Let’s discuss these methods one by one.

### 1. Deploying dbt-core on Google Cloud Functions

Let’s dive into running dbt-core on cloud functions.

You should use this option for scenarios where you have already collected and housed your data in a
data warehouse, and you need further transformations or modeling of the data. This is a good option
if you have used dbt before and want to leverage the power of dbt-core. If you are new to dbt, please
refer to the dbt documentation: [Link Here.](https://docs.getdbt.com/docs/core/installation-overview)

Let’s start with setting up the following directory structure:

```text
dbt_setup
|-- main.py
|-- requirements.txt
|-- profiles.yml
|-- dbt_project.yml
|-- dbt_transform
    |-- models
    |   |-- model1.sql
    |   |-- model2.sql
    |   |-- sources.yml
    |-- (other dbt related contents, if required)
```

> You can set up the contents of the `dbt_transform` folder by initializing a new dbt project; for details,
> refer to the
> [documentation.](https://docs.getdbt.com/reference/commands/init#:~:text=When%20using%20dbt%20init%20to,does%20not%20exist%20in%20profiles.)

:::note
We recommend setting up and testing dbt-core locally before using it in cloud functions.
:::

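If you want to script that local check, here is a small sketch along the lines of the cloud function below; it assumes `dbt-core` and `dbt-bigquery` are installed and that you run it from the folder containing `dbt_project.yml` and `profiles.yml`:

```python
import os
import subprocess

# Point dbt at the profiles.yml in the current folder (adjust to your layout)
os.environ["DBT_PROFILES_DIR"] = os.getcwd()

# "dbt debug" verifies the profile and warehouse connection, "dbt run" builds the models
for cmd in (["dbt", "debug"], ["dbt", "run"]):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    result.check_returncode()  # raises CalledProcessError if dbt reported a failure
```
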
**To run dbt-core on GCP cloud functions:**

1. Once you've tested the dbt-core package locally, update the `profiles.yml` before migrating the
   folder to the cloud function as follows:

```yaml
dbt_gcp: # project name
  target: dev # environment
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: please_set_me_up! # your GCP project name
      dataset: please_set_me_up! # your project dataset name
      threads: 4
      impersonate_service_account: please_set_me_up! # GCP service account
```

> This service account should have BigQuery read and write permissions.

1. Next, modify the `main.py` as follows:

```python
import os
import subprocess
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)

def run_dbt(request):
    try:
        # Set your dbt profiles directory (assuming it's in /workspace)
        os.environ['DBT_PROFILES_DIR'] = '/workspace/dbt_transform'

        # Change into the dbt project directory
        dbt_project_dir = '/workspace/dbt_transform'
        os.chdir(dbt_project_dir)

        # Log the current working directory and list files
        logging.info(f"Current working directory: {os.getcwd()}")
        logging.info(f"Files in the current directory: {os.listdir('.')}")

        # Run dbt command (e.g., dbt run)
        result = subprocess.run(
            ['dbt', 'run'],
            capture_output=True,
            text=True
        )

        # Surface failures instead of returning partial logs with a success status
        if result.returncode != 0:
            logging.error(f"dbt run failed:\n{result.stderr}")
            return f"dbt run failed:\n{result.stdout}\n{result.stderr}"

        # Return dbt output
        return result.stdout

    except Exception as e:
        logging.error(f"Error running dbt: {str(e)}")
        return f"Error running dbt: {str(e)}"
```

1. Next, list runtime-installable modules in `requirements.txt`:

```text
dbt-core
dbt-bigquery
```

1. Finally, you can deploy the function using the gcloud CLI as:

```shell
gcloud functions deploy YOUR_FUNCTION_NAME \
--gen2 \
--region=YOUR_REGION \
--runtime=python310 \
--source=YOUR_SOURCE_LOCATION \
--entry-point=YOUR_CODE_ENTRYPOINT \
TRIGGER_FLAGS
```

> You also have the option to deploy the function via the GCP Cloud Functions GUI.

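Once deployed, one way to trigger the function from Python looks like the sketch below; it assumes the default IAM-protected HTTP trigger, and the URL is a placeholder you get from the deploy output:

```python
import requests
import google.auth.transport.requests
from google.oauth2 import id_token

# Placeholder: use the URL printed by gcloud after a successful deploy
FUNCTION_URL = "https://YOUR_REGION-YOUR_PROJECT.cloudfunctions.net/YOUR_FUNCTION_NAME"

# IAM-protected functions expect an identity token minted for the function's URL
auth_request = google.auth.transport.requests.Request()
token = id_token.fetch_id_token(auth_request, FUNCTION_URL)

response = requests.post(FUNCTION_URL, headers={"Authorization": f"Bearer {token}"})
print(response.status_code, response.text)
```
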
### 2. Deploying the function using the dlt-dbt runner

The second option is running dbt using the data load tool (dlt).

I work at dlthub and often create dlt pipelines. These often need dbt for modeling the data, making
the dlt-dbt combination highly effective. To use this combination on cloud functions, we used the
[dlt-dbt runner](https://dlthub.com/docs/api_reference/helpers/dbt/runner#create_runner) developed
at dlthub.

The main reason I use this runner is that I load data with dlt and can re-use dlt’s connection to
the warehouse to run my dbt package, saving me the time and code complexity I’d need to set up and
run dbt standalone.

To integrate dlt and dbt in cloud functions, use the dlt-dbt runner; here’s how:

1. Let’s start by creating the following directory structure:

```text
dbt_setup
|-- main.py
|-- requirements.txt
|-- dbt_project.yml
|-- dbt_transform
    |-- models
    |   |-- model1.sql
    |   |-- model2.sql
    |   |-- sources.yml
    |-- (other dbt related contents, if required)
```

> You can set up dbt by initializing a new project; for details, refer to the
> [documentation](https://docs.getdbt.com/reference/commands/init#:~:text=When%20using%20dbt%20init%20to,does%20not%20exist%20in%20profiles.).

:::note
With the dlt-dbt runner configuration, setting up a `profiles.yml` is unnecessary. dlt seamlessly
shares credentials with dbt, and on Google Cloud Functions, it automatically retrieves service
account credentials if none are provided.
:::

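If you ever need to supply the credentials explicitly, for example when developing outside GCP, my understanding is that dlt can also read them from environment variables following its `SECTION__SUBSECTION__KEY` naming convention. A hedged sketch with a service account key (the variable names reflect that convention and the values are placeholders):

```python
import os

# Hypothetical explicit configuration via environment variables; on Cloud Functions
# with a service account attached, none of this is needed.
os.environ["DESTINATION__BIGQUERY__CREDENTIALS__PROJECT_ID"] = "your-gcp-project"
os.environ["DESTINATION__BIGQUERY__CREDENTIALS__CLIENT_EMAIL"] = "sa-name@your-gcp-project.iam.gserviceaccount.com"
os.environ["DESTINATION__BIGQUERY__CREDENTIALS__PRIVATE_KEY"] = "-----BEGIN PRIVATE KEY-----\n..."
```
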
1. Next, configure the `dbt_project.yml` and set the model directory, for example:

```yaml
model-paths: ["dbt_transform/models"]
```

1. Next, configure the `main.py` as follows:

```python
import dlt
import logging, json
from flask import jsonify
from dlt.common.runtime.slack import send_slack_message

def run_pipeline(request):
    """
    Set up and execute a data processing pipeline, returning its status
    and model information.

    This function initializes a dlt pipeline with pre-defined settings,
    runs the pipeline with a sample dataset, and then applies dbt
    transformations. It compiles and returns the information about
    each dbt model's execution.

    Args:
        request: The Flask request object. Not used in this function.

    Returns:
        Flask Response: A JSON response with the pipeline's status
        and dbt model information.
    """
    try:
        # Sample data to be processed
        data = [{"name": "Alice Smith", "id": 1, "country": "Germany"},
                {"name": "Carlos Ruiz", "id": 2, "country": "Romania"},
                {"name": "Sunita Gupta", "id": 3, "country": "India"}]

        # Initialize a dlt pipeline with specified settings
        pipeline = dlt.pipeline(
            pipeline_name="user_data_pipeline",
            destination="bigquery",
            dataset_name="dlt_dbt_test"
        )

        # Run the pipeline with the sample data
        pipeline.run(data, table_name="sample_data")

        # Apply dbt transformations and collect model information
        models = transform_data(pipeline)
        model_info = [
            {
                "model_name": m.model_name,
                "time": m.time,
                "status": m.status,
                "message": m.message
            }
            for m in models
        ]

        # Convert the model information to a string
        model_info_str = json.dumps(model_info)

        # Send the model information to Slack
        send_slack_message(
            pipeline.runtime_config.slack_incoming_hook,
            model_info_str
        )

        # Return a success response with model information
        return jsonify({"status": "success", "model_info": model_info})

    except Exception as e:
        # Log and return an error response in case of any exceptions
        logging.error(f"Error in running pipeline: {e}", exc_info=True)
        return jsonify({"status": "error", "error": str(e)}), 500


def transform_data(pipeline):
    """
    Execute dbt models for data transformation within a dlt pipeline.

    This function packages and runs all dbt models associated with the
    pipeline, applying defined transformations to the data.

    Args:
        pipeline (dlt.Pipeline): The pipeline object for which dbt
            transformations are run.

    Returns:
        list: A list of dbt model run information, indicating the
            outcome of each model.

    Raises:
        Exception: If there is an error in running the dbt models.
    """
    try:
        # Initialize dbt with the given pipeline and virtual environment
        dbt = dlt.dbt.package(
            pipeline,
            "/workspace/dbt_transform",
            venv=dlt.dbt.get_venv(pipeline)
        )
        logging.info("Running dbt models...")
        # Run all dbt models and return their run information
        return dbt.run_all()

    except Exception as e:
        # Log and re-raise any errors encountered during dbt model execution
        logging.error(f"Error in running dbt models: {e}", exc_info=True)
        raise


# Main execution block
if __name__ == "__main__":
    # Execute the pipeline function.
    run_pipeline(None)
```

1. The `send_slack_message` function is used to send messages to Slack, triggered by
   both success and error events. For setup instructions, please refer to the official
   [documentation here.](https://dlthub.com/docs/running-in-production/running#using-slack-to-send-messages)

   > `RUNTIME__SLACK_INCOMING_HOOK` was set up as an environment variable for the above code.

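For local testing you can set that variable in code or in your shell before invoking the function; on Cloud Functions, set it as an environment variable (or secret) at deploy time. A tiny sketch with a placeholder webhook URL:

```python
import os

# Placeholder webhook URL; dlt's runtime config resolves this as runtime.slack_incoming_hook,
# which is what pipeline.runtime_config.slack_incoming_hook reads in the code above.
os.environ["RUNTIME__SLACK_INCOMING_HOOK"] = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
```
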
1. Next, list runtime-installable modules in `requirements.txt`. Since `main.py` imports dlt,
   include it alongside the dbt packages:

```text
dlt[bigquery]
dbt-core
dbt-bigquery
```

1. Finally, you can deploy the function using the gcloud CLI as:

```shell
gcloud functions deploy YOUR_FUNCTION_NAME \
--gen2 \
--region=YOUR_REGION \
--runtime=python310 \
--source=YOUR_SOURCE_LOCATION \
--entry-point=YOUR_CODE_ENTRYPOINT \
TRIGGER_FLAGS
```

The merit of this method is that it can load and transform data in a single function. Using dlt
for data loading and dbt for modeling makes dlt-dbt a killer combination for data engineers
and scientists, and it is my preferred choice. This method is especially effective for batched data and
event-driven pipelines with small to medium workloads. For larger data loads nearing timeout limits,
consider separating dlt and dbt into different cloud functions, as sketched below.

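As a sketch of how that split could look (names, sample data, and settings here are illustrative, and each function would be deployed with its own `--entry-point`), the two functions can share the same pipeline name and dataset, with one entry point doing only the load and the other only the dbt run:

```python
import dlt

SAMPLE_DATA = [{"name": "Alice Smith", "id": 1, "country": "Germany"}]

def load_data(request):
    # Function 1: ingestion only
    pipeline = dlt.pipeline(
        pipeline_name="user_data_pipeline",
        destination="bigquery",
        dataset_name="dlt_dbt_test",
    )
    pipeline.run(SAMPLE_DATA, table_name="sample_data")
    return "load finished"

def run_dbt_models(request):
    # Function 2: dbt transformations only, attached to the same dataset
    pipeline = dlt.pipeline(
        pipeline_name="user_data_pipeline",
        destination="bigquery",
        dataset_name="dlt_dbt_test",
    )
    dbt = dlt.dbt.package(
        pipeline, "/workspace/dbt_transform", venv=dlt.dbt.get_venv(pipeline)
    )
    return str([(m.model_name, m.status) for m in dbt.run_all()])
```
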
> For more info on using the `dlt-dbt runner`, please refer to the
> [official documentation by clicking here.](https://dlthub.com/docs/api_reference/helpers/dbt/runner#dbtpackagerunner-objects)

### Deployment considerations: How do cloud functions compare to GitHub Actions?

At dlthub we already natively support deploying to GitHub Actions, enabling you to have a serverless setup with a 1-command deployment.

GitHub Actions is an orchestrator that most would not find suitable for a data warehouse setup - but
it certainly could do the job for a minimalistic setup. GitHub Actions provides 2000 free minutes per
month, so if our pipelines run for 66 minutes per day, we fit in the free tier. If our pipelines
took another 1h per day, we would need to pay ~15 USD/month for the smallest machine (2 vCPUs), but you
can see how that would get expensive if we wanted to run it continuously or had multiple pipelines always-on in parallel.

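A quick back-of-envelope check of those numbers (the per-minute rate is an assumption based on GitHub's list price for the standard 2 vCPU Linux runner; prices may change):

```python
free_minutes_per_month = 2000
days_per_month = 30
print(free_minutes_per_month / days_per_month)  # ~66 free minutes per day

extra_minutes_per_day = 60                       # one more hour of runtime per day
assumed_rate_usd_per_minute = 0.008              # assumed list price for a 2 vCPU Linux runner
print(extra_minutes_per_day * days_per_month * assumed_rate_usd_per_minute)  # ~14.4 USD/month
```
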
Cloud functions are serverless, lightweight computing solutions that can handle small computational
workloads and are cost-effective. dbt doesn't require much computing power on the machine itself
because it pushes the transformations down to the data warehouse. This makes
running dbt-core on cloud functions a good choice. The free tier would suffice for about 1.5h per
day of running a 1 vCPU and 2 GB RAM machine, and if we wanted an additional 1h
per day for this hardware, it would cost us around 3-5 USD/month.

![DLT-DBT-RUNNER_IMAGE](https://storage.googleapis.com/dlt-blog-images/dlt-dbt-runner-on-cloud-functions.png)

When deploying dbt-core on cloud functions, there are certain constraints to keep in mind. For instance,
there is a 9-minute time-out limit for all 1st Gen functions. For 2nd Gen functions, there is a 9-minute
limit for event-driven functions and a 60-minute limit for HTTP functions. Since dbt works on the processing
power of the data warehouse it's operating on, 60 minutes is sufficient for most cases with small to medium
workloads. However, it is important to remember the 9-minute cap when using event-driven functions.

### Conclusion

When creating lightweight pipelines, using the two tools together on one cloud function makes a lot
of sense, simplifying the setup process and the handover between loading and transformation.

However, for more resource-intensive pipelines, we might want to improve resource utilisation by
separating the dlt loading from the dbt running, because while dbt’s run speed is determined by the
database, dlt can utilize the cloud function’s hardware resources.

When it comes to setting up just a dbt package to run on cloud functions, I guess it comes down to
personal preference: I prefer dlt as it simplifies credential management. It automatically shares
credentials with dbt and, on Google Cloud Functions, retrieves service account credentials when none
are provided. I also used dlt’s
[Slack error reporting function](https://dlthub.com/docs/running-in-production/running#using-slack-to-send-messages)
that sends success and error notifications from your runs directly to your Slack channel,
helping me manage and monitor my runs.