Support for Python models in dbt-spark #407
2 comments · 4 replies
-
Hi @jtcohen6, let me start with where I stand on Python models: at first sight, I am not a fan. I like that dbt is constrained to SQL; I foresee more misuse once Python is introduced. I also think Python models clutter the code base and therefore make it less maintainable. However ... let's put that aside. I want you to know my initial thoughts, and I am open to being convinced otherwise.

I do see some use cases. For example, prediction/evaluation with machine learning models: those are essentially transformations of input data models into an output data model. (I would constrain it to prediction/evaluation at first and leave training to another system.)

I had a look at #377 and gave some feedback there. It makes sense to me to get a first version out, and even though it is Databricks-only, that's OK. @pgoslatara and I had a discussion about the implementation. Our biggest question is: over which endpoint do we send the Python code? This depends on how Spark is hosted. The first version uses jobs in Databricks and assumes an interactive cluster. Spark is hosted in many ways, and I do not know all the protocols that exist or how/when to assume what is and isn't enabled. Could we extend the …
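(For readers skimming this thread: a dbt Python model, as proposed, is a `model()` function that receives the dbt context plus a Spark session and returns a DataFrame. Here is a minimal sketch of the prediction use case mentioned above; the model name, feature table, and MLflow URI are illustrative placeholders, not anything from #377:)

```python
# models/predict_churn.py -- sketch of a dbt Python model (dbt >= 1.3 style)

def model(dbt, session):
    dbt.config(materialized="table")

    # Upstream dbt model, handed to us as a PySpark DataFrame
    features = dbt.ref("customer_features")  # illustrative model name

    # Scoring only; training is assumed to live in another system.
    # Load a pre-trained MLflow model as a Spark UDF and apply it.
    import mlflow.pyfunc
    predict = mlflow.pyfunc.spark_udf(
        session, model_uri="models:/churn/Production"  # illustrative URI
    )
    feature_cols = [c for c in features.columns if c != "customer_id"]

    # The returned DataFrame is what dbt materializes as the table
    return features.withColumn("churn_score", predict(*feature_cols))
```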
-
I find the current implementation of dbt-spark's table materialization, where it finally writes the dataframe, very raw and not at all ready for production. What about delegating the dataframe write to the user? Either make … What do you think? I know this is very different from the current design.
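(To make the contrast concrete, here is a sketch of the two approaches: today's dbt-managed write versus a user-owned write. The second `model()` shape is hypothetical, not an existing dbt API, and the table/column names are placeholders:)

```python
from pyspark.sql import DataFrame

# Roughly what the materialization does today: dbt takes the DataFrame
# returned by the model and performs one fixed, opinionated write.
def dbt_managed_write(df: DataFrame, target_table: str) -> None:
    df.write.mode("overwrite").format("delta").saveAsTable(target_table)

# The delegation idea, as I read it: the model owns the write, so users
# control format, mode, partitioning, merge logic, etc.
# Hypothetical shape -- not an actual dbt API:
def model(dbt, session) -> None:
    df = dbt.ref("events")  # illustrative upstream model
    (df.write
       .mode("append")
       .partitionBy("event_date")
       .saveAsTable(str(dbt.this)))  # dbt.this: this model's target relation
```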
-
We are planning to merge #377 into the `main` branch of `dbt-spark`, for inclusion in our upcoming beta release (v1.3.0b1) of Python model functionality.

The initial implementation uses a Databricks-specific API, which is not available to users of open source Apache Spark. I want to explain our reasoning behind doing this.
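(For context on what "a Databricks-specific API" means in practice, here is an illustrative sketch: submit the compiled Python model as a one-off run via the Databricks Jobs API 2.1 and poll it to completion. This is not the exact code in #377; the host, token, and paths are placeholders. None of these endpoints exist on open source Apache Spark, which is exactly the portability gap discussed here:)

```python
import time
import requests

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <token>"}      # placeholder

def run_python_model(notebook_path: str, cluster_id: str) -> str:
    # Submit a one-off run of the compiled model against an existing
    # all-purpose ("interactive") cluster.
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/runs/submit",
        headers=HEADERS,
        json={
            "run_name": "dbt Python model",
            "tasks": [{
                "task_key": "dbt_python_model",
                "existing_cluster_id": cluster_id,
                "notebook_task": {"notebook_path": notebook_path},
            }],
        },
    )
    resp.raise_for_status()
    run_id = resp.json()["run_id"]

    # Poll until the run reaches a terminal lifecycle state.
    while True:
        run = requests.get(
            f"{HOST}/api/2.1/jobs/runs/get",
            headers=HEADERS,
            params={"run_id": run_id},
        ).json()
        state = run["state"]["life_cycle_state"]
        if state in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return run["state"].get("result_state", state)
        time.sleep(5)
```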
**Why not a generic implementation, for all users of Apache Spark?**

This is where members of the `dbt-spark` community, many of whom are much savvier in Apache Spark development than we are, might be interested in helping out :)

**Why the code in `dbt-spark`, rather than `dbt-databricks`?**

There isn't yet full parity between `dbt-spark` and `dbt-databricks`, and `dbt-databricks` isn't yet supported in dbt Cloud. If the code lives in `dbt-spark`, it can be available to all Databricks users, whether they're using `dbt-spark` or `dbt-databricks` today.

**Longer term**

In time, we believe that:

- `dbt-spark` should contain "baseline" functionality supported in all Apache Spark deployments
- `dbt-databricks` should be the rightful home for Databricks-specific functionality/implementations

Once there's parity across both plugins, in terms of both their capabilities and their ease of access, we will work with the Databricks team to develop a plan that:

- deprecates functionality in the `dbt-spark` plugin that is unique to the Databricks runtime
- migrates that functionality into `dbt-databricks`
- removes it from `dbt-spark`, with clear messaging to encourage relevant users to adopt the dedicated `dbt-databricks` plugin

The `dbt-spark` adapter will continue to serve users of OSS Apache Spark, and as a foundation for "extension" plugins that serve dedicated Spark runtimes. That includes `dbt-databricks` today, and it could also include more-recent projects such as `dbt-spark-livy` (!).
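(To illustrate the "extension plugin" pattern: a runtime-specific plugin imports the baseline Spark adapter from `dbt-spark` and subclasses it. A simplified sketch of that shape; see the actual repos for the real class hierarchy:)

```python
# Simplified shape of an "extension" plugin building on dbt-spark:
# inherit baseline Spark behavior, override runtime-specific pieces.
from dbt.adapters.spark.impl import SparkAdapter

class DatabricksAdapter(SparkAdapter):
    # Runtime-specific behavior (e.g. how Python models are submitted)
    # lives here; everything else is inherited from the Spark baseline.
    ...
```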