Support for Python models in dbt-spark #407
2 comments · 4 replies
-
Hi @jtcohen6, let me start with where I stand on Python models: at first sight, I am not a fan. I like that dbt is constrained to SQL; I foresee more misuse once Python is introduced. I also think Python models clutter the code base and therefore make it less maintainable. However ... let's put that aside. I want you to know my initial thoughts, and I am open to being convinced otherwise.

I do see some use cases. For example, prediction/evaluation with machine learning models: those are essentially transformations of input data models into an output data model. (I would constrain it to prediction/evaluation at first and leave training to another system.)

I had a look at #377 and gave some feedback there. It makes sense to me to get a first version out, and even though it is Databricks-only, that's OK. @pgoslatara and I had a discussion about the implementation. Our biggest question is: over which endpoint do we send the Python code? This depends on how Spark is hosted. The first version uses jobs in Databricks and assumes an interactive cluster. Spark is hosted in many ways, and I do not know all the protocols that exist or how/when to assume what is and isn't enabled. Could we extend the …
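(For readers skimming this thread: a dbt Python model, as proposed, is a `model()` function that receives the dbt context plus a Spark session and returns a DataFrame. Here is a minimal sketch of the prediction use case mentioned above; the model name, feature table, and MLflow URI are illustrative placeholders, not anything from #377:)

```python
# models/predict_churn.py -- sketch of a dbt Python model (dbt >= 1.3 style)

def model(dbt, session):
    dbt.config(materialized="table")

    # Upstream dbt model, handed to us as a PySpark DataFrame
    features = dbt.ref("customer_features")  # illustrative model name

    # Scoring only; training is assumed to live in another system.
    # Load a pre-trained MLflow model as a Spark UDF and apply it.
    import mlflow.pyfunc
    predict = mlflow.pyfunc.spark_udf(
        session, model_uri="models:/churn/Production"  # illustrative URI
    )
    feature_cols = [c for c in features.columns if c != "customer_id"]

    # The returned DataFrame is what dbt materializes as the table
    return features.withColumn("churn_score", predict(*feature_cols))
```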
-
I find the current implementation of dbt-spark's table materialization, where it finally writes the dataframe, very raw and not at all ready for production. What about delegating the dataframe write to the user? Either make … What do you think? I know this is very different from the current design.
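(To make the contrast concrete, here is a sketch of the two approaches: today's dbt-managed write versus a user-owned write. The second `model()` shape is hypothetical, not an existing dbt API, and the table/column names are placeholders:)

```python
from pyspark.sql import DataFrame

# Roughly what the materialization does today: dbt takes the DataFrame
# returned by the model and performs one fixed, opinionated write.
def dbt_managed_write(df: DataFrame, target_table: str) -> None:
    df.write.mode("overwrite").format("delta").saveAsTable(target_table)

# The delegation idea, as I read it: the model owns the write, so users
# control format, mode, partitioning, merge logic, etc.
# Hypothetical shape -- not an actual dbt API:
def model(dbt, session) -> None:
    df = dbt.ref("events")  # illustrative upstream model
    (df.write
       .mode("append")
       .partitionBy("event_date")
       .saveAsTable(str(dbt.this)))  # dbt.this: this model's target relation
```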
-
We are planning to merge #377 into the `main` branch of `dbt-spark`, for inclusion in our upcoming beta release (v1.3.0b1) of Python model functionality.

The initial implementation uses a Databricks-specific API, which is not available to users of open source Apache Spark. I want to explain our reasoning behind doing this.
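(For context on what "a Databricks-specific API" means in practice, here is an illustrative sketch: submit the compiled Python model as a one-off run via the Databricks Jobs API 2.1 and poll it to completion. This is not the exact code in #377; the host, token, and paths are placeholders. None of these endpoints exist on open source Apache Spark, which is exactly the portability gap discussed here:)

```python
import time
import requests

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <token>"}      # placeholder

def run_python_model(notebook_path: str, cluster_id: str) -> str:
    # Submit a one-off run of the compiled model against an existing
    # all-purpose ("interactive") cluster.
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/runs/submit",
        headers=HEADERS,
        json={
            "run_name": "dbt Python model",
            "tasks": [{
                "task_key": "dbt_python_model",
                "existing_cluster_id": cluster_id,
                "notebook_task": {"notebook_path": notebook_path},
            }],
        },
    )
    resp.raise_for_status()
    run_id = resp.json()["run_id"]

    # Poll until the run reaches a terminal lifecycle state.
    while True:
        run = requests.get(
            f"{HOST}/api/2.1/jobs/runs/get",
            headers=HEADERS,
            params={"run_id": run_id},
        ).json()
        state = run["state"]["life_cycle_state"]
        if state in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return run["state"].get("result_state", state)
        time.sleep(5)
```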
**Why not a generic implementation, for all users of Apache Spark?**

This is where members of the `dbt-spark` community, many of whom are much savvier in Apache Spark development than we are, might be interested in helping out :)

**Why the code in `dbt-spark`, rather than `dbt-databricks`?**

There isn't yet full parity between `dbt-spark` and `dbt-databricks`, and `dbt-databricks` isn't yet supported in dbt Cloud. If the code lives in `dbt-spark`, it can be available to all Databricks users, whether they're using `dbt-spark` or `dbt-databricks` today.

**Longer term**

In time, we believe that:

- `dbt-spark` should contain "baseline" functionality supported in all Apache Spark deployments
- `dbt-databricks` should be the rightful home for Databricks-specific functionality/implementations

Once there's parity across both plugins, in terms of both their capabilities and their ease of access, we will work with the Databricks team to develop a plan that:

- deprecates functionality in the `dbt-spark` plugin that is unique to the Databricks runtime
- migrates that functionality into `dbt-databricks`
- removes it from `dbt-spark`, with clear messaging to encourage relevant users to adopt the dedicated `dbt-databricks` plugin

The `dbt-spark` adapter will continue to serve users of OSS Apache Spark, and as a foundation for "extension" plugins that serve dedicated Spark runtimes. That includes `dbt-databricks` today, and it could also include more-recent projects such as `dbt-spark-livy` (!).
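(To illustrate the "extension plugin" pattern: a runtime-specific plugin imports the baseline Spark adapter from `dbt-spark` and subclasses it. A simplified sketch of that shape; see the actual repos for the real class hierarchy:)

```python
# Simplified shape of an "extension" plugin building on dbt-spark:
# inherit baseline Spark behavior, override runtime-specific pieces.
from dbt.adapters.spark.impl import SparkAdapter

class DatabricksAdapter(SparkAdapter):
    # Runtime-specific behavior (e.g. how Python models are submitted)
    # lives here; everything else is inherited from the Spark baseline.
    ...
```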