Evaluating the Kedro and Databricks workflow #1653
Recruiting email
Introduction
We're moving into a new phase of evaluating how to better support platform-based Kedro workflows. Based on usage, the highest-priority platform to investigate is Databricks. Our goal is to determine what an optimal development and deployment experience would look like using the different parts of Databricks.
How can you help?
We're conducting interviews to understand pain points, and we will also ask for your participation in a later survey. We'll run a hackathon with Databricks to address some of the findings from this research. We will ask a series of questions in the interview section and will need help understanding:
Why did you get this invite?
You are receiving this invite because you either indicated you were interested in helping us with this study, or we observed questions or comments about Databricks on our support channel. |
Interview questions
Introduction
Workflow on the last project in which you used an IDE, Kedro and Databricks together
Databricks
Use of Kedro on Databricks
Workflow
Conclusion
|
@yetudada additional interview questions may be:
|
@yetudada is this issue mainly intended as administration or would you also like user responses here? |
The GitHub issue is meant to capture and summarise the research. However, would you be up for an interview with the team? We'd love to hear what you want to say! And thank you so much for reaching out! |
@yetudada Sure! Let me know when and where! |
Hit me with your email address? I'll get things organised 🥳 |
Hello Kedro People, was this evolved somewhere/somehow? |
Hi everyone, I'm also just coming across this issue and being a fan of using kedro, I'm now looking into how I can use it in an azure databricks environment. I could be wrong but since I'm not able to develop locally, dbx won't work for my use case. Eager to see conclusions of this workstream! @yetudada or the kedro team - I'm already on the kedro discord channel. Could you please let me know if there's someplace I can pick up the discussion? Thanks! |
Hi everyone! I've just come across this issue. I've used kedro and databricks in my last QB project and would be happy to help with the interviews. |
Hi everyone! Do you know if there was an evolution related to dbx? |
Hi @Baukebrenninkmeijer, @vitoravancini, @roumail, @carlaprv and @mlussati. You can head through to #2105, which summarises this research; I'll therefore close this issue. Please also join our Slack workspace; there are a few users of |
Introduction
We've entered the battleground of ML development workflows: a notebook-driven approach versus one primarily written in an IDE such as VS Code or PyCharm. ⚔️
Why should ML developers use an IDE instead of a notebook to develop their data and ML pipelines?
The notebook-driven approach struggles when it has to produce a code base that needs to be maintained. With notebooks, it is also challenging to write tests and documentation, leverage version control systems, sort out a pipeline's running order and collaborate with others.
You don't need to take our word for this but rather reflect on the perspective of the Databricks Labs team that are aware of this problem:
The Databricks Labs team have piloted multiple projects to allow users to leverage an IDE-based workflow, including CI/CD templates, Databricks Connect, dbx, Databricks Repos and the Databricks CLI.
Why does this affect Kedro?
Kedro suggests an IDE-based workflow and proposes that notebooks are suitable for prototyping, not the final code base. This tension often reveals itself when non-Kedro users primarily rely on Jupyter notebooks and when Kedro users interact with notebook-driven platforms like Databricks.
Why should we care?
We have a growing category of Kedro users that rely on Databricks to scale their data and machine-learning pipelines. We have also seen 341 queries related to Databricks on our Q&A forums; by comparison, there are 37 queries about AWS SageMaker. These figures come from a term search of our internal Slack channel and open Discord forum.
Our Databricks deployment documentation is also the most viewed in the deployment series. We also have some qualitative evidence to suggest that the current development and deployment experience is a barrier to adopting Kedro in organisations that rely on Databricks.
What is the scope of our work?
This exercise aims to define a seamless development and deployment experience for Kedro users on Databricks. We will target ML developers that prefer an IDE-based workflow, use Databricks to support their PySpark workflows and leverage Kedro to author their data and ML software. Users that rely solely on a notebook-based approach are out of scope; we are exploring ways to help that group in our improvements to the IPython and Jupyter Notebook workflows.
The first part of our work will focus on a research assignment to understand the landscape of our users' problems with the IDE workflow on Databricks and with leveraging Kedro on Databricks. We'll use interviews to get an initial lay of the land, a survey to source quantitative data, and observation (screen recordings) or role-play studies to reproduce workflow errors.
What are we hoping to understand?
At the end of this research study:
This work will feed into the Kedro backlog and a hackathon that we will plan with the Databricks team.
Who will we be speaking to?