Evaluating the Kedro and Databricks workflow #1653

yetudada · 2022-06-28T19:17:40Z

Introduction

We've entered the battleground of ML development workflows: a notebook-driven approach versus one built primarily in an IDE such as VS Code or PyCharm. ⚔️

Why should ML developers use an IDE instead of a notebook to develop their data and ML pipelines?

The notebook-driven approach struggles when the goal is a code base that needs to be maintained. With notebooks, it is also hard to write tests and documentation, use version control systems, work out a pipeline's running order and collaborate with others.

You don't need to take our word for this; consider the perspective of the Databricks Labs team, who are aware of this problem:

"As projects on Databricks grow larger, Databricks users may struggle to keep up with the numerous notebooks containing ETL, data science experimentation, dashboards etc. While there are various short-term workarounds, such as using the %run command to call other notebooks from within your current notebook, it's useful to follow traditional software engineering best practices of separating reusable code from pipelines calling that code. Additionally, building tests around your pipelines to verify that the pipelines are also working is another important step toward production-grade development processes."

The Databricks Labs team have piloted multiple projects to allow users to leverage an IDE-based workflow, including CI/CD templates, Databricks Connect, dbx, Databricks Repos and the Databricks CLI.

Why does this affect Kedro?

Kedro suggests an IDE-based workflow and proposes that notebooks are suitable for prototyping, not the final code base. This tension often reveals itself when non-Kedro users primarily rely on Jupyter notebooks and when Kedro users interact with notebook-driven platforms like Databricks.

Why should we care?

We have a growing category of Kedro users who rely on Databricks to scale their data and machine-learning pipelines. We have also seen 341 queries related to Databricks on our Q&A forums; by comparison, there are 37 queries about AWS SageMaker. These figures come from a term search of our internal Slack channel and open Discord forum.

Our Databricks deployment documentation is also the most viewed in the deployment series. We also have some qualitative evidence to suggest that the current development and deployment experience is a barrier to adopting Kedro in organisations that rely on Databricks.

What is the scope of our work?

This exercise aims to define a seamless development and deployment experience for Kedro users on Databricks. We will target ML developers who prefer an IDE-based workflow, use Databricks to support their PySpark workflows and leverage Kedro to author their data and ML software. Users who rely solely on a notebook-based approach are out of scope; we are exploring ways to help that group through our improvements to the IPython and Jupyter Notebook workflows.

The first part of our work will focus on a research assignment to understand the landscape of our users' problems related to the IDE workflow on Databricks and leveraging Kedro on Databricks. We'll use interviews to get an initial lay-of-the-land, a survey to source quantitative data and observation (screen recordings) or role-play studies to reproduce workflow errors.

What are we hoping to understand?

At the end of this research study:

We should have a prioritised list of pain points according to the following categories:
- The IDE workflow in Databricks,
- Kedro on Databricks,
- And potentially even just Kedro issues.

We should also understand our users' key workflows and any workarounds they may have created, and be able to communicate the value of using Kedro and Databricks together.

This work will feed into the Kedro backlog and a hackathon that we are planning with the Databricks team.

Who will we be speaking to?

Maria Olivia Lihn
Debanjan Banerjee
Diana Montanes
Roman Drapeko
Nishant Kumar-NKC
Saravanakumar Subramaniam
Benjamin Levy
Eduardo Coronado
Ingo Walz
Danny Farah
Poornima Ponthagani
Avaneesh Yembadi
Anil Chouldary
Logan Rupert
@marioFeynman
@Malaguth
@wolvez

yetudada · 2022-07-01T08:47:43Z

Recruiting email

Introduction

We're moving into a new phase evaluating how to support platform-based Kedro workflows better. The highest priority platform to investigate based on usage is Databricks. Our goal is to determine what an optimal development and deployment experience would look like using the different parts of Databricks.

How can you help?

We're conducting interviews to understand pain points, and we will also ask for your participation in a later survey. We'll run a hackathon with Databricks to address some of the findings from this research.

We will ask a series of questions in the interview section and will need help understanding:

The reasons why you chose to use Kedro and Databricks together;
Your Kedro/Databricks workflow in the last project where you used the two tools together;
Pain points, errors or workarounds you ran into - it is helpful if you can provide stack traces, screenshots or even recordings;
Potential improvements for Kedro and even Databricks.

Why did you get this invite?

You are receiving this invite because you either indicated you were interested in helping us with this study, or we observed questions or comments about Databricks on our support channel.

yetudada · 2022-07-01T09:23:22Z

Interview questions

Introduction

Who are you, and what do you do at your company?
Can you tell me about the last three pieces of work you have been involved in?
- Have you used Databricks in your previous three projects?
- Have you used Kedro in your previous three projects?
Can you describe the last project you used Kedro on Databricks?
What version of Kedro did you use?
[Bonus] Would you be in a position to show us this project so that we can walk through it together?

I might verify if Q5 is possible before the interview.

Workflow on the last project where you used an IDE, Kedro and Databricks together

Databricks

What is your workflow with an IDE on Databricks?
Which parts of Databricks did you use on this project, and why?
Did you use Databricks with any other cloud platform tooling, e.g. Azure or AWS, on this project? And if "yes", what other parts did you use?

Use of Kedro on Databricks

Why did you use Kedro and Databricks together on this project?
What is your overall experience using Kedro on Databricks on this project?
- Did you run into any errors or challenges using Kedro on Databricks for this project?
- How did you solve these problems?
What should we do to improve the Kedro/Databricks workflow, and why?
If we improve this Kedro/Databricks workflow based on your recommendation, would you use Kedro on Databricks in the future?
Have you tried to use Kedro-Viz on this project?

Workflow

Can you describe what steps you took to set up your Kedro project on Databricks for this project?
Can you describe the steps you took when you changed your code base?
Can you describe what steps you took when you wanted to release a new version of your code?

Conclusion

What other challenges have you encountered with the Kedro and Databricks workflow?
Is there anything else you want to mention?

NeroOkwa · 2022-07-05T11:24:40Z

@yetudada additional interview questions may be:

What should we do to improve the Kedro/Databricks workflow?
If we improve this Kedro/Databricks workflow based on your recommendation, would you use Kedro on Databricks in the future?

Baukebrenninkmeijer · 2022-07-07T09:06:54Z

@yetudada is this issue mainly intended as administration or would you also like user responses here?

yetudada · 2022-07-07T13:08:07Z

> @yetudada is this issue mainly intended as administration or would you also like user responses here?

The GitHub issue is meant to capture and summarise the research. However, would you be up for an interview with the team? We'd love to hear what you want to say! And thank you so much for reaching out!

Baukebrenninkmeijer · 2022-07-15T13:25:49Z

@yetudada Sure! Let me know when and where!

yetudada · 2022-07-15T13:34:20Z

> @yetudada Sure! Let me know when and where!

Hit me with your email address? I'll get things organised 🥳

Baukebrenninkmeijer · 2022-07-26T08:02:11Z

@yetudada [email protected]

vitoravancini · 2022-10-25T17:08:24Z

Hello Kedro people, has this evolved somewhere or somehow?
I'm very interested in this topic. At the company I work at, we've been using kedro-connect with fair success, but it seems Databricks won't continue it. Has anyone tried with dbx?

roumail · 2022-10-27T11:42:18Z

Hi everyone, I'm also just coming across this issue and being a fan of using kedro, I'm now looking into how I can use it in an azure databricks environment. I could be wrong but since I'm not able to develop locally, dbx won't work for my use case. Eager to see conclusions of this workstream!

@yetudada or the kedro team - I'm already on the kedro discord channel. Could you please let me know if there's someplace I can pick up the discussion? Thanks!

carlaprv · 2022-11-24T22:30:32Z

Hi everyone! I've just come across this issue. I've used Kedro and Databricks in my last QB project and would be happy to help with the interviews.

mlussati · 2022-12-06T12:00:37Z

Hi everyone! Do you know if there was an evolution related to dbx?

yetudada · 2022-12-08T13:17:55Z

Hi @Baukebrenninkmeijer, @vitoravancini, @roumail, @carlaprv and @mlussati. You can head through to #2105, which summarises this research; therefore, I'll close this issue.

Please also join our Slack workspace; there are a few users of dbx there too that you can talk to: https://slack.kedro.org/

yetudada added the Stage: User Research 🔬 label Jun 28, 2022

yetudada added this to Roadmap Jun 28, 2022

yetudada moved this to Now - Discovery or Research in Roadmap Jun 28, 2022

yetudada mentioned this issue Jul 27, 2022

Evaluating Kedro-Viz adoption kedro-org/kedro-viz#987

Closed

yetudada mentioned this issue Jan 26, 2023

Snowflake Data Connectors (SnowPark) kedro-org/kedro-plugins#108

Closed

yetudada moved this from Discovery or Research - Now ⏳ to Shipped 🚀 in Roadmap Nov 28, 2022

yetudada mentioned this issue Dec 8, 2022

Research synthesis on evaluating the Kedro and Databricks workflow #2105

Closed

yetudada closed this as completed Dec 8, 2022
