Evaluating the Kedro and Databricks workflow #1653

Closed
yetudada opened this issue Jun 28, 2022 · 13 comments
Labels
Stage: User Research 🔬 Ticket needs to undergo user research before implementation

Comments

@yetudada
Contributor

yetudada commented Jun 28, 2022

Introduction

We've entered the battleground of ML development workflows: a notebook-driven approach versus one built primarily in an IDE such as VS Code or PyCharm. ⚔️

Why should ML developers use an IDE instead of a notebook to develop their data and ML pipelines?

The notebook-driven approach struggles when the goal is a code base that needs to be maintained. With notebooks, it is also hard to write tests and documentation, use version control systems, work out a pipeline's running order and collaborate with others.

You don't need to take our word for this; consider the perspective of the Databricks Labs team, who are aware of this problem:

"As projects on Databricks grow larger, Databricks users may struggle to keep up with the numerous notebooks containing ETL, data science experimentation, dashboards etc. While there are various short-term workarounds, such as using the %run command to call other notebooks from within your current notebook, it's useful to follow traditional software engineering best practices of separating reusable code from pipelines calling that code. Additionally, building tests around your pipelines to verify that the pipelines are also working is another important step toward production-grade development processes."
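
As a rough sketch of the separation described in the quote above (plain Python, with hypothetical function names that come from neither Kedro nor Databricks), reusable logic lives in ordinary functions, the pipeline only wires them together, and a test exercises the pipeline without a notebook:

```python
# Hypothetical illustration: none of these names come from Kedro or Databricks.

def clean_rows(rows):
    """Reusable logic: drop records that have no 'amount' value."""
    return [row for row in rows if row.get("amount") is not None]


def total_amount(rows):
    """Reusable logic: sum the 'amount' field across records."""
    return sum(row["amount"] for row in rows)


def run_pipeline(rows):
    """Pipeline: only wires the reusable steps together, in order."""
    return total_amount(clean_rows(rows))


def test_run_pipeline_ignores_missing_amounts():
    """A test that can run in CI, outside any notebook."""
    rows = [{"amount": 10}, {"amount": None}, {"amount": 5}]
    assert run_pipeline(rows) == 15
```

In a notebook-only workflow, the equivalent logic would typically be spread across cells or pulled in with %run, which makes a test like the one above harder to write and automate.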

The Databricks Labs team have piloted multiple projects to allow users to leverage an IDE-based workflow, including CI/CD templates, Databricks Connect, dbx, Databricks Repos and the Databricks CLI.

Why does this affect Kedro?

Kedro suggests an IDE-based workflow and proposes that notebooks are suitable for prototyping, not the final code base. This tension often reveals itself when non-Kedro users primarily rely on Jupyter notebooks and when Kedro users interact with notebook-driven platforms like Databricks.

Why should we care?

We have a growing category of Kedro users who rely on Databricks to scale their data and machine-learning pipelines. We have also seen 341 queries related to Databricks on our Q&A forums, compared with 37 queries about AWS SageMaker. These figures come from a term search of our internal Slack channel and open Discord forum.

Our Databricks deployment documentation is also the most viewed in the deployment series. We also have some qualitative evidence to suggest that the current development and deployment experience is a barrier to adopting Kedro in organisations that rely on Databricks.

What is the scope of our work?

This exercise aims to define a seamless development and deployment experience for Kedro users on Databricks. We will target ML developers who prefer an IDE-based workflow, use Databricks to support their PySpark workflows and leverage Kedro to author their data and ML software. Users who rely solely on a notebook-based approach are out of scope; we are exploring ways to help that group through our improvements to the IPython and Jupyter Notebook workflows.

The first part of our work will focus on a research assignment to understand the landscape of our users' problems with the IDE workflow on Databricks and with using Kedro on Databricks. We'll use interviews to get an initial lay of the land, a survey to source quantitative data and observation (screen recordings) or role-play studies to reproduce workflow errors.

What are we hoping to understand?

At the end of this research study:

  • We should have a prioritised list of pain points according to the following categories:
    • The IDE workflow in Databricks,
    • Kedro on Databricks,
    • And potentially even just Kedro issues;
  • We should also understand our users' key workflows and any workarounds they may have created;
  • We should be able to communicate the value of using Kedro and Databricks together.

This work will feed into the Kedro backlog and a hackathon that we will plan with the Databricks team.

Who will we be speaking to?

  • Maria Olivia Lihn
  • Debanjan Banerjee
  • Diana Montanes
  • Roman Drapeko
  • Nishant Kumar-NKC
  • Saravanakumar Subramaniam
  • Benjamin Levy
  • Eduardo Coronado
  • Ingo Walz
  • Danny Farah
  • Poornima Ponthagani
  • Avaneesh Yembadi
  • Anil Chouldary
  • Logan Rupert
  • @marioFeynman
  • @Malaguth
  • @wolvez
@yetudada yetudada added the Stage: User Research 🔬 Ticket needs to undergo user research before implementation label Jun 28, 2022
@yetudada yetudada added this to Roadmap Jun 28, 2022
@yetudada yetudada moved this to Now - Discovery or Research in Roadmap Jun 28, 2022
@yetudada
Contributor Author

yetudada commented Jul 1, 2022

Recruiting email

Introduction

We're moving into a new phase of evaluating how to better support platform-based Kedro workflows. Based on usage, the highest-priority platform to investigate is Databricks. Our goal is to determine what an optimal development and deployment experience would look like using the different parts of Databricks.

How can you help?

We're conducting interviews to understand pain points, and we will also ask for your participation in a later survey. We'll run a hackathon with Databricks to address some of the findings from this research.

We will ask a series of questions in the interview section and will need help understanding:

  • The reasons why you chose to use Kedro and Databricks together;
  • Your Kedro/Databricks workflow in the last project in which you used the two tools together;
  • Pain points, errors or workarounds you ran into - it is helpful if you can provide stack traces, screenshots or even recordings;
  • Potential improvements for Kedro and even Databricks.

Why did you get this invite?

You are receiving this invite because you either indicated you were interested in helping us with this study, or we observed questions or comments about Databricks on our support channel.

@yetudada
Contributor Author

yetudada commented Jul 1, 2022

Interview questions

Introduction

  1. Who are you, and what do you do at your company?
  2. Can you tell me about the last three pieces of work you have been involved in?
    • Have you used Databricks in your previous three projects?
    • Have you used Kedro in your previous three projects?
  3. Can you describe the last project in which you used Kedro on Databricks?
  4. What version of Kedro did you use?
  5. [Bonus] Would you be in a position to show us this project so that we can walk through it together?

I might verify if Q5 is possible before the interview.

Workflow on the last project in which you used an IDE, Kedro and Databricks together

Databricks

  1. What is your workflow with an IDE on Databricks?
  2. Which parts of Databricks did you use on this project, and why?
  3. Did you use Databricks with any other cloud platform tooling, e.g. Azure or AWS, on this project? And if "yes", what other parts did you use?

Use of Kedro on Databricks

  1. Why did you use Kedro and Databricks together on this project?
  2. What is your overall experience using Kedro on Databricks on this project?
    • Did you run into any errors or challenges using Kedro on Databricks for this project?
    • How did you solve these problems?
  3. What should we do to improve the Kedro/Databricks workflow, and why?
  4. If we improve this Kedro/Databricks workflow based on your recommendation, would you use Kedro on Databricks in the future?
  5. Have you tried to use Kedro-Viz on this project?

Workflow

  1. Can you describe what steps you took to set up your Kedro project on Databricks for this project?
  2. Can you describe the steps you took when you changed your code base?
  3. Can you describe what steps you took when you wanted to release a new version of your code?

Conclusion

  1. What other challenges have you encountered with the Kedro and Databricks workflow?
  2. Is there anything else you want to mention?

@NeroOkwa
Contributor

NeroOkwa commented Jul 5, 2022

@yetudada additional interview questions may be:

  1. What should we do to improve the Kedro/Databricks workflow?
  2. If we improve this Kedro/Databricks workflow based on your recommendation, would you use Kedro on Databricks in the future?

@Baukebrenninkmeijer

@yetudada is this issue mainly intended as administration or would you also like user responses here?

@yetudada
Contributor Author

yetudada commented Jul 7, 2022

> @yetudada is this issue mainly intended as administration or would you also like user responses here?

The GitHub issue is meant to capture and summarise the research. However, would you be up for an interview with the team? We'd love to hear what you want to say! And thank you so much for reaching out!

@Baukebrenninkmeijer

@yetudada Sure! Let me know when and where!

@yetudada
Contributor Author

> @yetudada Sure! Let me know when and where!

Hit me with your email address? I'll get things organised 🥳

@Baukebrenninkmeijer

@yetudada [email protected]

@vitoravancini

Hello Kedro people, has this evolved somewhere or somehow?
I'm very interested in this topic. At the company I work at, we've been using kedro-connect with fair success, but it seems Databricks won't continue it. Has anyone tried dbx?

@roumail

roumail commented Oct 27, 2022

Hi everyone, I'm also just coming across this issue and being a fan of using kedro, I'm now looking into how I can use it in an azure databricks environment. I could be wrong but since I'm not able to develop locally, dbx won't work for my use case. Eager to see conclusions of this workstream!

@yetudada or the kedro team - I'm already on the kedro discord channel. Could you please let me know if there's someplace I can pick up the discussion? Thanks!

@carlaprv
Contributor

Hi everyone! I've just come across this issue. I've used kedro and databricks in my last QB project and would be happy to help with the interviews.

@yetudada yetudada moved this from Discovery or Research - Now ⏳ to Shipped 🚀 in Roadmap Nov 28, 2022
@mlussati

mlussati commented Dec 6, 2022

Hi everyone! Do you know if there was an evolution related to dbx?

@yetudada
Contributor Author

yetudada commented Dec 8, 2022

Hi @Baukebrenninkmeijer, @vitoravancini, @roumail, @carlaprv and @mlussati. You can head through to #2105, which summarises this research; therefore, I'll close this issue.

Please also join our Slack workspace; there are a few dbx users there that you can talk to: https://slack.kedro.org/

@yetudada yetudada closed this as completed Dec 8, 2022