Minor improvements in the IPython and Jupyter Notebook workflows #1075

Closed
merelcht opened this issue Nov 29, 2021 · 14 comments

@merelcht
Member

merelcht commented Nov 29, 2021

Context

From our experience in supporting our users as well as from simply reading our guide on the integration with IPython and Jupyter, we know that there are a number of challenges for users to work with Kedro from notebooks.

  • There are many ways to do the same thing
  • The .ipython/ folder in our projects makes our templates more cluttered and incomprehensible
  • It is harder to maintain backwards compatibility when our IPython/Jupyter workflow relies on template code under .ipython/
  • Our kedro ipython, kedro jupyter lab/notebook helpers don't work for managed Jupyter instances
  • For managed Jupyter instances, our users need to manually add extra scripts like ipython_loader.py
  • Our users have reportedly made custom scripts to cater for common workflows like preloading all dataset inputs for a specific node
  • Converting Jupyter Notebook code to Kedro nodes is still primarily done manually despite our kedro jupyter notebook convert CLI command

These challenges are not exhaustive, but they arguably present a significant barrier for Jupyter Notebook users interacting with Kedro and make for an unpleasant experience.

Proposal

To improve the experience without major changes in Kedro, we recently started developing a Kedro IPython extension meant to replace the startup script in the .ipython/ directory. The extension already has full feature parity with the startup script for IPython sessions, and after 7613dec it will be the primary way our IPython/Jupyter users interact with Kedro.

As next steps, I suggest that we aim for the following unified workflow based entirely on our IPython extension:

IPython

If the user can start the session themselves:
cd <kedro-project-root>/
ipython --ext="kedro.extras.extensions.ipython"
If the user is in an existing IPython session they cannot or do not want to restart:
In [1]: %load_ext kedro.extras.extensions.ipython
In [2]: %reload_kedro <path_to_project_root>

Jupyter

For Jupyter, there will be only one way to load the extension and that will happen per notebook:

In [1]: %load_ext kedro.extras.extensions.ipython
In [2]: %reload_kedro <path_to_project_root>

This should work for both local Jupyter setup and managed Jupyter instances.
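For readers unfamiliar with how `%load_ext` finds its entry point, here is a minimal sketch of an extension module. It is not Kedro's actual implementation: `_load_project_objects` and its return value are hypothetical placeholders, while `load_ipython_extension`, `register_magic_function` and `push` are the standard IPython hooks for this.

```python
# Hypothetical sketch of an IPython extension module; not Kedro's actual code.


def _load_project_objects(project_path):
    # Placeholder (assumption): a real extension would build the catalog,
    # context, etc. from the Kedro project found at `project_path`.
    return {"catalog": f"<catalog for {project_path}>"}


def reload_kedro(ipython, project_path="."):
    """Refresh the Kedro variables in the user's interactive namespace."""
    variables = _load_project_objects(project_path)
    ipython.push(variables)  # InteractiveShell.push injects names into the session
    return variables


def load_ipython_extension(ipython):
    """Entry point that IPython calls when `%load_ext <module>` runs."""

    def reload_kedro_magic(line):
        # `line` is whatever the user typed after `%reload_kedro`.
        reload_kedro(ipython, line.strip() or ".")

    ipython.register_magic_function(
        reload_kedro_magic, magic_kind="line", magic_name="reload_kedro"
    )
    # Load once up front so the variables exist right after %load_ext.
    reload_kedro(ipython)
```

Because the magic is registered inside `load_ipython_extension`, the same module works whether it is loaded from the command line or from a notebook cell.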

IPython and Jupyter with preloaded Kedro extension

A new Kedro command should be created which is meant to be run once and enable Kedro's extension in the user's ~/.ipython/ folder. All Jupyter and IPython sessions started after this will have the Kedro IPython extension preloaded.

kedro ipython-init

The command will be a top-level command, without the need of an existing Kedro project. The name of the command is up for debate.

After this, Kedro projects will no longer need to have an .ipython/ folder in them.
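A rough sketch of what such a one-off command might do, assuming it works by appending a line to the default profile's `ipython_config.py`. The function names are hypothetical and the real command may take a different approach. `disable_extension` illustrates the "undo" counterpart raised later in the discussion.

```python
from pathlib import Path

EXTENSION = "kedro.extras.extensions.ipython"
# Line appended to the profile config; IPython provides `c` when it
# executes ipython_config.py.
CONFIG_LINE = f'c.InteractiveShellApp.extensions.append("{EXTENSION}")\n'


def _config_path(ipython_dir: Path) -> Path:
    return ipython_dir / "profile_default" / "ipython_config.py"


def enable_extension(ipython_dir: Path) -> Path:
    """Append the extension to the default profile's config (idempotent)."""
    config = _config_path(ipython_dir)
    config.parent.mkdir(parents=True, exist_ok=True)
    existing = config.read_text() if config.exists() else ""
    if CONFIG_LINE not in existing:
        if existing and not existing.endswith("\n"):
            existing += "\n"  # keep our line on its own row
        config.write_text(existing + CONFIG_LINE)
    return config


def disable_extension(ipython_dir: Path) -> None:
    """Undo enable_extension by removing the line it added."""
    config = _config_path(ipython_dir)
    if config.exists():
        config.write_text(config.read_text().replace(CONFIG_LINE, ""))
```

Calling `enable_extension(Path.home() / ".ipython")` would affect every IPython and Jupyter session for that user, which is the "polluting the environment" trade-off discussed in the comments.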

Future

Once we have successfully migrated the community away from the old way of interacting with Kedro from IPython and Jupyter, we can continue the development of the plugin and add the following capabilities

Running an IPython session with preloaded datasets for a node

After running this in a Kedro project

kedro ipython --node example_node

we can preload the datasets which are inputs to this node, thus allowing the user to debug their pipeline at a particular node. This functionality is something already in use by internal teams, although they have their own scripts to facilitate it.
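The core of that preloading step could look roughly like the sketch below. It only assumes that the pipeline exposes `nodes` with `name` and `inputs`, and that the catalog has a `load(name)` method. The stand-in objects and dataset values are hypothetical so the example runs without a Kedro project.

```python
from types import SimpleNamespace


def preload_node_inputs(pipeline, catalog, node_name):
    """Return {dataset_name: loaded_data} for the inputs of one node."""
    for node in pipeline.nodes:
        if node.name == node_name:
            return {name: catalog.load(name) for name in node.inputs}
    raise ValueError(f"Node {node_name!r} not found in pipeline")


# Stand-ins so the sketch runs without a Kedro project (hypothetical data).
class _DictCatalog:
    def __init__(self, data):
        self._data = data

    def load(self, name):
        return self._data[name]


demo_pipeline = SimpleNamespace(
    nodes=[SimpleNamespace(name="example_node", inputs=["dataset_1", "dataset_2"])]
)
demo_catalog = _DictCatalog({"dataset_1": [1, 2], "dataset_2": [3], "unused": [9]})
preloaded = preload_node_inputs(demo_pipeline, demo_catalog, "example_node")
```

The resulting dictionary could then be pushed into the interactive namespace, so that only the datasets the node actually consumes are loaded.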

Jupyter extension to allow node editing

Jupyter provides an API for custom content loading. We can use this API and develop a Kedro Jupyter Notebook Server extension, which will allow us to edit nodes from Jupyter and browse them through their Kedro node name rather than their filename. This is what will enable us to integrate Jupyter notebooks in Kedro Lab.

This extension is contingent on the existence of the IPython session with preloaded datasets for a node, which will make for a seamless experience.

@merelcht added the "Type: Discussion" and "pinned" (Issue shouldn't be closed by stale bot) labels on Nov 29, 2021
@merelcht
Member Author

(Comment copied over, originally written by @lorenabalan )

I like this! [...] I have just 2 comments:

  • If we have sth like "ipython init" would it be worth considering having the opposite as well? i.e. giving the user a way to undo their kedro extension so it's not loaded for all ipython sessions. I don't think there are any errors being raised (unless they uninstall kedro before they clear up the extensions), but I'm wondering if it takes up extra time or can be seen as "polluting the environment".
  • The Jupyter extension to allow node editing sounds interesting, but I'm not sure I understand it or what the Jupyter API does. Would you mind giving a bit more detail here? This line in particular was also new to me:

    This is what will enable us to integrate Jupyter notebooks in Kedro Lab.

@merelcht
Member Author

(Comment copied over, originally written by @AntonyMilneQB)

I'm a bit out of the loop here so might have missed some things, but as a frequent Jupyter (ex-?)user myself this is something that I'm very interested in so am going to throw in my opinion anyway... 😬

General reaction is that this sounds super awesome and like a huge improvement 🎉 😀 👍

Jupyter-kedro viz integration

The Jupyter extension to allow node editing sounds interesting, but I'm not sure I understand it or what the Jupyter API does. Would you mind giving a bit more detail here?

@lorenabalan Not speaking for Ivan here since I haven't discussed it with him, but for a long time I've thought that editing node code through Jupyter in kedro viz would be a killer feature. At the moment a very common workflow (myself included) is:

  1. kedro jupyter lab/notebook
  2. Run dataset_1 = catalog.load("dataset_1"); dataset_2 = catalog.load("dataset_2"); ...; dataset_n = catalog.load("dataset_n"). Sometimes I'd hack together a loop to automatically do this for all n datasets whose names match some pattern
  3. Develop code for a node in a long and messy notebook. N.B. the way data scientists do this is quite different from the way that a software engineer might write a pure function. It's a very iterative and interactive process that often involves plotting graphs, inspecting the datasets, etc. rather than just writing some "raw" code.
  4. Tidy up code, export it to a python file (as Ivan said, typically copy & paste rather than kedro jupyter notebook convert)
  5. Run your new node, debug in Jupyter notebook
  6. Go into the notebook again since you'll want to load up the output datasets, plot graphs of them, etc.
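The "hacked together loop" from step 2 might look like the following sketch. It assumes the catalog object offers `list()` (returning dataset names) and `load(name)`. `load_matching` is a hypothetical helper name, not part of Kedro.

```python
import re


def load_matching(catalog, pattern):
    """Load every dataset whose name matches `pattern` (a regular expression)."""
    regex = re.compile(pattern)
    return {name: catalog.load(name) for name in catalog.list() if regex.search(name)}
```

In a notebook, something like `globals().update(load_matching(catalog, r"^dataset_\d+$"))` would then expose each match as a variable, provided the dataset names are valid Python identifiers.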

A much better version of this would be:

  1. (outside scope of this, just a super cool vision for future) Create node graphically directly in kedro viz by clicking "create node" button (or press n on keyboard) and selecting the datasets you want to have as input, entering a node name
  2. Click on the node in kedro viz. Currently we have the code panel that shows the function code; we would now also have something that enables you to enter Jupyter in the context of this node
  3. This would open up a Jupyter instance that has access to all the dataset_1 to dataset_n inputs of that node preloaded as variables
  4. Develop your node code in Jupyter and then click something to export the final code back to the suitable Python module

A while ago @limdauto had a couple of examples of how this could work in practice (Jupyter accessing node code in kedro viz).

A couple of comments for @idanov

  • "All Jupyter and IPython sessions started after kedro ipython-init will have the Kedro IPython extension preloaded." Does this mean the %load_ext part is effectively run already but the %reload_kedro part isn't? If yes then I would definitely support Joel's idea that we keep kedro jupyter as an alias that starts Jupyter and does the %reload_kedro part. Otherwise we're adding an extra step to what is almost certainly the most common use case (user wants to start a kedro jupyter session from within their kedro project directory)
  • On "Running an IPython session with preloaded datasets for a node using kedro ipython --node example_node". Fully support this, but just to point out that it's much more common for a developer to use Jupyter rather than ipython to debug and develop node code, so I would see kedro jupyter lab/notebook --node example_node as more useful. The only time I'd use kedro ipython over kedro jupyter is to do some quick elementary checks on a dataset; any real development work is a million times easier in Jupyter than ipython.

@idanov
Member

idanov commented Feb 11, 2022

To @lorenabalan:

If we have sth like "ipython init" would it be worth considering having the opposite as well? i.e. giving the user a way to undo their kedro extension so it's not loaded for all ipython sessions. I don't think there are any errors being raised (unless they uninstall kedro before they clear up the extensions), but I'm wondering if it takes up extra time or can be seen as "polluting the environment".

It is polluting the environment indeed, but unfortunately I couldn't find an "environment aware" IPython configuration, without having to change the command starting IPython. And unfortunately when Jupyter starts IPython, we have very little control over how it starts IPython, which environment variables it sets or which command-line arguments it passes to IPython. Your suggestion to have a counter-command to kedro ipython-init makes sense. It could be something like kedro ipython --setup and then kedro ipython --restore-defaults or something like that.

The Jupyter extension to allow node editing sounds interesting, but I'm not sure I understand it or what the Jupyter API does. Would you mind giving a bit more detail here?

Sure, will provide some mockup video recordings in a following comment.

To @AntonyMilneQB, kedro jupyter is a fairly useless alias for people running things on already provisioned Jupyter instances like Databricks or just some EC2 with managed Jupyter Lab that users don't start themselves. Our current version of the extension has this line which already tries to locate the project path itself, so the second line is not needed as well if the extension is preloaded. Maybe it's not needed even when you load it within a notebook, as long as this notebook is located within the project (I can't recall how notebooks determine where's the current working directory).
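One way an extension can locate the project path on its own is to walk up from the current working directory until it finds a marker file. The sketch below assumes `pyproject.toml` is that marker. The function name is hypothetical, and the actual line referenced above may use a different rule.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "pyproject.toml"):
    """Return the closest ancestor of `start` (inclusive) containing `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None  # no project found; the caller decides how to report this
```

With this approach, a notebook stored in a subfolder such as `notebooks/` would still resolve to the project root, as long as the kernel's working directory is somewhere inside the project.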

For your second point about the usefulness of editing a node, in order to make this possible for Jupyter, we should make it possible for IPython first, because Jupyter is in a way just a frontend to IPython sessions.

As for authoring pipelines in Kedro Viz, I think we will eventually get there, but it will require quite a bit of time investment. The suggestions here are meant to be low-hanging fruit or at least things that will not break the current flow too much.

@idanov
Member

idanov commented Feb 11, 2022

Here's what I meant by creating a custom ContentsManager in Jupyter. Here's what we can do currently in Kedro Viz:

pipeline-explorer-viz.mov

And here's what we could have if we create a custom ContentsManager and start Jupyter Lab in a Kedro project:

pipeline-explorer-jupyter.mov

As you can see, instead of folders, we can show pipeline namespaces, and instead of files, we can show node names and directly edit them. Making this work will get us very close to enabling the same directly in Kedro Viz, which will be a very nice addition and make Data Science workflows much easier than what we have currently.
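The "namespaces as folders, nodes as files" mapping can be illustrated without any Jupyter machinery. The sketch below only builds the tree a custom ContentsManager would need to present. It assumes namespaced node names are dot-separated, and the example names are hypothetical. It does not implement the Jupyter ContentsManager API itself.

```python
def build_node_tree(node_names):
    """Group dotted node names into a nested dict: namespaces -> node names.

    Assumes no node name doubles as a namespace prefix of another node.
    """
    tree = {}
    for full_name in sorted(node_names):
        *namespaces, leaf = full_name.split(".")
        level = tree
        for namespace in namespaces:
            level = level.setdefault(namespace, {})  # "folder" per namespace
        level[leaf] = full_name  # "file" entry maps back to the full node name
    return tree


example_tree = build_node_tree(
    ["data_science.train_model", "data_science.evaluate", "preprocess"]
)
```

Here `example_tree` has one "folder" (`data_science`) holding two nodes and one top-level node (`preprocess`).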

@daniel-falk
Contributor

This looks really nice! I will take a closer look at it later, for now I just have some comments about the current startup script for ipython (and notebooks). Will these issues be solved with the refactoring?

The first problem I have had is that if my catalog file contains an invalid dataset specifier and I start the ipython terminal, everything looks fine (except that the kedro specific variables are not specified in the help text). Trying to use e.g. catalog will result in a warning that the variable catalog is not defined. This is very confusing. Sometimes I also do not react to this directly, so I start importing other libraries etc. in the interpreter and only much later realize that it did not initialize correctly. See example:
https://github.com/daniel-falk/kedro-video-example/tree/invalid-catalog

The same thing happens if I have defined my own datasets and there is an exception raised in the dataset code (e.g. a missing import which is not installed).
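One possible way to make such failures visible is to report each variable that could not be created instead of silently leaving it undefined. This is only a sketch of the idea with hypothetical names, not how the startup script or the extension currently behaves.

```python
import logging

logger = logging.getLogger(__name__)


def load_variables_or_explain(loaders):
    """Run each loader; report failures instead of silently skipping them.

    `loaders` maps a variable name (e.g. "catalog") to a zero-argument callable.
    Returns a tuple (loaded_variables, errors).
    """
    loaded, errors = {}, {}
    for name, loader in loaders.items():
        try:
            loaded[name] = loader()
        except Exception as exc:  # broad on purpose: dataset code can raise anything
            errors[name] = exc
            logger.error("Kedro variable %r could not be loaded: %s", name, exc)
    return loaded, errors
```

The caller could then print a summary of `errors` at session start, so a broken catalog entry or a missing import is noticed immediately.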

An even more confusing thing that I experienced was that the file .ipython/profile_default/ipython_config.py was somehow lost/corrupt and the .ipython directory was added to my .gitignore file (does not seem to be the case if I create a new kedro project now). This caused the python and jupyter interpreters to not have the %reload_kedro command (but it was still there in the help text). This was extremely hard to debug and it took me a long while to figure out that I needed to create a new dummy project to get the file, copy it to my .ipython folder and remove the directory from the .gitignore file so that I could check it into git. Perhaps this issue was caused at some point when upgrading kedro? Anyhow the issue is the same if the full .ipython folder is missing. See example here:
https://github.com/daniel-falk/kedro-video-example/tree/missing-magic

@yetudada
Contributor

I'm encouraged by the direction of this work! We can create a seamless Kedro/Jupyter notebook workflow, from starting a session through to supporting our users' debugging workflow.

I have left some comments. I also have a particular question that I will ask upfront: "How do we get around polluting users' IPython environments with Kedro?"

Starting a Kedro-instance of a Jupyter notebook with .ipython, kedro ipython, kedro jupyter lab/notebook and the ipython_loader.py

@idanov summarised quite a few issues when users start a Kedro-instance of a Jupyter notebook. Unfortunately, there are too many ways to accomplish this task. We assumed that users would use kedro ipython and kedro jupyter lab/notebook which may not be correct according to telemetry data. Success here would be finding a way to provide a single workflow to allow our users to do this.

Supporting a Notebook-driven debugging workflow

The primary use case for Kedro and Jupyter notebook users is debugging node outputs. So it's great to see the proposition for running an IPython session with preloaded datasets for a node.

kedro jupyter convert

User feedback and telemetry data suggest that we should deprecate and remove this command. kedro jupyter convert has been run 42 times since the 1st of September 2021. It's worth understanding the original objectives of the command:

  • Encourage users to apply convention or a template to their Jupyter notebooks to ease conversion into a Kedro project. This was an idea to address user adoption of Kedro in instances where users still wanted to use Jupyter notebooks as a primary development tool.
  • Automate the workflow of copy+pasting nodes into their Kedro project; but as @AntonyMilneQB has remarked, he just copies the code

Kedro-Viz & Jupyter integration

Users had a lot of concerns about this integration in our previous user testing with @hamzaoza. I would only be comfortable exploring this journey if we attempted another round of user testing.

[Screenshot 2022-02-28 at 14:40:46]

[Screenshot 2022-02-28 at 14:41:16]

Other things to consider

We should probably get rid of kedro activate-nbstripout. And we should probably also have a look at the confusing errors that @daniel-falk has raised. @daniel-falk thank you so much for sharing what trouble you ran into!

@idanov
Member

idanov commented Mar 9, 2022

As a way to progress forward on this one, we should look into the following steps:

  • Drop kedro activate-nbstripout as suggested by @yetudada
  • Drop kedro jupyter as it is rarely used and no longer needed or relevant
  • Drop kedro ipython as it is rarely used and no longer needed or relevant

Those commands provide very little to the user anyway, since they are wrappers around calling ipython or jupyter and are not relevant for managed instances of Jupyter, which is probably the most common way our users use Jupyter (think of Databricks and other managed solutions).

Instead of having those commands, we should make sure that loading the Kedro extension is the only widely known alternative, as well as provide a very small number of steps for this to happen. So a set of other tasks need to be completed:

Some of those changes will be breaking changes and it is probably worth trying to implement them for Kedro 0.18 (to be discussed, since that might require us to add deprecation warnings in a small 0.17.8 release, which is not ideal).

Me and @AntonyMilneQB will turn those steps into issues and put them on our backlog and once we complete them all, we'll revisit this discussion and see how we can build on that to provide even better Jupyter experience.

@antonymilne
Contributor

Thanks for writing all this up Ivan! Very excited by where this is going. Just a few comments:

  • Drop kedro jupyter as it is rarely used and no longer needed or relevant
  • Drop kedro ipython as it is rarely used and no longer needed or relevant

Those commands provide very little to the user anyway, since they are wrappers around calling ipython or jupyter and are not relevant for managed instances of Jupyter, which is probably the most common way our users use Jupyter (think of Databricks and other managed solutions).

Not sure I agree with this. According to the telemetry data, kedro jupyter notebook + kedro jupyter lab + kedro ipython = 664, about 2/3 of the usage of kedro viz. Comparing to kedro viz seems meaningful to me since, as per my comment there, it starts a long-running instance:

Just a quick note on telemetry data for kedro jupyter and kedro ipython - even though these might be executed relatively few times, the number of times the command is run isn't really comparable to the number of times a command like kedro lint is run. kedro jupyter starts a potentially long-running Jupyter session which the user can interact with over an extended period of time, rather than just being a one-off process that exits when complete, like kedro lint. So you could be interacting heavily with Jupyter even though you run the command only once a week.

I do think that exposing kedro jupyter and kedro ipython commands has some other advantages too (it's easily discoverable, makes it feel nice and integrated with kedro, seems more obvious to me that this command would start up Jupyter with kedro variables pre-loaded). Do we have any evidence that managed instances of Jupyter are the most common way users use Jupyter?

@AntonyMilneQB made a good point that the catalog is the only useful one since running pipelines from notebooks is not a common pattern and probably shouldn't be done after we close #1313

I think this is too strong a statement and not quite what I meant. For a start, the pipelines variable I think is actually very useful (I use it all the time myself and can think of many reasons it's valuable). The main one I'd question is session, but as per this discussion I do think we should do some more user research before removing any of these as there could well be good uses for them that I'm not aware of.

Some of those changes will be breaking changes and it is probably worth trying to implement them for Kedro 0.18 (to be discussed, since that might require us to add deprecation warnings in a small 0.17.8 release, which is not ideal).

Not sure about this - as per slack, I don't much like the idea of releasing 0.17.8 just before 0.18.0 just for the purpose of adding some deprecation warnings. One of the main motivations here is "there's only one way to work with jupyter/ipython" I know. But given that the only breaking changes are removing some commands I don't see the need to actually do that so long as their functionality is identical to our new workflow. Here's what I'd propose:

  • we develop the new workflows throughout 0.18. All these are just new bits of functionality rather than breaking changes
  • kedro jupyter and kedro ipython remain in 0.18 but become just thin wrappers for the new workflows so their functionality includes any new functionality (e.g. kedro jupyter runs kedro jupyter-init if required and then jupyter)
  • if we want to remove these commands (which I'm not totally convinced we should) then we add deprecation warnings for them during 0.18.x
  • then, if we want to, remove them in 0.19
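If the commands do end up being deprecated along these lines, the warning step could be as small as a decorator around the thin wrapper. The sketch below uses only the standard library. `deprecated_command`, `kedro_jupyter` and its return value are hypothetical stand-ins, not Kedro's CLI code.

```python
import functools
import warnings


def deprecated_command(replacement, removed_in):
    """Decorator that emits a DeprecationWarning before running the command."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"`{func.__name__}` is deprecated and will be removed in "
                f"{removed_in}; use {replacement} instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated_command(
    replacement="%load_ext kedro.extras.extensions.ipython", removed_in="0.19"
)
def kedro_jupyter():
    # Hypothetical thin wrapper: a real one would delegate to the new workflow.
    return "started jupyter"
```

The wrapper keeps behaving exactly as before, so adding it during 0.18.x would not be a breaking change in itself.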

A couple of questions to check my understanding:

  • if I want to reload my kedro variables (e.g. catalog), how do I do it? In jupyter, would restarting the kernel do this? In ipython, would I just call %load_ext again?
  • do we actually know whether it's possible to have a jupyter kernel automatically load an ipython extension without having to tinker with an ipython profile? This is something we were hoping would be the case when we talked yesterday, but I don't know if it's confirmed or whether we should check it now

@idanov
Member

idanov commented Mar 10, 2022

@AntonyMilneQB Totally makes sense, we can postpone the removal of those commands for 0.19 (if we still think that's needed at that time) and then the only breaking change left is the init_kedro line magic (which is superseded by reload_kedro).

To your questions:

  • reloading Kedro variables happens with the line magic reload_kedro
  • just checked, we can do that the same way we provide the extension to ipython through the command line

@antonymilne added the "Component: Jupyter/IPython" (Issue/PR relevant for Jupyter Notebooks, IPython sessions and the interactive workflow in Kedro) label on Mar 16, 2022
@WolVecz

WolVecz commented Mar 18, 2022

I was asked to add a few things here for ideas:

If a pipeline or node is running in a Notebook (we use databricks so I don't want to specify Jupyter specifically) mode, it would be fantastic if in addition to the normally specified output of a node/pipeline run (defined by the configuration files), that the last node run also retains a memory dataset which can be used for debugging.

Additionally Kedro Viz needs to be edited to work with Partial functions. Right now we cannot use Kedro Viz for a lot of projects because we have various needs for Partial functions. The partial function breaks kedro viz from rendering anything.
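The thread does not say what exactly breaks, but a common source of trouble with `functools.partial` objects is that they lack function metadata such as `__name__`, which tools that introspect node functions may rely on. The sketch below shows that gap and a widely used standard-library workaround. Whether it addresses the Kedro Viz rendering problem is an assumption, and `scale` is a hypothetical node function.

```python
from functools import partial, update_wrapper


def scale(data, factor):
    """Multiply every value by `factor`."""
    return [x * factor for x in data]


double = partial(scale, factor=2)
# A bare partial object carries no function name of its own.
has_name_before = hasattr(double, "__name__")

# Copy __name__, __doc__, etc. from the wrapped function onto the partial.
update_wrapper(double, scale)
```

After `update_wrapper`, `double.__name__` is `"scale"` and the callable still behaves like the partial, so code that reads `func.__name__` no longer fails on it.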

@antonymilne
Contributor

@WolVecz thanks for the comments. On the kedro-viz issue, I think this may actually now be fixed in kedro-org/kedro-viz#692 (not released yet).

@antonymilne removed the "Component: Jupyter/IPython" (Issue/PR relevant for Jupyter Notebooks, IPython sessions and the interactive workflow in Kedro) label on Jun 8, 2022
@astrojuanlu
Member

The conversation here is old and long, but I see there are a few pending tasks in #1075 (comment), plus some ideas on how to add rich integrations of Kedro for Jupyter. For the former, do we want to evaluate any of that for 0.19 @merelcht ? And for the latter, should we consider opening separate issues or browsing the existing ones to see if they capture these ideas?

@astrojuanlu removed the "pinned" (Issue shouldn't be closed by stale bot) label on Sep 8, 2023
@merelcht
Member Author

@astrojuanlu I think we can close this issue. We've done a lot of work here already and have some more specific issues open about debugging the ipython/jupyter workflow. I'd suggest getting those done early next year, but other than that I don't think further improvements are a priority now.

@astrojuanlu
Member

Thanks, closing this as Done then!
