Create a docs page about adding code beyond starter files #3852
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,133 @@ | ||
# Adding code beyond starter files

After you [create a Kedro project](../get_started/new_project.md) and
[add a pipeline](../tutorial/create_a_pipeline.md), you will notice that Kedro generates a
few boilerplate files, such as `nodes.py`, `pipeline.py`, and `pipeline_registry.py`.

While these files may be sufficient for a small project, they quickly become large, hard to
read, and hard to collaborate on as your codebase grows.
They can also give new users the impression that Kedro requires code
to be located only in those starter files, which is not true.

This section explains what Kedro requires in terms of organising code
in files and modules.
It also provides examples of common scenarios, such as sharing utilities between
pipelines and using Kedro in a monorepo setup.

## Where does Kedro look for code

The only technical constraint for arranging code in the project is that the `pipeline_registry.py`
file must be located in the `<your_project>/src/<your_project>` directory, which is where
it is created by default.

This file must have a `register_pipelines()` function that returns a `tp.Dict[str, Pipeline]`
mapping from each pipeline name to the corresponding `Pipeline` object.

Other than that, **Kedro does not impose any constraints on where you should keep files with
`Pipeline`s, `Node`s, or functions wrapped by `node`**.

```{note}
You can make Kedro look for the pipeline registry in a different place by modifying the
`__main__.py` file of your project, but such advanced customisations are out of scope for this section.
```

Because this is the only constraint, you can, for example:

* Add a `utils.py` file to a pipeline folder and import from it the utilities used by multiple
  functions in `nodes.py`.

We usually avoid
@noklam how would you then call a
There is many thing that I don't like about Even if we go with utils. I will strip the
I think it will be great to first introduce the principle of share module, what are the factors to consider. Then you can show this example. https://kedro-org.slack.com/archives/C03RKP2LW64/p1716912123397259

* [Share modules between pipelines](#sharing-modules-between-pipelines).
  These could be utility functions, or a standalone module responsible for
  the domain logic of the industry you work in.

This would be really beneficial if we have something to showcase, maybe there are some projects in Even just the
The example is good, but I think we can link https://github.com/kedro-org/awesome-kedro/blob/master/README.md#example-projects to direct people for more examples. It's also helpful to see real code
I manually went through all Kedro projects listed there, I don't think there's any one that implements shared things between pipeline in a way that would be a good example to follow here. The closest one is this one: https://github.com/pablovdcf/TFM_HADO_Cares/tree/main/hado/src/hado

* [Use Kedro in a monorepo setup](#kedro-project-in-a-monorepo-setup) if there are
  software components independent of Kedro that you want to keep together in the version control system.
* Delete or rename the default `nodes.py` file, or split it into multiple files or modules.
* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example,
  into `historical_pipeline.py` and `inference_pipeline.py`.

I feel like we usually go for the namespace/modular structure more:
@astrojuanlu thoughts?
I agree that
Is that better?
sounds good!

* Instead of registering many pipelines in the `register_pipelines()` function one by one,
  create a few `tp.Dict[str, Pipeline]` objects in different places of the project
  and then make `register_pipelines()` return a union of those.
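
As a minimal sketch of the last option, the helper below merges several name-to-pipeline dictionaries into one and fails loudly on duplicate names. The helper `merge_registries` and the dictionary names mentioned in the comments are hypothetical and not part of Kedro; the type variable stands in for `kedro.pipeline.Pipeline` so that the snippet runs without Kedro installed.

```python
from typing import Dict, Mapping, TypeVar

# Stand-in for `kedro.pipeline.Pipeline` (an assumption, so the sketch has no Kedro dependency).
PipelineT = TypeVar("PipelineT")


def merge_registries(*registries: Mapping[str, PipelineT]) -> Dict[str, PipelineT]:
    """Combine several name-to-pipeline mappings, refusing duplicate names."""
    merged: Dict[str, PipelineT] = {}
    for registry in registries:
        duplicates = merged.keys() & registry.keys()
        if duplicates:
            raise ValueError(f"Pipeline names registered twice: {sorted(duplicates)}")
        merged.update(registry)
    return merged


# Hypothetical usage inside `pipeline_registry.py`, assuming TRAINING_PIPELINES and
# INFERENCE_PIPELINES are dictionaries defined elsewhere in the project:
#
# def register_pipelines() -> Dict[str, Pipeline]:
#     return merge_registries(TRAINING_PIPELINES, INFERENCE_PIPELINES)
```

Raising on duplicate names is a design choice of this sketch: a plain dictionary union would silently let a later registry override an earlier one.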

what about
Added:
My questions are:
I think this is a viable alternative, but if this is in docs instead of a blog. I'll probably change the narrative to: There is an alternative to register pipeline manually, explaining the @astrojuanlu thought?

## Common codebase extension scenarios

This section provides examples of how you can handle some common cases of adding more
code to or around your Kedro project.
These examples are by no means the only ways to handle such scenarios;
they are for illustrative purposes only.

### Sharing modules between pipelines

You often have code that has to be imported by multiple pipelines.

To keep it as part of a Kedro project, **create a module (for example, `utils`) at the same
level as the `pipelines` folder**, and organise the functionality there:

```text
├── conf
├── data
├── notebooks
└── src
    ├── my_project
    │   ├── __init__.py
    │   ├── __main__.py
    │   ├── pipeline_registry.py
    │   ├── settings.py
    │   ├── pipelines
    │   └── utils                       <-- Create a module to store your utilities
    │       ├── __init__.py             <-- Required to import from it
    │       ├── pandas_utils.py         <-- Put a file with utility functions here
    │       ├── dictionary_utils.py     <-- Or a few files
    │       └── visualisation_utils     <-- Or sub-modules to organise even more utilities
    └── tests
```

For example, to import the `find_common_keys` function from `dictionary_utils.py`:

```python
from my_project.utils.dictionary_utils import find_common_keys
```
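
The source does not show the body of `find_common_keys`, so the following is only an assumed implementation of what `dictionary_utils.py` might contain: a function that returns the keys shared by all of the given dictionaries.

```python
from typing import Any, Mapping, Set


def find_common_keys(*mappings: Mapping[str, Any]) -> Set[str]:
    """Return the keys that are present in every given mapping (hypothetical helper)."""
    if not mappings:
        return set()
    # Start from the keys of the first mapping and narrow them down one mapping at a time.
    common = set(mappings[0])
    for mapping in mappings[1:]:
        common.intersection_update(mapping)
    return common
```

Because the function has no Kedro dependency, any pipeline in the project can import it, and it can be unit-tested on its own.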

```{note}
For an IDE to resolve imports like this correctly, you need to perform an editable
installation of the Kedro project in your virtual environment
with `pip install -e <root-of-kedro-project>`. The easiest way to do
this is to `cd` to the root of your Kedro project and run `pip install -e .`.
```

### Kedro project in a monorepo setup

The way a Kedro project is generated may give the impression that it should
only act as the root of a `git` repository. This is not true: just as you can combine
multiple Python packages in a single repository, you can combine multiple Kedro projects,
or a Kedro project with other parts of your project's software stack.

```{note}
The practice of combining multiple, often unrelated software components in a single version
control repository is not specific to Python and is called [_**monorepo design**_](https://monorepo.tools/).
```

A common situation is that a software product built by a team has components that
are clearly separable from the Kedro project.

Let's use **a recommendation tool for production equipment operators** as an example.
This example consists of three parts:

| **#** | **Part** | **Considerations** |
|-------|----------|--------------------|
| 1 | An ML model, or more precisely, a workflow to prepare the data, train an estimator, and ship it to a registry | <ul> <li>Kedro fits well here, as it lets you develop those pipelines in a modular and extensible way.</li> </ul> |
| 2 | An optimiser that leverages the ML model and implements domain business logic to derive recommendations | <ul> <li>A good design consideration might be to make it independent of the UI framework.</li> </ul> |
| 3 | User interface (UI) application | <ul> <li>This can be a [`plotly`](https://plotly.com/python/) or [`streamlit`](https://streamlit.io/) dashboard.</li> <li>Or even a full-fledged front-end app built with a JS framework like [`React`](https://react.dev/).</li> <li>Regardless, this component may know how to access the ML model, but it should probably not know anything about how the model was trained or whether Kedro was involved.</li> </ul> |

A suggested solution in this case would be a **monorepo** design. Below is an example:

```text
└── repo_root
    ├── packages
    │   ├── kedro_project     <-- A Kedro project for ML model training.
    │   │   ├── conf
    │   │   ├── data
    │   │   ├── notebooks
    │   │   └── ...
    │   ├── optimizer         <-- Standalone package.
    │   └── dashboard         <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline.
    ├── requirements.txt      <-- Linters, code formatters... Not dependencies of packages.
    ├── pyproject.toml        <-- Settings for those, like `[tool.isort]`.
    └── ...
```
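
One way to keep the `optimizer` package unaware of how the model was trained is to make it depend only on the minimal interface it needs. The sketch below is an assumption made for illustration: `SupportsPredict` and `recommend_setting` are hypothetical names, and the model is reduced to a single `predict` method that scores one numeric equipment setting.

```python
from typing import Protocol, Sequence


class SupportsPredict(Protocol):
    """Hypothetical interface: anything that can score a candidate setting."""

    def predict(self, setting: float) -> float:
        ...


def recommend_setting(model: SupportsPredict, candidates: Sequence[float]) -> float:
    """Return the candidate setting with the highest predicted outcome."""
    if not candidates:
        raise ValueError("At least one candidate setting is required.")
    return max(candidates, key=model.predict)
```

With this shape, the `dashboard` package can import `recommend_setting` and pass in a model loaded from a registry, and neither package needs to import anything from the Kedro project.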
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,4 +6,5 @@ | |
dependencies | ||
session | ||
settings | ||
code_beyond_starter_files | ||
``` |
I like the clarification, but I found this sounds like coming from an user rather than the official docs.
#2512 (comment), we have an answer for this and the issue is still opened. Would it be better to actually write the documentation and link here instead?

Can you elaborate more on what exact change are you proposing here?
To reference this GH issue right in the `code_beyond_starter_files.md`? Or to make content of comments on that issue part of this new section of docs?

hmm... I am not the best person to ask for English but I'll try my best🤓
I'd rephrase it to something like "When project become large, it may be beneficial to adopt a different structure or give pipeline files a more specific names." It is by convention (and `find_pipeline` use that convention), but not mandatory to name files `pipeline.py`

@noklam Your examples is specifically around pipeline files, while I wanted to convey 2 things here:
`nodes.py`/`pipeline.py` files grow, they become challenging to manage.
Do you disagree with those? I think the first one is just a fact based on how git diff works - if you have one big file, you're more likely to have merge conflicts, etc. And the second one, I wanted to cover not only the pipeline, but also node files there.