Create a docs page about adding code beyond starter files #3852

Closed
Changes from 13 commits (of 31 commits in total)
5f4330f
Add a tutorial on code beyond starter files
yury-fedotov May 6, 2024
036cbe0
Mention change in RELEASE.md
yury-fedotov May 6, 2024
02f0274
Address Vale comments on UK endings
yury-fedotov May 6, 2024
5875e2d
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov May 10, 2024
4d7a35a
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
ca00b72
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
ff01643
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
b858406
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
144783b
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
cf3a65f
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
729dc1a
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
2d22bbf
Merge branch 'refs/heads/main' into docs/code-beyond-starter-files
yury-fedotov May 17, 2024
a58ff32
Link deepdives to a list of examples
yury-fedotov May 17, 2024
2ed9639
Merge branch 'refs/heads/main' into docs/code-beyond-starter-files
yury-fedotov May 25, 2024
325489d
Revert weird MLflow release note edit
yury-fedotov May 25, 2024
1853e29
Remove note on changing registry location
yury-fedotov May 25, 2024
9fc168e
Simplify domain logic comment
yury-fedotov May 25, 2024
50a5bba
Replace tp.Dict by dict
yury-fedotov May 25, 2024
3024560
Remove pyproject.toml from monorepo tree example
yury-fedotov May 25, 2024
0df55ba
Replace historical and inference as pipeline split example
yury-fedotov May 25, 2024
a9f0e3a
Add a note about find_pipelines()
yury-fedotov May 25, 2024
c02309c
Remove article before find pipelines
yury-fedotov May 25, 2024
6ee9ad2
Apply suggestions from code review
merelcht Jul 9, 2024
8e811cf
Merge branch 'main' into docs/code-beyond-starter-files
merelcht Jul 9, 2024
fd96088
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Jul 11, 2024
4310cf6
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Jul 18, 2024
0e3dcf7
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Jul 22, 2024
fe19a45
Merge branch 'main' into docs/code-beyond-starter-files
noklam Aug 5, 2024
c20e741
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Aug 28, 2024
a508fe0
Merge branch 'kedro-org:main' into docs/code-beyond-starter-files
yury-fedotov Sep 7, 2024
52ed71e
Implement Nok's comments re: utility functions
yury-fedotov Sep 7, 2024
2 changes: 2 additions & 0 deletions RELEASE.md
Expand Up @@ -9,6 +9,7 @@

## Documentation changes
* Improved documentation for custom starters
* Added a guide on extending a Kedro project beyond files generated by default.

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
Expand Down Expand Up @@ -54,6 +55,7 @@ Many thanks to the following Kedroids for contributing PRs to this release:

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:

* [ondrejzacha](https://github.com/ondrejzacha)
* [Puneet](https://github.com/puneeter)

Expand Down
133 changes: 133 additions & 0 deletions docs/source/kedro_project_setup/code_beyond_starter_files.md
@@ -0,0 +1,133 @@
# Adding code beyond starter files

After you [create a Kedro project](../get_started/new_project.md) and
[add a pipeline](../tutorial/create_a_pipeline.md), you notice that Kedro generates a
few boilerplate files: `nodes.py`, `pipeline.py`, `pipeline_registry.py`...

Check warning on line 5 in docs/source/kedro_project_setup/code_beyond_starter_files.md (GitHub Actions / runner / vale)

[vale] reported by reviewdog 🐶 [Kedro.weaselwords] 'few' is a weasel word! (column 1)

While those may be sufficient for a small project, they quickly become large, hard to

Check warnings on line 7 in docs/source/kedro_project_setup/code_beyond_starter_files.md (GitHub Actions / runner / vale)

[vale] reported by reviewdog 🐶 [Kedro.toowordy] 'sufficient' is too wordy (column 20)
[vale] reported by reviewdog 🐶 [Kedro.weaselwords] 'quickly' is a weasel word! (column 57)
[vale] reported by reviewdog 🐶 [Kedro.words] Use '' instead of 'quickly'. (column 57)
read and collaborate on as your codebase grows.
Those files also sometimes make new users think that Kedro requires code
to be located only in those starter files, which is not true.

Check warning on line 10 in docs/source/kedro_project_setup/code_beyond_starter_files.md (GitHub Actions / runner / vale)

[vale] reported by reviewdog 🐶 [Kedro.weaselwords] 'only' is a weasel word! (column 15)
Contributor

I like the clarification, but to me this sounds like it is coming from a user rather than the official docs.

In #2512 (comment) we have an answer for this, and the issue is still open. Would it be better to actually write the documentation and link it here instead?

Contributor Author

Can you elaborate on what exact change you are proposing here?
To reference this GH issue right in code_beyond_starter_files.md?
Or to make the content of the comments on that issue part of this new section of the docs?

Contributor

hmm... I am not the best person to ask about English, but I'll try my best 🤓

While those may be sufficient for a small project, they quickly become large, hard to read and collaborate on as your codebase grows. Those files also sometimes make new users think that Kedro requires code to be located only in those starter files, which is not true.

I'd rephrase it to something like "When a project becomes large, it may be beneficial to adopt a different structure or give pipeline files more specific names." Naming files pipeline.py is a convention (and find_pipelines() relies on that convention), but it is not mandatory.

Contributor Author

@noklam Your example is specifically about pipeline files, while I wanted to convey two things here:

  • As a project evolves and nodes.py / pipeline.py files grow, they become challenging to manage.
  • The good news is that you can rename and restructure them.

Do you disagree with those? I think the first one is just a fact based on how git diff works: if you have one big file, you're more likely to have merge conflicts, etc. As for the second one, I wanted to cover not only the pipeline files but also the node files.

merelcht marked this conversation as resolved.

This section elaborates what are the Kedro requirements in terms of organising code

Check warning on line 12 in docs/source/kedro_project_setup/code_beyond_starter_files.md (GitHub Actions / runner / vale)

[vale] reported by reviewdog 🐶 [Kedro.toowordy] 'in terms of' is too wordy (column 57)
merelcht marked this conversation as resolved.
in files and modules.
It also provides examples of common scenarios such as sharing utilities between
pipelines and using Kedro in a monorepo setup.

Check warning on line 15 in docs/source/kedro_project_setup/code_beyond_starter_files.md (GitHub Actions / runner / vale)

[vale] reported by reviewdog 🐶 [Kedro.Spellings] Did you really mean 'monorepo'? (column 32)

## Where Kedro looks for code

The only technical constraint for arranging code in the project is that `pipeline_registry.py`

Check warning on line 19 in docs/source/kedro_project_setup/code_beyond_starter_files.md (GitHub Actions / runner / vale)

[vale] reported by reviewdog 🐶 [Kedro.weaselwords] 'only' is a weasel word! (column 5)
file must be located in the `<your_project>/src/<your_project>` directory, which is where
it is created by default.

This file must have a `register_pipelines()` function that returns a `dict[str, Pipeline]`
mapping from pipeline name to the corresponding `Pipeline` object.

Other than that, **Kedro does not impose any constraints on where you should keep files with
`Pipeline`s, `Node`s, or functions wrapped by `node`**.

```{note}
You can make Kedro look for the pipeline registry in a different place by modifying the
`__main__.py` file of your project, but such advanced customisations are out of scope for this section.
```
astrojuanlu marked this conversation as resolved.

This being the only constraint means that you can, for example:

Check warning on line 34 in docs/source/kedro_project_setup/code_beyond_starter_files.md (GitHub Actions / runner / vale)

[vale] reported by reviewdog 🐶 [Kedro.weaselwords] 'only' is a weasel word! (column 16)
merelcht marked this conversation as resolved.
* Add `utils.py` file to a pipeline folder and import utilities used by multiple

Check warning on line 35 in docs/source/kedro_project_setup/code_beyond_starter_files.md (GitHub Actions / runner / vale)

[vale] reported by reviewdog 🐶 [Kedro.toowordy] 'multiple' is too wordy (column 73)
Contributor

We usually avoid utils.py as much as possible, as it becomes the bin for everything. It's ironic, because Kedro does have a utils module left over from years ago. It hasn't been growing though, as we believe it's better to have explicit modules.

Contributor Author

@noklam how would you then name a .py file that is not nodes.py or pipeline.py for the purpose of this example? I was thinking of dataframe_utils.py, but didn't like it because it adds to the impression that Kedro is only useful in data processing projects, which isn't true.

Contributor

There are many things that I don't like about utils (just me, not a team consensus). The purpose is ill-defined, and it often becomes the place people dump code into without thinking.

Even if we go with utils, I would strip the _utils suffix: it feels redundant to have pandas_utils.py under utils. Then in the code I would probably do from <package> import utils. When I need to use it, I would write utils.pandas.func to make it clear that this is the utils namespace and not pandas itself.

visualisation_utils.py could just be the visualisation module itself.
(all of the above is subjective)

I think it would be great to first introduce the principles of shared modules and the factors to consider. Then you can show this example. https://kedro-org.slack.com/archives/C03RKP2LW64/p1716912123397259
@datajoely do you have thoughts on this?

Contributor Author

  • Comments re: naming utils modules and importing them - addressed ✔️
  • Re: introducing the principles, the hyperlink doesn't work for me, leads just to the questions channel.

functions in `nodes.py` from there.
* [Share modules between pipelines](#sharing-modules-between-pipelines).
Contributor

This would be really beneficial if we have something to showcase, maybe there are some projects in awesome-kedro that we can link to?

Even just the tree structure of a project would be great.

Contributor Author

Isn't that exactly what I'm adding there?

Screenshot 2024-05-25 at 12 47 31 PM

Contributor

The example is good, but I think we can link https://github.com/kedro-org/awesome-kedro/blob/master/README.md#example-projects to direct people for more examples. It's also helpful to see real code

Contributor Author

I manually went through all Kedro projects listed there, I don't think there's any one that implements shared things between pipeline in a way that would be a good example to follow here. The closest one is this one: https://github.com/pablovdcf/TFM_HADO_Cares/tree/main/hado/src/hado

Those could be utility functionalities, or your standalone module responsible for
astrojuanlu marked this conversation as resolved.
the domain logic of the industry you work at.
astrojuanlu marked this conversation as resolved.
* [Use Kedro in a monorepo setup](#kedro-project-in-a-monorepo-setup) if there are
software components independent of Kedro that you want to keep together in the version control system.
* Delete or rename the default `nodes.py` file, or split it into multiple files or modules.
* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example,
into `historical_pipeline.py` and `inference_pipeline.py`.
Contributor

I feel like we usually go for the namespace/modular structure more:

  • pipelines
    • historical
    • inference
      • pipeline.py

@astrojuanlu thoughts?

Contributor Author

I agree that historical vs inference sound like candidates for leveraging the same pipeline with different namespaces. To avoid confusion, I changed the example to this:

If you have multiple large `Pipeline` objects defined in a single `pipeline.py`,
split them into separate `.py` files. For example, in `data_processing` pipeline
you may want to have `cleaning_pipeline.py` and `merging_pipeline.py`.

Is that better?

Contributor

sounds good!

* Instead of registering many pipelines in the `register_pipelines()` function one by one,
create a few `dict[str, Pipeline]` objects in different places of the project
astrojuanlu marked this conversation as resolved.
and then make `register_pipelines()` return a union of those.
Contributor

what about find_pipelines()?

Contributor Author

Added:

While Kedro features a [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery),
for large projects you may want a finer control and register pipelines manually.

Contributor

My questions are:

  • Are historical_pipeline.py and inference_pipeline.py better than pipelines/historical/pipeline.py (the modular pipeline structure) that Kedro usually promotes?

I think this is a viable alternative, but since this is in the docs rather than a blog, I would probably change the narrative to: there is an alternative, which is to register pipelines manually. Explain that pipeline.py is just a convention, under which find_pipelines() works automatically, and that users still have the option to register pipelines manually if desired.

@astrojuanlu thoughts?


## Common codebase extension scenarios

This section provides examples of how you can handle some common cases of adding more
code to or around your Kedro project.
The provided examples are by no means the only ways to achieve the target scenarios;
they serve only as illustrations.

### Sharing modules between pipelines

Oftentimes you have machinery that has to be imported by multiple `pipelines`.
merelcht marked this conversation as resolved.
To keep it as part of a Kedro project, **create a module (for example, `utils`) at the same
level as the `pipelines` folder**, and organise the functionality there:

```text
├── conf
├── data
├── notebooks
└── src
    ├── my_project
    │   ├── __init__.py
    │   ├── __main__.py
    │   ├── pipeline_registry.py
    │   ├── settings.py
    │   ├── pipelines
    │   └── utils                       <-- Create a module to store your utilities
    │       ├── __init__.py             <-- Required to import from it
    │       ├── pandas_utils.py         <-- Put a file with utility functions here
    │       ├── dictionary_utils.py     <-- Or a few files
    │       ├── visualisation_utils     <-- Or sub-modules to organise even more utilities
    └── tests
```

For example, to import the function `find_common_keys` from `dictionary_utils.py`:

```python
from my_project.utils.dictionary_utils import find_common_keys
```
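
The page does not show `find_common_keys` itself. As a purely illustrative sketch, `dictionary_utils.py` could contain something like the following; the signature and behaviour are assumptions made for this example:

```python
"""Hypothetical contents of src/my_project/utils/dictionary_utils.py."""
from collections.abc import Hashable, Mapping


def find_common_keys(*mappings: Mapping) -> set[Hashable]:
    """Return the keys that are present in every given mapping."""
    if not mappings:
        return set()
    common = set(mappings[0])
    for mapping in mappings[1:]:
        # Iterating over a mapping yields its keys.
        common.intersection_update(mapping)
    return common
```

With this sketch, `find_common_keys({"a": 1, "b": 2}, {"b": 3, "c": 4})` returns `{"b"}`, and any node function in any pipeline can reuse the helper through the import shown above.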

```{note}
For imports like this to be displayed in IDE properly, it is required to perform an editable
merelcht marked this conversation as resolved.
installation of the Kedro project to your virtual environment.
This is done via `pip install -e <root-of-kedro-project>`. The easiest way to achieve
this is to `cd` to the root of your Kedro project and run `pip install -e .`.
```

### Kedro project in a monorepo setup

The way a Kedro project is generated may give the impression that it must
act as the root of a `git` repo. This is not true: just like you can combine
multiple Python packages in a single repo, you can combine multiple Kedro projects,
or a Kedro project with other parts of your project's software stack.

```{note}
The practice of combining multiple, often unrelated software components in a single version
control repository is not specific to Python and is called [_**monorepo design**_](https://monorepo.tools/).
```

A common use case of Kedro is that a software product built by a team has components that
are well separable from the Kedro project.

Let's use **a recommendation tool for production equipment operators** as an example.
This example consists of three parts:

| **#** | **Part** | **Considerations** |
|-------|----------|--------------------|
| 1 | An ML model, or more precisely, a workflow to prepare the data, train an estimator, and ship it to some registry | <ul> <li>Kedro fits well here, as it lets you develop those pipelines in a modular and extensible way.</li> </ul> |
| 2 | An optimiser that leverages the ML model and implements domain business logic to derive recommendations | <ul> <li>A good design consideration might be to make it independent of the UI framework.</li> </ul> |
| 3 | User interface (UI) application | <ul> <li>This can be a [`plotly`](https://plotly.com/python/) or [`streamlit`](https://streamlit.io/) dashboard.</li> <li>Or even a full-fledged front-end app leveraging a JS framework like [`React`](https://react.dev/).</li> <li>Regardless, this component may know how to access the ML model, but it should probably not know anything about how it was trained or whether Kedro was involved.</li> </ul> |

A suggested solution in this case would be a **monorepo** design. Below is an example:
merelcht marked this conversation as resolved.

```text
└── repo_root
    ├── packages
    │   ├── kedro_project           <-- A Kedro project for ML model training.
    │   │   ├── conf
    │   │   ├── data
    │   │   ├── notebooks
    │   │   ├── ...
    │   ├── optimizer               <-- Standalone package.
    │   └── dashboard               <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline.
    ├── requirements.txt            <-- Linters, code formatters... Not dependencies of packages.
    ├── pyproject.toml              <-- Settings for those, like `[tool.isort]`.
    └── ...
```
astrojuanlu marked this conversation as resolved.
1 change: 1 addition & 0 deletions docs/source/kedro_project_setup/index.md
Expand Up @@ -6,4 +6,5 @@
dependencies
session
settings
code_beyond_starter_files
```