Run kedro new without creating a new directory #681

Open
jaklan opened this issue Feb 1, 2021 · 31 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@jaklan

jaklan commented Feb 1, 2021

Description & Context

Currently, when running kedro new I have to specify a folder where my project is created. It doesn't make sense when I use venv or Poetry with an in-project venv, because I have to create a directory myself anyway to init a venv and install Kedro there. As a result, I have to manually move the new project to the parent folder with mv ~/repos/project_name/project_name/* ~/repos/project_name/.

Possible Implementation

Add flag to skip creating a folder.

Possible Alternatives

Ask about it during init.

@jaklan jaklan added the Issue: Feature Request New feature or improvement to existing feature label Feb 1, 2021
@WaylonWalker
Contributor

I find it a bit awkward as well that you need to have kedro installed before you have created your project and likely have created your environment. Is it possible to give a cookie-cutter command alternative?

Potentially related is the chicken/egg situation of needing to have kedro installed (potentially globally) for project setup, in a version that may or may not be the one you want in the project. Is it possible to achieve the same results as kedro install with pip install -e .? Users may be onboarding to a new project in 0.16.x and simply grab the latest version off of PyPI (pip install kedro) before running kedro install.

Other than pip install -r requirements.txt or reading requirements.txt for the specific version of kedro, how should users know how to set up the project for local development?

pull bot pushed a commit to vishalbelsare/kedro that referenced this issue Apr 4, 2021
@stale

stale bot commented Apr 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 12, 2021
@jaklan
Author

jaklan commented Apr 12, 2021

Bump

@stale stale bot removed the stale label Apr 12, 2021
@lorenabalan
Contributor

Hey @jaklan thanks for the suggestion! It sounds like it might be related to cookiecutter/cookiecutter#909 & cookiecutter/cookiecutter#907 - I suggest you also express your interest there, maybe it'll speed it along for the next cookiecutter release. I think it'd be better for them to address it rather than us building a custom solution, but I'll leave this open for a while to gauge interest on the feature request. 🤔

@lorenabalan
Contributor

@WaylonWalker yes that's a valid point, though I'm not sure how it relates to the original question? It feels to me like it deserves its own separate discussion.

@stale

stale bot commented Jun 19, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@astrojuanlu
Member

I was oblivious to this issue because I was using out-of-tree environments with conda/mamba, but a colleague that just tried to use Kedro with venv experienced the same confusion. Here is some insight into in-tree (local) vs out-of-tree (global) environment workflows https://snarky.ca/classifying-python-virtual-environment-workflows/

Notice we already have some documentation about using venv or Pipenv instead of conda https://kedro.readthedocs.io/en/stable/faq/faq.html#can-i-create-a-virtual-environment-without-conda although I think we could make it a bit more clear (#2360).

I think it would be good to tag this as "Won't fix" (since it's unlikely that the upstream issue in cookiecutter is ever addressed). I'm going to go ahead and do it, otherwise folks feel free to reverse my decision cc @AntonyMilneQB

@astrojuanlu
Member

Another side effect of not being able to init a Kedro project in the current directory: users create weird structures with 2 READMEs for nothing, like https://github.com/pablovdcf/TFM_HADO_Cares

I know this issue was closed 2+ years ago but honestly it was basically the first pain point I encountered #2360 and it keeps coming up over and over again. I'm reopening so that we can reprioritize.

@astrojuanlu astrojuanlu reopened this Oct 23, 2023
@astrojuanlu astrojuanlu removed the Resolution: Wontfix This problem or suggestion will not be implemented label Oct 23, 2023
@astrojuanlu
Member

astrojuanlu commented Oct 27, 2023

This is needed to have venv/virtualenv as first-class citizens in the Kedro installation instructions I think (otherwise the workflow is too weird).

An informal poll run by @/brettcannon showed that the majority of developers store virtual environments next to their project code. https://snarky.ca/classifying-python-virtual-environment-workflows/

[Image: poll results on where developers store their virtual environments]

@antonymilne
Contributor

I wonder what an equivalent poll result would look like for Kedro users. I suspect it would be much more biased towards global/central directory due to the prevalence of conda.

The point definitely still stands and it would be great if there were a better way to handle in-tree environments, but I would just be cautious assigning priority based on the above poll. It might be worth even doing the same sort of poll for kedro users (or potential kedro users I guess, because there's a selection bias as those who end up using kedro are more likely to be those who had a smooth experience using global environments). Maybe some such poll already exists for data scientists/similar.

@jaklan
Author

jaklan commented Oct 27, 2023

@antonymilne there's no need to overthink that topic and refer to any polls - Kedro should allow people to generate a project in the current directory, period. In-project venvs are absolutely common in the Python ecosystem and they should be simply supported. Especially taking into consideration the fact you can solve the issue with one command moved to Kedro internals.

@noklam
Contributor

noklam commented Oct 27, 2023

How would such a function be implemented? If I understand correctly, cookiecutter still doesn't support this today, so it needs to be implemented in Kedro.

Do we need to handle edge cases with existing files? Assuming an empty folder will be easy, would this be good enough?

@jaklan
Author

jaklan commented Oct 28, 2023

@noklam I have already answered that in the initial issue (more than 2 years ago btw...) - you can mimic mv inside Kedro.

Assuming an empty folder will be easy, would this be good enough?

Of course not, because the whole discussion is about generating a project in the current directory, where you already have .venv created (and probably other files, like the Poetry-specific ones, as well).

Do we need to handle edge cases with existing files?

You can simply display a proper warning when running kedro new with e.g. --cwd flag and wait for user confirmation then. Of course it can be more sophisticated and analyse if any files will be overwritten or not, but that's the easiest approach to start with.
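
A minimal Python sketch of the "mimic mv, warn, then wait for confirmation" idea. The function name flatten_into_parent and the confirm callback are hypothetical and not part of Kedro. The sketch assumes the project has already been generated into a nested folder, and it skips (rather than overwrites) anything that already exists one level up.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def flatten_into_parent(nested: Path, confirm: Callable[[str], bool]) -> List[Path]:
    """Move the contents of `nested` up one level after asking for confirmation.

    Hypothetical helper for illustration only. `confirm` receives a warning
    message and returns True to proceed. Existing files are never overwritten.
    """
    parent = nested.parent
    # iterdir() also yields dotfiles such as .gitignore, which `mv nested/*` would miss.
    entries = sorted(nested.iterdir())
    clashes = [entry.name for entry in entries if (parent / entry.name).exists()]

    message = f"Move {len(entries)} entries from {nested} into {parent}?"
    if clashes:
        message += f" Already present, will be left in {nested.name}/: {clashes}"
    if not confirm(message):
        return []

    moved = []
    for entry in entries:
        target = parent / entry.name
        if target.exists():
            continue  # conservative: leave clashing entries where they are
        shutil.move(str(entry), str(target))
        moved.append(target)

    # Remove the nested folder only if nothing was left behind.
    if not any(nested.iterdir()):
        nested.rmdir()
    return moved
```

In a CLI, confirm could be something like click.confirm, since click is already a Kedro dependency; in a script, lambda message: True skips the prompt.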

Generally, I see there's also another issue about Poetry itself: #1722, so you need to implement a mechanism to move files anyway if you really want to support it. But there's also another approach - utilise a different, globally installed, CLI tool to initialise Kedro projects - e.g. kedro-starter. This way you avoid the chicken & egg problem.

@astrojuanlu
Member

astrojuanlu commented Oct 28, 2023

Indeed, cookiecutter does not support this, nothing has changed since #681 (comment)

Notice that cookiecutter already has "override/fail if exists" functionality, it's just that it always creates a subdirectory. We'll probably have to move files around.

We could start with a conservative stance, like "if any of the files I'm going to create already exists, fail". But this is easier said than done, because then kedro new would need knowledge of the cookiecutter structure cookiecutter/cookiecutter#1004

It's actually easier to blindly copy everything over, but this poses a risk of data loss.
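
One possible way around needing to know the template structure up front, sketched in Python: render the project into a temporary directory first, then compare the rendered tree against the target before copying anything. find_conflicts and copy_if_no_conflicts are hypothetical names, not Kedro or cookiecutter APIs, and the rendering step itself is assumed to have happened already.

```python
import shutil
from pathlib import Path
from typing import List


def find_conflicts(generated: Path, target: Path) -> List[Path]:
    """Return relative paths of generated files that already exist in `target`."""
    conflicts = []
    for path in generated.rglob("*"):  # rglob("*") includes dotfiles
        relative = path.relative_to(generated)
        if path.is_file() and (target / relative).exists():
            conflicts.append(relative)
    return sorted(conflicts)


def copy_if_no_conflicts(generated: Path, target: Path) -> None:
    """Copy the rendered tree into `target`, or fail before touching anything."""
    conflicts = find_conflicts(generated, target)
    if conflicts:
        names = ", ".join(str(path) for path in conflicts)
        raise FileExistsError(f"Refusing to overwrite existing files: {names}")
    # dirs_exist_ok (Python 3.8+) lets the copy merge into a non-empty target,
    # e.g. one that already contains .venv.
    shutil.copytree(generated, target, dirs_exist_ok=True)
```

This only checks regular files; a file in the target that clashes with a generated directory of the same name is not handled and would need an extra check.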

copier handles this beautifully, but refactoring kedro away from cookiecutter would be painful:

[Image: screenshot, IMG_20231028_164656]

I don't think this is impossible though. Once we agree this is needed, we'll have to think carefully about the path of least resistance.

@astrojuanlu
Member

I never addressed @antonymilne's point:

I wonder what an equivalent poll result would look like for Kedro users. I suspect it would be much more biased towards global/central directory due to the prevalence of conda.

And also because we are hiding our venv instructions behind a collapsible menu + the ergonomics are really weird:

[Image: screenshot of the venv instructions in the Kedro docs]

so I wouldn't be surprised if the current users got sort of used to it. But that's the key trap we have to avoid, and you spelled it out already:

(or potential kedro users I guess, because there's a selection bias as those who end up using kedro are more likely to be those who had a smooth experience using global environments)

I can run an informal poll in Slack and see what people think.

@astrojuanlu
Member

On one hand, our over-reliance on conda creates some trouble for certain users. For example, here is a user that is struggling to install a compatible version of Kedro on Python 3.8 because of the pip and setuptools constraints https://linen-slack.kedro.org/t/16034230/hello-i-have-created-a-kedro-matlab-custom-dataset-which-i-w#20e4ffe4-e697-47fa-a722-d74a752b7bed

On the other hand, as much as I'd like this to happen, I'm reconsidering how impactful the change would be, because according to an informal survey I ran on Slack, the main annoyance seems to be that users have a "global" Kedro and a project-specific Kedro https://linen-slack.kedro.org/t/16040768/u05bdslpj72-finally-gave-the-steps-in-https-kedro-org-slack-#db2c34ed-43bf-4479-839c-5a4fb4154a10

So much so, that some users don't use kedro new at all and rely on cookiecutter directly ❗ https://linen-slack.kedro.org/t/16031681/hello-here-wave-skin-tone-3-i-come-with-a-thorny-question-to#e0833bfb-336a-4e7c-acef-948cc0146694 cc @inigohidalgo and this is tricky because I don't know what the implications of the new add-ons flow are for this workflow.

However, this points to a new interesting direction that might have even more impact: making kedro-new a non-mandatory plugin. That's a large change that will need to be discussed in its own issue.

@astrojuanlu
Member

Issue about improving our installation documentation #3281

@astrojuanlu
Member

Notice that, if kedro new could init the current directory, in principle users wouldn't need a global Kedro, but there would still be two installation steps:

  1. Create project directory: mkdir spaceflights && cd spaceflights
  2. Create venv: python -m venv .venv && source .venv/bin/activate
  3. Install Kedro: pip install kedro
  4. kedro new --outdir . (or whatever)
  5. Install project dependencies: pip install -r requirements.txt

At least 2 users consider this second installation step confusing but there's not much else that can be done I believe https://linen-slack.kedro.org/t/16040768/u05bdslpj72-finally-gave-the-steps-in-https-kedro-org-slack-#8a7c5923-7fee-4d46-9229-1ca566b248e8

@datajoely
Contributor

Is there a route where we make pipx the recommended install path?

@astrojuanlu
Member

I don't think so. The problem here is that there are 2 CLI commands that have conflicting purposes:

  • kedro new is unencumbered by project dependencies because the project doesn't exist by the time it's called, and also it doesn't change a lot so there's no point in upgrading.
  • kedro run (and anything related to the actual Kedro project) requires the project dependencies to work (hence mandates that kedro is installed in the same environment) and also should be kept up to date to benefit from new features, bug fixes, and performance improvements.

I don't think there's a way to reconcile these two sets of requirements.

@datajoely
Contributor

Slightly out-there suggestion: if we had a web UI on the website for the project add-ons workflow, we could then spit out a folder that covers the kedro new part.

@jaklan
Author

jaklan commented Nov 7, 2023

Let me just bump an idea mentioned above:

But there's also another approach - utilise a different, globally installed, CLI tool to initialise Kedro projects - e.g. kedro-starter. This way you avoid the chicken & egg problem.

I believe it would solve many of the issues discussed in that thread. You could install kedro-starter with e.g. pipx then and it would be a quite similar experience to using cookiecutter directly (in other words, you could treat kedro-starter as a wrapper on top of cookiecutter).

@inigohidalgo
Contributor

inigohidalgo commented Nov 7, 2023

@jaklan that would overcome the main limitation I see from installing kedro in pipx, which is the conflicting versions between the local .venv kedro and the global pipx kedro. If the tool only takes up the kedro-starter or kedro-new namespace and leaves kedro open to only be the local .venv version that would be a clean separation.

@martxelo

I think this feature is really useful. I use venv inside the root directory of my projects. Normally:

  • I create a GitHub repository (with README.md and .gitignore)
  • Clone the repo && cd repo
  • Then create the venv and activate.
  • New requirements.txt (numpy, pandas... kedro)
  • kedro new . (for example)

It is not likely that I start my project in an existing one, so problems with pre-existing files are not something I am afraid of.

This is a quite simple feature (when explained) but apparently it has many internal tricky parts. I hope you can solve it.

👍

@astrojuanlu
Member

Another user complained about this today.

@pietroppeter

I would be interested in this, and since it has not been mentioned, let me add another use case that I guess will increase in popularity: initializing a project with uv, adding kedro and then wanting a kedro project (or in general wanting to manage a kedro project with uv).

As mentioned in this issue, kedro new would not work in a project already initialized by uv.
If instead I initialize a new kedro project (for example with uvx kedro new), it is not evident how to adapt it to being managed by uv instead of the standard kedro setup.

Just by comparing the pyproject.toml file I have a bare uv initialized project with:

[project]
name = "my-project"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "kedro>=0.19.8",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

The kedro initialized project (in this case with all options) has pyproject.toml that starts like this:

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "poste_logistics"
readme = "README.md"
dynamic = ["dependencies", "version"]

[project.scripts]
poste-logistics = "poste_logistics.__main__:main"

[project.entry-points."kedro.hooks"]

[project.optional-dependencies]
docs = [
    "docutils<0.21",
    "sphinx>=5.3,<7.3",
    "sphinx_rtd_theme==2.0.0",
    "nbsphinx==0.8.1",
    "sphinx-autodoc-typehints==1.20.2",
    "sphinx_copybutton==0.5.2",
    "ipykernel>=5.3, <7.0",
    "Jinja2<3.2.0",
    "myst-parser>=1.0,<2.1"
]

[tool.setuptools.dynamic]
dependencies = {file = "requirements.txt"}
version = {attr = "poste_logistics.__version__"}

[tool.setuptools.packages.find]
where = ["src"]
namespaces = false

...

The main difference I see is that kedro uses setuptools while uv uses hatch (a comparison). Not sure what that entails (I do not know much about either).

More importantly dependencies (and version) are dynamically managed and taken from requirements.txt while uv will manage them directly in the toml file.
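
To make that difference concrete, here is a small Python sketch that parses trimmed copies of the two files above and reports the build backend and whether dependencies are declared statically or dynamically. It assumes Python 3.11+ for tomllib (falling back to the third-party tomli package otherwise). The summarise helper is made up for illustration, and the [build-system] header in the second snippet is an assumption.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # older interpreters: third-party backport
    import tomli as tomllib

# Trimmed from the uv-initialized pyproject.toml quoted above.
UV_PYPROJECT = """
[project]
name = "my-project"
version = "0.1.0"
dependencies = ["kedro>=0.19.8"]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
"""

# Trimmed from the kedro-generated pyproject.toml quoted above
# (the [build-system] header is assumed).
KEDRO_PYPROJECT = """
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "poste_logistics"
dynamic = ["dependencies", "version"]

[tool.setuptools.dynamic]
dependencies = {file = "requirements.txt"}
version = {attr = "poste_logistics.__version__"}
"""


def summarise(pyproject_text: str) -> dict:
    """Extract the fields that differ between the two layouts."""
    data = tomllib.loads(pyproject_text)
    project = data.get("project", {})
    return {
        "backend": data.get("build-system", {}).get("build-backend"),
        "static_dependencies": project.get("dependencies", []),
        "dynamic_fields": project.get("dynamic", []),
    }


for label, text in (("uv init", UV_PYPROJECT), ("kedro new", KEDRO_PYPROJECT)):
    print(label, summarise(text))
```

The uv file lists its dependencies directly, while the kedro file marks dependencies and version as dynamic and delegates them to setuptools.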

Given this state, it seems to me that for this case it might be better for me to start with the uv initialized project and try to add kedro content to it.

...and while I finished writing this, I of course looked for it and there is already an issue about using kedro and uv together: #4116

I think it is anyway fair to have this mentioned here, but probably I should follow up the discussion there...

@datajoely
Contributor

I'm actually fully aligned with @pietroppeter on this, typically Kedro as a project takes a very conservative approach to integrations. For example, we only started building a VS Code plug-in once we were sure it had won against PyCharm.

I'm very much on the astral plane so I would love to make a uv-first UX.

@astrojuanlu
Member

astrojuanlu commented Aug 27, 2024

@pietroppeter Could you try this?

$ uv init --lib
$ uvx kedro-init .
$ uv add kedro
$ uv run kedro registry list
$ uv run kedro pipeline create data_processing

?

@pietroppeter

pietroppeter commented Aug 27, 2024

ahah what is that? :)
it works indeed!
it misses some stuff (e.g. a gitignore, which is also not in uv by default) but it already seems a better starting point than what I had. Thanks!

for ref the toml at the end (from a clean uv init) is:

[project]
name = "my-project"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "kedro>=0.19.8",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.kedro]
project_name = "my-project"
package_name = "my_project"
kedro_init_version = "0.19.8"

[tool.kedro_telemetry]
project_id = "27c07122aeb74b56ba1b15da5d71141e"

not sure what the telemetry is about, but I guess that is fair for a new thing... :)

edit: added output of tree:

.
├── README.md
├── conf
│   ├── base
│   │   └── parameters_data_processing.yml
│   └── local
├── pyproject.toml
├── src
│   └── my_project
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-312.pyc
│       │   ├── pipeline_registry.cpython-312.pyc
│       │   └── settings.cpython-312.pyc
│       ├── pipeline_registry.py
│       ├── pipelines
│       │   └── data_processing
│       │       ├── __init__.py
│       │       ├── nodes.py
│       │       └── pipeline.py
│       └── settings.py
├── tests
│   └── pipelines
│       └── data_processing
│           ├── __init__.py
│           └── test_pipeline.py
└── uv.lock
for ref below is the output for the commands:
(default) ➜  my-project git:(main) ✗ uvx kedro-init .
Installed 58 packages in 55ms
[11:00:11] Looking for existing package directories                                                                                                                                                                                      cli.py:25
[11:00:12] Initialising config directories                                                                                                                                                                                               cli.py:25
           Creating modules                                                                                                                                                                                                              cli.py:25
           🔶 Kedro project successfully initialised!                                                                                                                                                                                    cli.py:26
(default) ➜  my-project git:(main) ✗ uv add kedro
Resolved 53 packages in 102ms
   Built my-project @ file:///Users/pietropeterlongo/projects/my-project
Prepared 51 packages in 200ms
Uninstalled 1 package in 0.51ms
Installed 52 packages in 44ms
 + antlr4-python3-runtime==4.9.3
 + appdirs==1.4.4
 + arrow==1.3.0
 + attrs==24.2.0
 + binaryornot==0.4.4
 + build==1.2.1
 + cachetools==5.5.0
 + certifi==2024.7.4
 + chardet==5.2.0
 + charset-normalizer==3.3.2
 + click==8.1.7
 + cookiecutter==2.6.0
 + dynaconf==3.2.6
 + fsspec==2024.6.1
 + gitdb==4.0.11
 + gitpython==3.1.43
 + idna==3.8
 + importlib-metadata==8.4.0
 + importlib-resources==6.4.4
 + jinja2==3.1.4
 + kedro==0.19.8
 + kedro-telemetry==0.6.0
 + markdown-it-py==3.0.0
 + markupsafe==2.1.5
 + mdurl==0.1.2
 + more-itertools==10.4.0
 + omegaconf==2.3.0
 + packaging==24.1
 + parse==1.20.2
 + platformdirs==4.2.2
 + pluggy==1.5.0
 ~ poste-logistics==0.1.0 (from file:///Users/pietropeterlongo/projects/poste-logistics)
 + pre-commit-hooks==4.6.0
 + pygments==2.18.0
 + pyproject-hooks==1.1.0
 + python-dateutil==2.9.0.post0
 + python-slugify==8.0.4
 + pytoolconfig==1.3.1
 + pyyaml==6.0.2
 + requests==2.32.3
 + rich==13.8.0
 + rope==1.13.0
 + ruamel-yaml==0.18.6
 + ruamel-yaml-clib==0.2.8
 + six==1.16.0
 + smmap==5.0.1
 + text-unidecode==1.3
 + toml==0.10.2
 + types-python-dateutil==2.9.0.20240821
 + typing-extensions==4.12.2
 + urllib3==2.2.2
 + zipp==3.20.1
(default) ➜  my-project git:(main) ✗ uv run kedro registry list
[08/27/24 11:01:17] INFO     Using '/Users/pietropeterlongo/projects/my-project/.venv/lib/python3.12/site-packages/kedro/framework/project/rich_logging.yml' as logging configuration.                                        __init__.py:249
- __default__

[08/27/24 11:01:18] INFO     Kedro is sending anonymous usage data with the sole purpose of improving the product. No personal data or IP addresses are stored on our side. If you want to opt out, set the                          plugin.py:233
                             `KEDRO_DISABLE_TELEMETRY` or `DO_NOT_TRACK` environment variables, or create a `.telemetry` file in the current working directory with the contents `consent: false`. Read more at                                   
                             https://docs.kedro.org/en/stable/configuration/telemetry.html                                                                                                                                                        
(default) ➜  my-project git:(main) ✗ uv run kedro pipeline create data_processing 
   Built my-project @ file:///Users/pietropeterlongo/projects/my-project
Uninstalled 1 package in 0.47ms
Installed 1 package in 0.97ms
[08/27/24 11:01:56] INFO     Using '/Users/pietropeterlongo/projects/my-project/.venv/lib/python3.12/site-packages/kedro/framework/project/rich_logging.yml' as logging configuration.                                        __init__.py:249
Using pipeline template at: '/Users/pietropeterlongo/projects/my-project/.venv/lib/python3.12/site-packages/kedro/templates/pipeline'
Creating the pipeline 'data_processing': OK
  Location: '/Users/pietropeterlongo/projects/my-project/src/my_project/pipelines/data_processing'
Creating '/Users/pietropeterlongo/projects/my-project/tests/pipelines/data_processing/__init__.py': OK
Creating '/Users/pietropeterlongo/projects/my-project/tests/pipelines/data_processing/test_pipeline.py': OK
Creating '/Users/pietropeterlongo/projects/my-project/conf/base/parameters_data_processing.yml': OK

Pipeline 'data_processing' was successfully created.

[08/27/24 11:01:56] INFO     Kedro is sending anonymous usage data with the sole purpose of improving the product. No personal data or IP addresses are stored on our side. If you want to opt out, set the                          plugin.py:233
                             `KEDRO_DISABLE_TELEMETRY` or `DO_NOT_TRACK` environment variables, or create a `.telemetry` file in the current working directory with the contents `consent: false`. Read more at                                   
                             https://docs.kedro.org/en/stable/configuration/telemetry.html                                                                                                                                                        
(default) ➜  my-project git:(main) ✗

@astrojuanlu
Member

astrojuanlu commented Aug 27, 2024

For the gitignore, I usually do

$ curl -sL "https://gitignore.io/api/python,jupyternotebooks" > .gitignore

😄 but duly noted! astrojuanlu/kedro-init#4

@astrojuanlu
Member

A user:

double pyproject.toml makes me want to cry

(and I wholeheartedly agree)
