Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

include conf parameters and specific small data files in kedro package #1607

Closed
vitorpbarbosa7 opened this issue Jun 9, 2022 · 13 comments
Closed
Labels
Component: Configuration Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@vitorpbarbosa7
Copy link

vitorpbarbosa7 commented Jun 9, 2022

Description

The package I'm developing requires to include some parameters from yml files and some pickles from data 06_models folder.

Context

Developing a machine learning pipeline model which receives as input, when deployed, some parameters, and also the model serialized as pickle.

Possible Implementation

I tried to use package_data, data_files, package_dir in setup.py, and MANIFEST.in, but even with those, could not include the data files and conf yml files in the final wheel.

From my understading of how setup.py works, it will only search for resources in the sources parent directory, that's why, I think, the possible implementations i told, are not working.

Possible Alternatives

If it's not possible to refer to previous files from source in the directory tree, would it be possible to move the data and conf folders to src and change any parameter in kedro to look for those folders in that new address? (I know this is probably too much of changing of how the framework was though)

I read the documentation which says:
"Recipients of the .egg and .whl files need to have Python and pip on their machines, but do not need to have Kedro installed. The project is installed to the root of a folder with the relevant conf/, data/ and logs/ subfolders, by navigating to the root and calling:"
https://kedro.readthedocs.io/en/latest/tutorial/package_a_project.html

but that's exactly what I do not want. I really need more files to be packed together with source code.

Tried to solve this problem with some of said here:
https://stackoverflow.com/questions/32609248/setuptools-adding-additional-files-outside-package
But did not work

@vitorpbarbosa7 vitorpbarbosa7 added the Issue: Feature Request New feature or improvement to existing feature label Jun 9, 2022
@antonymilne
Copy link
Contributor

antonymilne commented Jun 10, 2022

Hi @vitorpbarbosa7, kedro package is a pretty thin wrapper for calling python setup.py (see https://github.com/kedro-org/kedro/blob/main/kedro/framework/cli/project.py#L133), so anything which is possible in the normal Python universe should be possible here. Hence you should be able to add extra files by tinkering with setup.py, even with those files outside src. I'll give it a go later to try and find the right options.

If it's not possible to refer to previous files from source in the directory tree, would it be possible to move the data and conf folders to src and change any parameter in kedro to look for those folders in that new address? (I know this is probably too much of changing of how the framework was though)

This actually is possible because you could configure CONF_SOURCE in settings.py to point to src/conf/ for example. But indeed it's not really the recommended way of doing things - better to try and work out the right way to get setuptools to include the other folders automatically.

@antonymilne
Copy link
Contributor

For example, this seems to work in setup.py:

    include_package_data=True,
    package_data={"": [str((Path.cwd() / "../conf/base/catalog.yml").resolve())]},

I think the key might be that you need to specify a full path to anything that's above setup.py in the directory hierarchy (hence the use of Path from pathlib). You should be able to do whole directories using some wildcard matching with * to include directories.

@antonymilne
Copy link
Contributor

I just thought of another way of doing this: if you move everything from src into the root of your kedro project (something like mv -r src/* .) and add source_dir = "." to the tool.kedro section of pyproject.toml then you won't have any difficulties with including things at a level above setup.py. This way it should be possible to just use MANIFEST.in and include_package_data=True since you won't need any .. in the MANIFEST.in which didn't seem to work for me.

@vitorpbarbosa7
Copy link
Author

vitorpbarbosa7 commented Jun 11, 2022

Thank you for your answers, they really helped.

What worked for me was moving everything from src into the root of the kedro root package, as you said, using the source_dir = ".", and making also use of MANIFEST.in alongside with include_package_data in setup.py, for recursively include the files I wanted. And actually, for making this really work, I had to create a init.py package in root directory of the whole kedro package.

General structure:
image

And the source directory, actually became the ames:
image

@antonymilne
Copy link
Contributor

Awesome, thanks very much for letting us know! Glad to hear it worked.

@antonymilne
Copy link
Contributor

@datajoely thinks we should consider again whether conf should be packaged as part of a project. Would like to get @idanov's thoughts on this.

@datajoely
Copy link
Contributor

Not by default, but provide a way to do so if they really want to.

@antonymilne
Copy link
Contributor

antonymilne commented Jun 28, 2022

I certainly think there's a good argument for this. From a QB user, I hear that some tools need you to provide everything in a single .whl file. Hence I think we need at least an official recommendation on how to get conf in there and possibly a way of doing it directly through kedro package.

This relates very closely to the question of exactly what conf should contain (applicative vs. external config), and whether/how a packaged kedro project should be able to point to a custom config location (e.g. through a conf-source argument). Let's consider this as part of the re-working of config. In the mean time I think we should at least have a documented recommendation on how to package conf.

@avan-sh
Copy link
Contributor

avan-sh commented Jun 28, 2022

Easing the option to package conf would definitely be really useful.

At the moment there's 2-3 options to package conf and needs 2-3 changes, simplifying this would decrease the deviation.Possibly moving setup.py & requirements might be enough.

@datajoely
Copy link
Contributor

Also note that setup.py is deprecated and we will have to move away from it in general

@antonymilne antonymilne added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jun 29, 2022
@merelcht merelcht added this to the Configuration overhaul milestone Jul 25, 2022
@merelcht
Copy link
Member

@vitorpbarbosa7 would you be able to explain a bit more why you want to include the model pickles inside the package?

@merelcht
Copy link
Member

merelcht commented Apr 5, 2023

Hi @vitorpbarbosa7 , the conf folder is now packaged up as a tar.gz from Kedro 0.18.7. I'll close this issue, but feel free to re-open if you want to continue the discussion 🙂

@noklam
Copy link
Contributor

noklam commented Apr 5, 2023

You will need to specify the config location via kedro run --conf-source=<path-to-compressed-file>.tar.gz, more details in the docs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Configuration Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation
Projects
Archived in project
Development

No branches or pull requests

6 participants