Universal Kedro deployment (Part 1) - Separate external and applicative configuration to make Kedro cloud native #770
Comments
(ignore me, just butting in here) to say that this is an amazingly well written issue - one of the best and most thorough I've seen in a long time. |
Still need to digest this completely. One thing I give props to the kedro team for, regarding templates, is the move from 0.16.x to 0.17.x. It was very, very hard to work outside of the standard template in 0.16.x. It would flat-out error and not let you do things a "different" way in some cases.
Composability
0.17.x is MUCH more modular. You can compose your own template quite easily by composing the components of kedro you wish to use. To the point where you can easily create a pipeline, catalog, runner, and cli with very little code in a single script. In fact, I've done it. After working with DAGs for the past few years it feels very slow to work without one now. In some cases where there is a significant project already complete, it may not make sense to completely port to kedro, but rather bring in a bit of kedro as you maintain it.
I treat Everything as a Package
I generally think of everything as a package, something that I can pip install and run from the command line, or in the case of production put into a docker image. I think this workflow/deployment is what has led me to put everything into the package. I think it would be completely logical to find a balance of letting the user override parameters while providing good defaults for all of them inside your package. Again this is probably my small view into how I work. |
@Galileo-Galilei First, really thank you for this well-written issue and great analysis of some of the main challenges we currently face in Kedro. The things you have pointed out are a real problem we are trying to address, and we certainly are aware of those challenges. Your thoughts on that are really helpful since we mainly have access to the perspective of McKinsey and QuantumBlack users and hearing the viewpoint of someone not affiliated with our organisations is super valuable. I would like to add a few comments and maybe some clarifications on our thinking (or at times mostly my thoughts as Kedro's Tech Lead, since some of those might not have crystallised completely yet to be adopted as the official view of the team).
Deployment / orchestration
A lot of Kedro is inspired by the relevant bits of The Twelve-Factor App methodology in order to aid deployment. Initially Kedro was often mistaken for an orchestrator, but the goal of Kedro has always been to be a framework helping the creation of data science apps which can then be deployable to different orchestrators. However this view might not have been perfectly reflected in the architecture due to lack of experience on our side and user side alike. Most recent changes in the architecture though have moved towards that direction as @WaylonWalker pointed out. In the future we’ll double down on the package deployment mode, e.g. you should be able to run your project as a Kedro package and the only necessary bit would be providing the configuration (currently under Now for the deployment model, we see a future where our users will structure their pipelines using namespaces (aka modular pipelines). Thus they will form hierarchies of nodes, where the grouping would be semantically significant for them. The top-level pipelines will be consisting of multiple modular pipelines, joined together into the overall dag. This way modular pipelines can be analogous to folders and nodes to files, e.g. 
After having your pipeline structured like that, we can provide a uniform deployment plugin where users can decide the level at which their nodes will be run in the orchestrator, e.g. imagine something like There are some additional subtleties we need to take care of, e.g. running different namespaces on different types of machines, e.g. GPU instances, Spark clusters, etc. But I guess the general idea is clear - the pipeline developer will have much better control over how things get deployed without actually needing to learn another concept or make big nodes. They will just need to make sure that their pipeline is structured in a way that is semantically meaningful for them and for the orchestration, which is already an implicit requirement anyway and people tend to do that as per your example, but not in a standard way.
Configuration
Logging
This one is supposed to be not needed, since Kedro has exactly the same defaults. So teams can directly get rid of it, unless they would like to change the logging pattern for different platforms, e.g. if you would like to redirect all your logs towards an ElasticSearch cluster, Sumologic or any other log collecting service out there. This configuration is environment specific (locally you might want colourful logging, but on your orchestrator that will be undesirable) and that's why it's not a good idea to package it with your code.
Credentials
This one is obviously environment specific, but what we should consider doing is adding environment variable support. Unfortunately this has been on the backlog for a while, but doesn’t seem to be such an important issue that cannot be solved by DevOps, so we never got to implementing the environment variables for credentials.
Catalog
This is a way bigger topic and much less clear how to solve it in a clean way, but something we have had on our radar for quite some time. 
We want to come up with a neat solution for this one by the end of 2021, but obviously there are many factors that will come into play and I cannot guarantee we can get it done by then.
History of the problem
In my opinion, this challenge came from the fact that we treat each dataset as a unique type of data and this comes from the fact that we did not foresee that Kedro would enable the creation of huge pipelines on the order of hundreds of nodes with hundreds of datasets. However now most of our users internally have very big pipelines and a lot of intermediary datasets, which need to be defined in the catalog and not just passed in memory. Thus that created huge configuration files, which a lot of people wanted to simplify. That’s why the TemplatedConfigLoader was born out of user demand and not without some hesitation from our side.
Why the current model is failing
The problem with the TemplatedConfigLoader is that it solves the symptom, but not the real problem. The symptom is the burdensome creation of many catalog entries. The problem is the need for those entries to exist at all. Maybe to clarify here, I will refer to web frameworks like Django or Rails - in all web frameworks, you define only one database connection and then the ORM implicitly maps the objects to that database. In Kedro, each object (i.e. dataset) needs to be configured on its own. Kedro’s model is good if you have a lot of heterogeneous datasources (like the case of pipelines fetching data from multiple independent sources). But it quickly dissolves into chaos as you add multiple layers of intermediary datasets, which are, if not always, then for the most part, pointing to the same location and can be entirely derived from the name of the dataset. So the challenge here is that we need to support both per-dataset catalog entries and one configuration entry for hundreds of datasets. Whatever solution we come up with needs to work for both cases and be declarative at the same time. 
Why the catalog is configuration and not code
As we are trying to emulate the one-build-multiple-deployments model, it becomes very clear that all catalog entries are entirely environment specific (e.g. with one build you might deploy once to S3 and then the second time to ABS or GCS). So this is definitely configuration that needs to live outside your codebase. However the current mode of defining every single dataset separately makes this process completely unmaintainable, so people came up with the templated config solution with the
Parameters
The parameters configuration is an odd one because everyone uses it for different things. E.g. we see many users using it as a way to document their default values of all of their parameters, even when they don’t need to change that parameter. That made the parameter files huge and now they are very hard to understand without some domain knowledge. Some teams use these files as a way for non-technical users to do experiments on their own. Some teams would love to package their parameters in their code, since they treat it as a single place for all their global variables that they can use across their pipeline. The main challenge I see for the parameters files is that the way we merge those from One can argue that there should be a way to have a place in your
Summary
I might not have answered any questions here or even given very specific directions on how Kedro will develop in the future, but the reason for that is that we don’t have very clear direction set yet on solving those problems. I hope that I have provided some insight into our understanding of the same problems and potentially clarifications why we haven’t solved them yet. One thing is sure though, we have this on our roadmap already and its turn is coming soon, e.g. 
there’s only 2 other things in front of it 🙂 Thanks for sharing your view on how we could tackle that and while we might not implement it as you have suggested, we'll definitely consider drawing some inspiration from it when we design the new solution. One particular detail that I like is getting rid of the |
Hi @Galileo-Galilei - I just wanted to say this is a high priority for us and point you towards our community update later this week, sign up here. The event starts at 6:30 PM here in London - see how that works for your timezone here. |
Wow, I just discovered this issue after starting a similar thread in GitHub Discussions. This issue is much more in-depth and I agree with most of it. I have been wanting to upgrade kedro but it is not easy, and it seems that 0.18.x will break something, so I am still waiting for it. @WaylonWalker Could you give an example of how 0.17.x makes it easier? |
Hi, thank you very much to all who went on to discuss the issue at stake here, and especially to @idanov for sharing your vision of kedro's future. This is extremely valuable to @takikadiri and me for increasing Kedro usage inside our organisation. First of all, apologies to @datajoely: I was aware of this retrospective, but I was (un?)fortunately on vacation this week with almost no internet connection and I couldn't join it. I had a look at the slides, which are very interesting! Here are some thoughts / answers / new questions which arise from the above conversation, in no specific order:
On 0.17.x increased modularity and flexibility
Disclaimer: I have not used 0.17.x versions intensively, apart from a few tests. I compare the features to the 0.16.X ones hereafter. From my personal experience, here is my list of pros and cons about 0.17.X features:
My team does not plan to migrate its existing projects because it generates a lot of migration costs (we have dozens of legacy projects + an internal CI/CD to update) and the advantages are not yet sufficient to justify such costs. @WaylonWalker, you claim that "0.17.x is MUCH more modular". Do you have any real-world example of something which was not straightforward with 0.16.X versions and which is now much easier?
On treating everything as a package
I perfectly agree on this point (and we do the same), but it raises two different points:
On deployment/orchestration
I have seen your progress on the topic, and I acknowledge that only needing a Regarding the deployment model, you are cutting the ground from under my feet: In the "Universal Kedro deployment (Part 2)", I plan to address the transition between different pipeline levels in a very similar way :) Kedro definitely needs a way to "factor and expand" the pipelines to have different view levels. This would be beneficial for a transition to another DAG tool, but also for the frontend (kedro-viz visualisation), which becomes overcrowded very quickly. That said, I would not rely on the template's structure for several reasons:
I guess a declarative API (e.g. letting
On configuration (back to the original topic :))
Logging
Logging is obviously environment specific, I apologize if you thought I implied the opposite. I just meant we need a default behaviour, but if I understand what you are saying, it is already the case.
Credentials
I do not understand what you mean by "[it] doesn’t seem to be such an important issue that cannot be solved by DevOps". My point is precisely that many CI/CD tools expect to communicate with the underlying application through environment variables (to my knowledge: I must confess that I am far from being a devops expert), and it is really weird to me that this is not "native" in kedro. I must switch to the Whatever the problem is, it should at a minimum be better documented than it is now, given that some beginners ask this question on various threads, with a few ugly solutions (e.g. https://discourse.kedro.community/t/load-credentials-in-docker-image-using-env-vars/480, #49). The best reference I can find is in the issue #403.
Catalog
First, I agree that it is a big topic, and unlike most others I do not have a clear vision (yet?) of how it should be refactored. Some unsorted thoughts:
but in my opinion the root of all evil comes from this commit c466c8a, when the catalog .yml became "code" and no longer configuration with the ability of dynamically creating entries. I strongly advocated against it in my team, even if I understood why some users needed it.
and I cannot agree more. However, given the "debugging" use of the catalog, I totally agree that you should support both ways (per-dataset configuration and one configuration for several datasets) of defining catalog entries.
Parameters
We encountered almost all the use cases described here (overriding only a nested key, providing a way for a non-technical user to experiment, packaging the parameters) in different projects. The size of the Being able to override a nested parameters structure with a syntax like As you suggest (and as I describe in my original post), my team uses this file to define default values, and the only really "moving" parameters are injected via the
On your summary
Sharing your vision on this is definitely valuable. I guess it will take a bunch of iterations to tackle the problem completely and reach an entirely satisfying configuration management system, but some of the ideas discussed in this thread (moving
I am aware of experiment tracking, I wonder what the other one is ;)
I only care about the implemented features, not the implementation details. The goal of this thread is more to see whether the problem was shared by other teams, and to discuss the pros and cons of the different suggestions.
It seems quite a consensus in this thread that if we want to reduce the feature request to its core component, this would be the very one thing to implement. |
It's almost the 3 year anniversary of this issue 🎂🎈 I'm watching @ankatiyar's Tech Design session on I'd like to know what folks think about it in the current state of things. I don't want to drop a wall of text so here's my best attempt at summarising my thoughts:
ds:
  type: spark.SparkDataset  # Your code will break if you change this to pandas.CSVDataset
  filepath: ...  # Your code is completely independent from where the data lives, the dataset takes care of it
and in fact @Galileo-Galilei hinted that when he wrote this proposal:
# src/applicative-conf/catalog.yml
my_input_dataset:
  type: pandas.CSVDataSet
  filepath: ${INPUT_PATH}
It's fuzzy because during development users should be able to freely explore with different configurations for these (see also #1606) but then during production these parameters become "fossilized" and tied to the business logic.
With the experience we've gained in the past 3 years, the improvements in Kedro (namespace pipelines became a reality, Tagging @lrodriguezlujan and @inigohidalgo because we've spoken about these recently as well. |
Hi, this is a very valid question that needs to be answered. We've accomplished a lot, and this needs to be reassessed. I have created a demo repository to implement what is suggested above and evaluate how easy it is to configure with recent versions, and what is still to be improved. I'll report my conclusions here when I am ready. |
Hi @Galileo-Galilei, I notice that you added some notes in your demo repository. We are trying to use Discussions for feature requests & enhancement proposals #3767 and doing an issue cleanup in the meantime. Since this issue is long and complex, and is the first in your 4-part series, would you be okay writing here your thoughts on where we currently stand, so that we can either move this whole issue to Discussions or just close it and open follow-up, more focused Discussions? |
The above issue suggests a specific workflow and a lot of modifications to the framework. I'll try to sum up the state of the different features requested above, and start by putting some of them aside because they will be tackled in other issues. I will then focus on the core request above, that is the integration in the src folder of the template of part of the configuration.
Exposing credentials through the CLI
This will likely be addressed with #4320 and has been (and will be) largely discussed there. The only question left that I personally don't understand is the design choice to not make
Exposing configuration with runtime_params
Situation in kedro 0.19
Above syntax works almost "out of the box", since
Unsolved issues
However, I still find the dev experience not as good as it could be on this topic and I have a bunch of other feature requests specifically around it, but it is worth splitting them into a separate discussion and addressing each of them specifically:
Modifying the template to separate applicative and external configuration
Situation in kedro 0.19
The "official" way does not work...
According to the official documentation from the configuration page (though it requires a good understanding of kedro because all these settings are scattered across the page), we can manually update the template as follows:
# default conf is at the root
CONF_SOURCE = "."
# Keyword arguments to pass to the `CONFIG_LOADER_CLASS` constructor.
CONFIG_LOADER_ARGS = {
    "base_env": "src/conf_app",
    "default_run_env": "conf/local",
}
✅ This kind of works: when you execute
...but there is an unexpected workaround
If you change
# settings.py
from pathlib import Path

CONFIG_LOADER_ARGS = {
    "base_env": (Path(__file__).parents[1] / "conf_app").as_posix(),
    "default_run_env": "local",
}
✅ It does work as expected: locally,
❌ The bad news:
Path("conf") / r"C:\Users\...\spaceflights-pandas\src\conf_app" returns WindowsPath('C:/Users/.../spaceflights-pandas/src/conf_app') when
Important decisions to make before we can consider this feature request properly addressed
|
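The path-joining behaviour mentioned in the settings.py workaround above can be checked in isolation. The snippet below is a minimal sketch using only the standard library: the paths are hypothetical, and the assumption that the config loader builds the environment folder by joining the conf source with `base_env` is an inference from the `Path("conf") / ...` observation in the thread, not a statement about Kedro's internals.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Joining a relative folder with an ABSOLUTE path discards the left-hand side,
# which is why an absolute "base_env" can escape the "conf" folder.
# All paths below are hypothetical.
posix_joined = PurePosixPath("conf") / "/home/user/project/src/conf_app"
windows_joined = PureWindowsPath("conf") / r"C:\Users\user\project\src\conf_app"

# A relative right-hand side is simply appended, as with the usual "base" env.
relative_joined = PurePosixPath("conf") / "base"

print(posix_joined)
print(windows_joined.as_posix())
print(relative_joined)
```

The pure path classes are used so the snippet behaves the same on any operating system; with a concrete `Path`, the same rule applies for the platform's own flavour.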
There are some absolute bits of 🏅 gold dust in this write up @Galileo-Galilei 💪 The bits that stick out to me:
This is so on the money and I've not really thought about it before. Globals has always been a compromise, we've resisted doing it at every stage so we're left with a situation that evolved and was never holistically designed.
This is also very important and we can make some small tweaks to massively improve the developer / consumer experience. This point also touches on the wider question of what 'data contracts' kedro implicitly expects, given that we don't have any proper validation (pydantic, type hints etc). |
|
Kedro and the twelve-factor methodology have different interpretations of configuration semantics. I believe this disparity in meaning is the primary obstacle to implementing this feature. I think that only the user can define which parts of the project vary between deploys, depending on their context and needs. Those varying parts could be declared as Here are some benefits that could be enabled by this feature:
|
Preamble
Dear Kedro team,
I've been using Kedro since July 2019 (kedro==0.14.3, which is quite different from what kedro is now) and my team has deployed in the past 2 years a few dozen machine learning pipelines in production with Kedro. I want to give you some feedback on my Kedro experience along this journey, and the advantages and drawbacks of Kedro from my team's point of view with the current versions (0.16.x and 0.17.x):
Advantages:
Drawbacks:
Project orchestration: you assume we will map kedro nodes to the orchestrator nodes. This is not realistic, and in a discussion with @limdauto here we agreed on the fact that the conversion to the pipeline's nodes is complicated and must be thoroughly thought through by the person in charge of the deployment
Configuration management: All deployment tutorials assume that configuration will be changed directly inside the kedro project (e.g., modify the catalog to persist some objects, change paths to make them relative...). This makes the very strong assumption (which does not often hold in my personal experience) that the person who will deploy the project (the ops) has access to the underlying application (i.e. the code folder). This is the issue addressed in this design document.
This issue is likely the first one of a series, and I will focus specifically on Kedro's configuration management system. To give some credit, the vast majority of the suggestions hereafter come from discussions, trials and errors with @takikadiri when trying to deploy our Kedro projects.
Disclaimer: I may use the words "should" or "must" in the following design document, and use very assertive sentences which reflect my personal opinion. These terms must be understood with regard to the underlying software engineering principles I describe explicitly when needed. My sincere apologies if it offends you, it is by no means an order to do a specific action, I know you have your own clear vision of what Kedro should bend towards.
Context
Deploying a kedro application
A brief description of the workflow
A common workflow (at least for me, as a dev) is to expose some functionalities to an external person (an ops) who will be in charge of creating the orchestration pipeline. A sketch of the workflow is the following:
kedro run --pipeline=<pipeline_name>
pipeline_3=pipeline_1+pipeline_2
, because we often do not want to execute them at the same time, and we want to have retry strategies because the logic is much more complex than this example
Deployment constraints to deal with
Note that changing the workflow or asking the ops to modify the kedro project are out of the list of possible solutions, since I work in a huge organisation with strictly standardized processes that cannot be modified only for my team.
Challenges created by Kedro's configuration management implementation
Identifying the missing functionality: overriding configuration at runtime
With regard to the previously described workflow, it should be clear that the ops must be able to inject some configuration at runtime, e.g. some credentials (password to database connection, to mlflow), some paths to the data, possibly some parameters... This should be done without modifying the yaml config files: the project folder is not even visible to the ops, and we want to avoid operational risk if he were to modify the configuration of a project he knows nothing about.
Overview of potential solutions and their associated issues as of kedro==0.17.3
With the current version of kedro, we have two possibilities when packaging our project to make it "executable":
catalog.yml seems common)
conf folder to src/, or by packaging the entire folder (e.g. with a run.sh file at the root to make it "executable like"). This is roughly what is suggested by @WaylonWalker in "Package conf with the project package" #704 and while it is in my opinion better than the previous bullet point, it is not acceptable as is for the following reasons:
As a conclusion, both solutions have critical flaws and cannot be considered as the correct way to handle configuration management when deploying a kedro project as a standalone application.
Thoughts and design suggestions for refactoring configuration management
Underlying software engineering principles : decoupling the applicative configuration from the external configuration
All the problems come from the fact that Kedro currently considers all configuration files as identical while they have different roles:
catalog.yml and the parameters.yml are project specific (they contain the business logic) and we do not expect our users to modify them, except maybe some very small and specific parts that the dev must choose and control. It is not reasonable to assume that the person who will deploy the app knows Kedro's specificities and the underlying business logic. These files are the applicative configuration and must be packaged with the project. We should likely package the logging.yml file too, because it is very likely that only advanced users will need to modify it.
credentials.yml (and the globals.yml if one uses the TemplatedConfigLoader as suggested in your documentation) are exposed to our users and must be modified/injected at runtime. They are the external configuration. They depend on the IT environment they are executed in, and they should NOT be packaged with the code, in keeping with the build once, deploy everywhere principle.
Refactoring the configuration management
Part 1: Refactor the template to make a clear separation between external and application configuration
I suggest refactoring the project template from this:
With such a setup, the applicative configuration should be packaged with the project, which will make the pipelines much more portable. Two key components should be updated to match all the constraints: the TemplatedConfigLoader and the run CLI command.
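As a rough illustration of the layering this implies, the sketch below merges a packaged applicative configuration with an optional external override. It is not Kedro's loader: `deep_merge` is a hypothetical helper, the dictionaries stand in for parsed YAML files, the override path is made up, and the recursive (nested) merge is an assumption chosen for the example.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical applicative configuration, shipped inside the package.
applicative_conf = {
    "my_input_dataset": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/input.csv",
    }
}

# Hypothetical external configuration, provided by the ops at deployment time.
external_conf = {"my_input_dataset": {"filepath": "/mnt/shared/input.csv"}}

final_conf = deep_merge(applicative_conf, external_conf)
print(final_conf)
```

With this choice the ops only has to state what differs in their environment, while the dataset type chosen by the developer stays untouched.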
Part 2: Update the ConfigLoader
conf/base folder to src/.
With this system, the dev would choose and define explicitly what is exposed to the end users thanks to the TemplatedConfigLoader system, e.g.:
Part 3 (optional): Update the run command
If possible, the run command should explicitly make it possible to dynamically expose only the variables in the globals. Once packaged, the end user would be able to run the project with either:
kedro run (use default values) -> he will need to add INPUT_CREDENTIALS as an environment variable since there is no default for it
kedro run --INPUT_CREDENTIALS=<MY_VERY_SECURED_PASSWORD> (use default values + inject password at runtime, not secured at all, it will end up in the log!)
kedro run --NUMBER_OF_TREES=200 (still with INPUT_CREDENTIALS as an environment variable)
The end user cannot modify what is not exposed by the developer through the CLI or env variables (e.g. save args for the CSVDataSet), except if they are exposed in the globals_default.yml file and made dynamic by the developer.
Obviously, the user can still recreate a conf/<env-folder>/catalog.yml file to override the configuration, but he should not be forced (nor even encouraged) to do this.
Alternative considered
I could create a plugin to implement such changes by creating a custom ProjectContext class, but the suggested template changes, albeit easy to implement, would make it hard to follow your numerous evolutions in the template. It would make much more sense to implement at least these template changes in the core library.
@yetudada, sorry to ping directly, but you told me you were working on configuration refactoring. Do such changes make sense in the global picture you have in mind?
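To make the resolution order suggested in Parts 2 and 3 more concrete, here is a minimal Python sketch of how exposed variables could be filled in. It is not the TemplatedConfigLoader nor any existing Kedro API: `render_config` is a hypothetical helper, the template and default values are invented for the example, and the precedence "CLI argument, then environment variable, then default" is an assumption (the proposal does not say which of CLI or environment should win).

```python
import re
from typing import Any, Mapping

# Matches ${NAME} placeholders such as ${INPUT_PATH}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_config(
    template: str,
    defaults: Mapping[str, Any],
    env: Mapping[str, str],
    cli_args: Mapping[str, Any],
) -> str:
    """Replace each ${NAME} using CLI args first, then env vars, then defaults."""

    def lookup(match):
        name = match.group(1)
        for source in (cli_args, env, defaults):
            if name in source:
                return str(source[name])
        # No default and nothing injected: fail loudly instead of guessing.
        raise KeyError(f"no value provided for exposed variable {name}")

    return PLACEHOLDER.sub(lookup, template)


# Hypothetical applicative configuration text (a catalog entry and a parameter
# are mixed here purely for brevity).
TEMPLATE = """\
my_input_dataset:
  type: pandas.CSVDataSet
  filepath: ${INPUT_PATH}
  credentials: ${INPUT_CREDENTIALS}
n_estimators: ${NUMBER_OF_TREES}
"""

# Hypothetical defaults chosen by the developer; INPUT_CREDENTIALS has none.
DEFAULTS = {"INPUT_PATH": "data/01_raw/input.csv", "NUMBER_OF_TREES": 100}

rendered = render_config(
    TEMPLATE,
    defaults=DEFAULTS,
    env={"INPUT_CREDENTIALS": "my_credentials_key"},  # would be os.environ in practice
    cli_args={"NUMBER_OF_TREES": 200},
)
print(rendered)
```

Anything that is not written as a placeholder in the packaged template cannot be changed this way, which mirrors the idea that the developer decides what is exposed.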