Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud native credentials storage #1280

Closed
pypeaday opened this issue Feb 23, 2022 · 4 comments
Closed

Cloud native credentials storage #1280

pypeaday opened this issue Feb 23, 2022 · 4 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@pypeaday
Copy link

Description

It is frustrating to try and incorporate secure credentials storage mechanisms that are cloud-native such as AWS Secrets Manager with Kedro because of the hard reliance on a credentials.yml file. I have worked around this by hijacking the config registration and forcing a function call to get the secrets but it's definitely a hack.

Context

I think this change is important from a security standpoint but also for collaboration. I'm on a team of several people and we all share credentials for common data sources. It's hard enough to manage distinct credential files per project (in my space we have dozens of kedro projects and growing, and I have raised an issue on global credentials file here once before), but it's even more difficult to adhere to some security standards (like securing credentials via an enterprise approved storage solution) when the only option is a plain text file in a repo.

Possible Implementation

I think there could be something in the config registration that might make sense and an added key for DataSets that support a credentials key already.

@pypeaday pypeaday added the Issue: Feature Request New feature or improvement to existing feature label Feb 23, 2022
@Galileo-Galilei
Copy link
Member

I have no time for now, and it will likely take weeks before I came up with something intelligible, but this is a topic on which I plan to write a "Universal Kedro Deployment" issue. I think there are some adherence with this #770, but credentials have a lot of specificities indeed. In short my idea is that:

  • kedro should have an abstract class (~roughly similar to AbstractDataSet, say CredentialsManager) to implement the _get_credentials() function. It should be able to get credentials from anywhere (e.g. VaultCredentialsManager, GithubCredentialsManager and FileCredentialsManager which would default to current implementation) and return a dict of credentials.
  • This class should leverage the ConfigLoader when possible
  • This class should be parametrized in the settings.py

@antonymilne
Copy link
Contributor

antonymilne commented Feb 24, 2022

Thanks for the issue @nicpayne713 and your thoughts @Galileo-Galilei. I don't know anything about these credentials management systems, but what you both said here all sounds very sensible to me.

@nicpayne713 Just to understand better, could you outline in a bit more detail how you're solving this now?

As far as I know there's two possible ways of injecting environment variables into credentials in 0.17.x:

  1. Through the register_catalog hook as in your issue Global credentials file for multiple pipelines #930
  2. Through the register_config_loader hook, TemplatedConfigLoader and credentials.yaml

Although both the above use environment variables, there's no reason you can't inject arbitrary Python code into both these places to get credentials from elsewhere. So for AWS Secrets, where you don't have access to credentials through environment variables (is that right?) you instead need to inject some code like this. Please let me know if this is correct or if I've missed something here!

My immediate concern here is actually that, as it stands in 0.18, method 1 above would no longer be possible since the move of the DataCatalog.from_config call from a registration hook to a variable in settings.py does not expose the kwargs, so you wouldn't be able to customise what gets put into the credentials argument at all. Let me raise a separate issue to see if we should expose those kwargs so that this is still possible. Then you would be able to do something like this:

# settings.py
# You don't actually need to specify DATA_CATALOG_CLASS here since DataCatalog is the default value 
DATA_CATALOG_CLASS = DataCatalog 
# These arguments will get injected into the `DataCatalog.from_config` call
DATA_CATALOG_ARGS = {
    "credentials": get_aws_secrets(),
}

... which actually doesn't seem too hacky to me. Does this seem like a reasonable solution to you for the immediate future, or is the whole idea of credentials as a custom dictionary not compatible with the AWS secrets manager for some reason?

@Galileo-Galilei when do you write your thoughts up I'll be very interested in reading why a new CredentialsManager class would be preferable to the above DATA_CATALOG_ARGS solution. One problem I immediately see with the above is that it, unless we put in some custom logic to avoid this, it would overwrite any existing credentials.yaml definitions rather than merging credentials dictionaries together. So you'd have to choose just one credentials management system rather than being able to compose multiple ones together like you might do with classes. I don't know in practice whether that's a common requirement though?

@pypeaday
Copy link
Author

pypeaday commented Feb 24, 2022

@AntonyMilneQB, I'm not an AWS Secrets Manager expert but based on my experience I see that you can put any JSON as a secret and therefore also store simple key/value pairs. I don't know how injecting env variables would help unless each env variable was a DataSet key or something for a utility to query Secrets Manager and get back what it needs. As for the credentials to get into Secrets Manager, the boto SDK has several places it checks that I don't know the order of but it looks at env variables for AWS_SECRET_KEY and AWS_SECRET_ACCESS_KEY_ID or something plus a few other potential ones, it also looks in ~/.aws. for a credentials file and then based on env variables and aws config will use the credentials relevant for the specified profile. All that is just connecting to AWS though, I don't know of a boto-native way to use env variables to config Secrets Manager queries.

Storing credentials as a custom dictionary is definitely fine with Secrets Manager since a secret is just JSON.

My quick solution was to implement basically that exact code snippet via what @datajoely suggested in my other older issue #930 . I wrote a small library with a function get_secrets (just for abstraction from Secrets Manager in our projects) and then in the hook that registers the config I check if there's an empty dictionary or not ( in general this wouldn't be the right move if the goal was to support several credential management systems, but for our use case it does what I want ) and if the credentials dictionary is empty then it calls my get_secret function and gives it the catalog and returns a dictionary with the credentials that are necessary given that catalog (or blank sqlalchemy connection strings so that our tests can still pass ok). I wanted to write an explicit hook but I'm still newish to Kedro and had no idea how to get the credentials loaded up before the catalog registration was complete.

At first glace I like what @Galileo-Galilei and I'd like to help here how I can!

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 7, 2022
@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Mar 21, 2022
lvijnck pushed a commit to lvijnck/kedro that referenced this issue Apr 7, 2022
@antonymilne
Copy link
Contributor

Closing this because we're starting to think about how to make it a reality now! Just to keep all the discussion in one place, let's continue in #1646.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Archived in project
Development

No branches or pull requests

4 participants