-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lazy filesystem operations at FileConfig #113
base: main
Are you sure you want to change the base?
Lazy filesystem operations at FileConfig #113
Conversation
@anatoly-scherbakov thanks for the pull request. I'm also surprised that setting I'll circle back with members of the team to determine the best course of action. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed report and for putting this together!
This looks great. My concerns are that it does make backwards-incompatible changes to DefaultConfig as you alluded to, and breaks InlineConfig (by no longer providing the cache dir.) We're looking to keep compatibility until v2 but we'll certainly re-visit this situation then.
Moving the FS-dependent code into a function of FileConfig was the right call and we should definitely keep that.
One correction that I want to make is that InlineConfig
isn't the first element in the ChainConfig
chain. As you can see here, EnvConfig
is looked at first. Also, this test, using this fixture, are testing that ChainConfig
reads the DW_CACHE_DIR env var as expected, and that all seems correct.
I'm not certain what led to that last error but would you mind double-checking things? And please let me know if I misunderstood something.
|
||
return self._config_parser | ||
|
||
def get_config_parser(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of consistency, would you mind adding a dunder to this method name since it's private?
def get_config_parser(self): | |
def __get_config_parser(self): |
config = FileConfig(config_file_path='/foo/bar/baz/') | ||
|
||
with pytest.raises(PermissionError): | ||
config.get_config_parser() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The public accessor here is just config_parser
since it's a property.
config.get_config_parser() | |
config.config_parser |
|
||
def __validate_config(self): | ||
config_parser = self.config_parser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels unnecessary since it's only ever called once in this method; how about just calling self.config_parser
down there?
Also, sorry about the issues with the unit tests. I've put a fix in and if you merge master, those should be good now. |
Problem. The call
dw.load_dataset('...', auto_update=True)
won't run on AWS Lambda with the error:The crash happens at
datadotworld.config:FileConfig.__init__
, at the lineThe reason of this is that the single directory you can write at in AWS Lambda is
/tmp
. Note the environment variables (DW_AUTH_TOKEN
,DW_CACHE_DIR
,DW_TMP_DIR
) were correctly (as can be seen from the following) configured in Lambda panel, and the two latter ones were pointed to/tmp/dw_cache/
and/tmp/dw_tmp/
directories, respectively.Thus, we expect DW to look into environment variables but it still reaches out to the readonly filesystem. The mechanism of this is located at
datadotworld._get_instance()
function.Manual solution is straightforward: we can use
dw.DataDotWorld(config=dw.EnvConfig()).load_dataset(...)
- that works perfectly.However,
ChainConfig
seems to be built specifically for use cases like this, with the intention of automatically guessing what config to use without unwanted side effects from other configs.Automatic Solution, step 1: FileConfig operations. This pull request moves over the FS-dependent code into a function of
FileConfig
and wrapsself._config_parser
into a lazy property. A test (however clumsy) is written. Proves to work with AWS Lambda, but not to the full extent. It crashes:Why?
Automatic Solution, step 2: Cache dir. Research shows that
ChainConfig
is looping through all configs in its chain and fetching the first one which returns a non-None
value for the field it is in need right now. That's what happens forcache_dir
, which is hardcoded inDefaultConfig
to bepath.expanduser('~/.dw/cache')
. SinceInlineConfig
is the first element in the chain and directly derives this field fromDefaultConfig
, -ChainConfig
never even asksEnvConfig
for this field. Using the default config chain mechanism, we can never overridecache_dir
.I am removing this default declaration from
DefaultConfig
and moving it over toFileConfig
where it does not bother me, but I realize a) this might be a backwards incompatible change and b) there is no actual reason to put the default value specifically intoFileConfig
. The more correct mechanism would be probably to poll all configs in the chain and, only if none of them returned a valid value, get a default value out of the sleeve.Would be happy to hear feedback and thoughts.