Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazy filesystem operations at FileConfig #113

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

anatoly-scherbakov
Copy link

@anatoly-scherbakov anatoly-scherbakov commented Jun 21, 2019

Problem. The call dw.load_dataset('...', auto_update=True) won't run on AWS Lambda with the error:

[Errno 30] Read-only file system: '/home/sbx_user1051'

The crash happens at datadotworld.config:FileConfig.__init__, at the line

os.makedirs(path.dirname(self._config_file_path))

The reason of this is that the single directory you can write at in AWS Lambda is /tmp. Note the environment variables (DW_AUTH_TOKEN, DW_CACHE_DIR, DW_TMP_DIR) were correctly (as can be seen from the following) configured in Lambda panel, and the two latter ones were pointed to /tmp/dw_cache/ and /tmp/dw_tmp/ directories, respectively.

Thus, we expect DW to look into environment variables but it still reaches out to the readonly filesystem. The mechanism of this is located at datadotworld._get_instance() function.

Manual solution is straightforward: we can use dw.DataDotWorld(config=dw.EnvConfig()).load_dataset(...) - that works perfectly.

However, ChainConfig seems to be built specifically for use cases like this, with the intention of automatically guessing what config to use without unwanted side effects from other configs.

Automatic Solution, step 1: FileConfig operations. This pull request moves over the FS-dependent code into a function of FileConfig and wraps self._config_parser into a lazy property. A test (however clumsy) is written. Proves to work with AWS Lambda, but not to the full extent. It crashes:

...
  File "/var/task/datadotworld/__init__.py", line 101, in load_dataset
    auto_update=auto_update)
  File "/var/task/datadotworld/datadotworld.py", line 152, in load_dataset
    dataset_key, cache_dir)
  File "/var/task/datadotworld/client/api.py", line 494, in download_datapackage
    shutil.move(unzipped_dir, dest_dir)
   ...
  File "/var/lang/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)

Why?

Automatic Solution, step 2: Cache dir. Research shows that ChainConfig is looping through all configs in its chain and fetching the first one which returns a non-None value for the field it is in need right now. That's what happens for cache_dir, which is hardcoded in DefaultConfig to be path.expanduser('~/.dw/cache'). Since InlineConfig is the first element in the chain and directly derives this field from DefaultConfig, - ChainConfig never even asks EnvConfig for this field. Using the default config chain mechanism, we can never override cache_dir.

I am removing this default declaration from DefaultConfig and moving it over to FileConfig where it does not bother me, but I realize a) this might be a backwards incompatible change and b) there is no actual reason to put the default value specifically into FileConfig. The more correct mechanism would be probably to poll all configs in the chain and, only if none of them returned a valid value, get a default value out of the sleeve.

Would be happy to hear feedback and thoughts.

@markmarkoh
Copy link

@anatoly-scherbakov thanks for the pull request. I'm also surprised that setting DW_CACHE_DIR, DW_TMP_DIR didn't work by default, so thanks for digging in further and figuring out why they don't work. Getting this library to work seamlessly on a lambda and respect the ENV based configs is definitely a good move forward.

I'll circle back with members of the team to determine the best course of action.

Copy link
Contributor

@laconc laconc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed report and for putting this together!

This looks great. My concerns are that it does make backwards-incompatible changes to DefaultConfig as you alluded to, and breaks InlineConfig (by no longer providing the cache dir.) We're looking to keep compatibility until v2 but we'll certainly re-visit this situation then.

Moving the FS-dependent code into a function of FileConfig was the right call and we should definitely keep that.

One correction that I want to make is that InlineConfig isn't the first element in the ChainConfig chain. As you can see here, EnvConfig is looked at first. Also, this test, using this fixture, are testing that ChainConfig reads the DW_CACHE_DIR env var as expected, and that all seems correct.

I'm not certain what led to that last error but would you mind double-checking things? And please let me know if I misunderstood something.


return self._config_parser

def get_config_parser(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of consistency, would you mind adding a dunder to this method name since it's private?

Suggested change
def get_config_parser(self):
def __get_config_parser(self):

config = FileConfig(config_file_path='/foo/bar/baz/')

with pytest.raises(PermissionError):
config.get_config_parser()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The public accessor here is just config_parser since it's a property.

Suggested change
config.get_config_parser()
config.config_parser


def __validate_config(self):
config_parser = self.config_parser
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels unnecessary since it's only ever called once in this method; how about just calling self.config_parser down there?

@laconc
Copy link
Contributor

laconc commented Jul 12, 2019

Also, sorry about the issues with the unit tests. I've put a fix in and if you merge master, those should be good now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants