-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up a path bug + add tests for it #91
Conversation
datalogistik/datalogistik.py
Outdated
def main(dataset=None): | ||
if dataset is None: | ||
dataset = cli.parse_args_and_get_dataset_info() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it easier to test with various dataset objects
try: | ||
|
||
# This should be in the cache, but needs to be converted | ||
close_dataset = dataset.Dataset(name="fanniemae_sample", format="parquet") | ||
|
||
with pytest.raises(SystemExit) as e: | ||
datalogistik.main(close_dataset) | ||
assert e.type == SystemExit | ||
assert e.value.code == 0 | ||
|
||
captured = json.loads(capsys.readouterr().out) | ||
assert captured["name"] == "fanniemae_sample" | ||
assert captured["format"] == "parquet" | ||
assert isinstance(captured["tables"], dict) | ||
|
||
finally: | ||
# cleanup all files that weren't there to start with | ||
for file in set(os.listdir(test_dir_path)) - set(start_files): | ||
shutil.rmtree(pathlib.Path(test_dir_path, file)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the right pattern for cleaning up a messy test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could put things in a tempdir with tempfile.TemporaryDirectory()
or whatever just in case the cleanup doesn't get run for whatever reason, but this seems pretty safe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kinda like using a TemporaryDirectory which I know will be cleaned up no matter what. Then you don't need the try-finally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, this will fix the failing windows test haha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nods yeah that works generally, though here we're operating inside of the datalogistik cache.
I guess we could copy the cache fixture to the temp directory and then we know it's gone... I'll make an issue for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a couple questions
@@ -30,8 +30,9 @@ def finish(): | |||
sys.exit(0) | |||
|
|||
|
|||
def main(): | |||
dataset = cli.parse_args_and_get_dataset_info() | |||
def main(dataset=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: Could we get a type hint here so I can see how to override this if I need to when messing with tests or whatever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eh, after reading I guess this is a Dataset
instance, but more specifically the metadata for one that's for the right dataset, but not necessarily the right format. So a type hint wouldn't do it anyway; it'd have to be a comment or docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
tmp_dir_name = pathlib.Path( | ||
output_file.parent, f"{output_file}.tmp" | ||
) | ||
tmp_dir_name = pathlib.Path(f"{output_file}.tmp") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure what's going on here, but this is now fully relative, which is may not have been before. Are we sure there aren't situations in which this will write to a directory other than what's expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the bug I was correcting. It was actually still relative before, but then also adding the parent on to it. So if output_file was like foo/bar/file.parquet
this made tmp_dir_name
into /User/name/.datalogistik_cache/foo/bar/foo/bar/file.parquet
which didn't exist and 💣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. I'm still minorly concerned about the relative-ness going wrong if this gets called from somewhere else; could you add another .parent
or make this relative to the root dir of the cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#89 is working on restructuring the path getters (which is where this comes from in the first place). Let's address how those paths are treated systematically there, since it'll all change over there anyway (and will catch other possible places that we might need to be paranoid)
try: | ||
|
||
# This should be in the cache, but needs to be converted | ||
close_dataset = dataset.Dataset(name="fanniemae_sample", format="parquet") | ||
|
||
with pytest.raises(SystemExit) as e: | ||
datalogistik.main(close_dataset) | ||
assert e.type == SystemExit | ||
assert e.value.code == 0 | ||
|
||
captured = json.loads(capsys.readouterr().out) | ||
assert captured["name"] == "fanniemae_sample" | ||
assert captured["format"] == "parquet" | ||
assert isinstance(captured["tables"], dict) | ||
|
||
finally: | ||
# cleanup all files that weren't there to start with | ||
for file in set(os.listdir(test_dir_path)) - set(start_files): | ||
shutil.rmtree(pathlib.Path(test_dir_path, file)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could put things in a tempdir with tempfile.TemporaryDirectory()
or whatever just in case the cleanup doesn't get run for whatever reason, but this seems pretty safe
tests/test_tpc.py
Outdated
@@ -19,7 +19,7 @@ | |||
|
|||
import pytest | |||
|
|||
from datalogistik import datalogistik, tpc_validation | |||
from datalogistik import cli, datalogistik, tpc_validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either this was failing before or this is unused, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not totally sure why flake8 in CLI complained only now about this, but it did
The copy-to-temp step had a small bug where it was repeating the path twice in some circumstances. This addresses that + adds two integration tests that confirm that at least one pass through these will succeed.