Clean up a path bug + add tests for it #91

Merged: 7 commits into main on Sep 21, 2022
Conversation

jonkeane (Contributor):

The copy-to-temp step had a small bug where it was repeating the path twice in some circumstances. This addresses that + adds two integration tests that confirm that at least one pass through these will succeed.

Comment on lines 33 to 35
def main(dataset=None):
    if dataset is None:
        dataset = cli.parse_args_and_get_dataset_info()
jonkeane (Contributor Author):

This makes it easier to test with various dataset objects

Comment on lines +364 to +382
    try:

        # This should be in the cache, but needs to be converted
        close_dataset = dataset.Dataset(name="fanniemae_sample", format="parquet")

        with pytest.raises(SystemExit) as e:
            datalogistik.main(close_dataset)
        assert e.type == SystemExit
        assert e.value.code == 0

        captured = json.loads(capsys.readouterr().out)
        assert captured["name"] == "fanniemae_sample"
        assert captured["format"] == "parquet"
        assert isinstance(captured["tables"], dict)

    finally:
        # cleanup all files that weren't there to start with
        for file in set(os.listdir(test_dir_path)) - set(start_files):
            shutil.rmtree(pathlib.Path(test_dir_path, file))
jonkeane (Contributor Author):

Is this the right pattern for cleaning up a messy test?

Collaborator:

You could put things in a tempdir with tempfile.TemporaryDirectory() or whatever just in case the cleanup doesn't get run for whatever reason, but this seems pretty safe

Member:

I kinda like using a TemporaryDirectory which I know will be cleaned up no matter what. Then you don't need the try-finally.

Member:

also, this will fix the failing windows test haha

jonkeane (Contributor Author):

nods yeah that works generally, though here we're operating inside of the datalogistik cache.

I guess we could copy the cache fixture to the temp directory and then we know it's gone... I'll make an issue for that
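(For reference, a minimal sketch of the TemporaryDirectory approach discussed above. The fixture path and the idea of copying the cache fixture into the temp dir are assumptions about how such a test might be wired up, not the actual change, which is deferred to #92.)

    import pathlib
    import shutil
    import tempfile

    def test_main_with_copied_cache():
        # The directory and everything copied into it are removed when the
        # context manager exits, even if an assertion fails, so no try/finally
        # bookkeeping of pre-existing files is needed.
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_cache = pathlib.Path(tmp_dir, "cache")
            # Hypothetical fixture location; the real test would copy the
            # checked-in cache fixture here and point datalogistik at it.
            shutil.copytree("tests/fixtures/test_cache", tmp_cache)
            ...  # run datalogistik.main(...) against tmp_cache and assert on its output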

jonkeane (Contributor Author):

#92

tests/test_tpc.py (review comment outdated, resolved)
alistaire47 (Collaborator) left a comment:

Looks good, just a couple questions

@@ -30,8 +30,9 @@ def finish():
     sys.exit(0)


-def main():
-    dataset = cli.parse_args_and_get_dataset_info()
+def main(dataset=None):
Collaborator:

minor: Could we get a type hint here so I can see how to override this if I need to when messing with tests or whatever?

Collaborator:

eh, after reading I guess this is a Dataset instance, but more specifically the metadata for one that's for the right dataset, but not necessarily the right format. So a type hint wouldn't do it anyway; it'd have to be a comment or docstring

jonkeane (Contributor Author):

Done
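(For context, a sketch of the kind of parameter documentation the thread is asking for; the wording is illustrative, based on the discussion above, not the text that was actually committed.)

    def main(dataset=None):
        # `dataset` is a Dataset instance carrying the metadata for the dataset
        # we want: the right name, but not necessarily already in the right
        # format. When None, it is parsed from the command-line arguments;
        # passing one in directly makes the function easier to exercise in tests.
        if dataset is None:
            dataset = cli.parse_args_and_get_dataset_info()
        ...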

-    tmp_dir_name = pathlib.Path(
-        output_file.parent, f"{output_file}.tmp"
-    )
+    tmp_dir_name = pathlib.Path(f"{output_file}.tmp")
Collaborator:

Not quite sure what's going on here, but this is now fully relative, which it may not have been before. Are we sure there aren't situations in which this will write to a directory other than what's expected?

jonkeane (Contributor Author):

This was the bug I was correcting. It was actually still relative before, but then also adding the parent onto it. So if output_file was like foo/bar/file.parquet, this made tmp_dir_name into /User/name/.datalogistik_cache/foo/bar/foo/bar/file.parquet, which didn't exist and 💣
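(A small illustration of the doubled-path behavior described above; the paths are made up for the example.)

    import pathlib

    output_file = pathlib.Path("foo/bar/file.parquet")  # relative path from the path getters

    # Old code: joins the parent back onto an already-relative path,
    # duplicating the directory components.
    pathlib.Path(output_file.parent, f"{output_file}.tmp")
    # -> PosixPath('foo/bar/foo/bar/file.parquet.tmp')

    # Fixed code: keeps the path as-is and only appends the .tmp suffix.
    pathlib.Path(f"{output_file}.tmp")
    # -> PosixPath('foo/bar/file.parquet.tmp')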

Collaborator:

Ah. I'm still minorly concerned about the relative-ness going wrong if this gets called from somewhere else; could you add another .parent or make this relative to the root dir of the cache?

jonkeane (Contributor Author):

#89 is working on restructuring the path getters (which is where this comes from in the first place). Let's address how those paths are treated systematically there, since it'll all change over there anyway (and will catch other possible places that we might need to be paranoid)


@@ -19,7 +19,7 @@

 import pytest

-from datalogistik import datalogistik, tpc_validation
+from datalogistik import cli, datalogistik, tpc_validation
Collaborator:

Either this was failing before or this is unused, no?

jonkeane (Contributor Author):

I'm not totally sure why flake8 in CI complained only now about this, but it did

jonkeane merged commit 769e870 into main on Sep 21, 2022
jonkeane deleted the convert_has_output branch on September 21, 2022 at 20:37