Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-45517: No longer check parent when checking compatibility #1044

Merged
merged 5 commits into from
Aug 1, 2024

Conversation

parejkoj
Copy link
Contributor

Checklist

  • ran Jenkins
  • [??] added a release note for user-visible changes to doc/changes

I'm not sure whether this is a "user-visible" change, and if so, I don't know where to put it in the docs.

@parejkoj parejkoj changed the title Support conversion of Parquet storage class components DM-45517: Support conversion of Parquet storage class components Jul 31, 2024
Copy link

codecov bot commented Jul 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.38%. Comparing base (28efc2a) to head (7570ecb).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1044   +/-   ##
=======================================
  Coverage   89.37%   89.38%           
=======================================
  Files         359      359           
  Lines       45647    45657   +10     
  Branches     9360     9362    +2     
=======================================
+ Hits        40799    40811   +12     
+ Misses       3521     3520    -1     
+ Partials     1327     1326    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs discussion. The fix is not correct.

@@ -642,6 +642,13 @@ def testWriteSingleIndexDataFrameReadAsArrowTable(self):
if schema.field(name).type not in (pa.string(), pa.binary()):
self.assertEqual(schema.field(name).type, schema2.field(name).type)

# Check getting a component while overriding the storage class via
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the test up to line 631 because this test is doing the same test but rather than specifying a storage class override on the get() it's doing the override in the DatasetType.

@@ -642,6 +642,13 @@ def testWriteSingleIndexDataFrameReadAsArrowTable(self):
if schema.field(name).type not in (pa.string(), pa.binary()):
self.assertEqual(schema.field(name).type, schema2.field(name).type)

# Check getting a component while overriding the storage class via
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Check getting a component while overriding the storage class via
# Check getting a component while overriding the parent storage class via

This is an important distinction because what's happening is that you are changing from DataFrame to ArrowAstropy and then using columns which ends up with a different definition automatically, whereas above the test is using DataFrame.columns and then changing the storage class.

This code works already:

columnsType = self.datasetType.makeComponentDatasetType("columns").overrideStorageClass("ArrowColumnList")
self.assertEqual(columns2, self.butler.get(columnsType))

but could you add that test as well to make it clear that the order of the component making / overriding matters?

# Support conversion when the components have the same name and
# convertible parents, e.g. Parquet storage classes.
if self.component() is not None and self.component() == other.component():
return self.makeCompositeDatasetType().is_compatible_with(other.makeCompositeDatasetType())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right fix. In my testing if I have two composite dataset types that have components that are incompatible this says that everything is fine. We have to test somewhere that the components are compatible.

The parquet test is failing for a different reason. It's failing because _equal_ignoring_storage_class is not ignoring parent storage class and so this returns False before the actual storage class is checked. I think the fix here is to think harder about the parent storage class in this context. Maybe is_compatible_with has to check parent compatibility and component compatibility or maybe it doesn't care about the parent at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that we could declare that two component dataset types are compatible if either of the following is true:

  • if the component storage classes are directly compatible (this is the only supported case on main);
  • if the parent storage classes are compatible and the component names are the same (the condition added on this branch, but I think it's implied to already work by the other way the same get is already written on the test).

The specific situation here is that as long as is_compatible_with doesn't return False for ArrowAstropy.columns vs. DataFrame.columns, the way we call the Formatter already works, and we'd rather let the Formatter do the conversion anyway. So I'd rather not have to define explicit conversions between the two component storage classes in this case, since that wouldn't really reflect how the conversion works.

I am somewhat worried that this would lead us to declare two storage classes as convertible when a conversion doesn't actually exist (except in the formatter), but I don't see that coming up in practice in this case so I wasn't sure how much to worry about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be better if the new check here moved up to the place where we call is_compatible_with in Butler.get, so that my proposed new rule on component compatibility is not a general one, but rather a chance to let the formatter handle a request where the Butler cannot rule out a conversion working.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relevant columns storage class definitions already register converters and are compatible (that's the reason why the test above the new test was working since the storage class override on the butler.get() was specifying the column storage class).

I think if we start saying that conversions for components don't have to work if the parents can convert that might break the storageClass=Override option for components. I don't like the idea of not checking component compatibility at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is_compatible_with not checking parent storage class but requiring that components are convertible is good enough isn't it? The test passes now if I remove the parent storage class check. We can assume that if you are comparing components you don't care about the parent.

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I'm surprised but happen that the component storage classes we care about here are already compatible without any extra effort.

@timj
Copy link
Member

timj commented Aug 1, 2024

I'm surprised but happy that the component storage classes we care about here are already compatible

If they weren't then many of the tests that already existed for parquet would have failed since that's how storageClass= overrides worked if you asked for columns. @erykoff added the converters up front.

@timj timj changed the title DM-45517: Support conversion of Parquet storage class components DM-45517: No longer check parent when checking compatibility Aug 1, 2024
@parejkoj parejkoj merged commit 106dc9f into main Aug 1, 2024
18 checks passed
@parejkoj parejkoj deleted the tickets/DM-45517 branch August 1, 2024 18:27
@timj
Copy link
Member

timj commented Aug 1, 2024

@parejkoj I was expecting to merge this once my Jenkins run completed.

@parejkoj
Copy link
Contributor Author

parejkoj commented Aug 1, 2024

Oops, sorry! We can revert if jenkins fails.

@timj
Copy link
Member

timj commented Aug 1, 2024

Jenkins passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants