DM-45517: No longer check parent when checking compatibility #1044
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##             main    #1044   +/-   ##
=======================================
  Coverage   89.37%   89.38%
=======================================
  Files         359      359
  Lines       45647    45657    +10
  Branches     9360     9362     +2
=======================================
+ Hits        40799    40811    +12
+ Misses       3521     3520     -1
+ Partials     1327     1326     -1
```

☔ View full report in Codecov by Sentry.
This needs discussion. The fix is not correct.
tests/test_parquet.py (Outdated)

```diff
@@ -642,6 +642,13 @@ def testWriteSingleIndexDataFrameReadAsArrowTable(self):
             if schema.field(name).type not in (pa.string(), pa.binary()):
                 self.assertEqual(schema.field(name).type, schema2.field(name).type)

+        # Check getting a component while overriding the storage class via
```
Can you move the test up to line 631? This test performs the same check, but rather than specifying a storage class override on the get(), it does the override in the DatasetType.
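For context, a rough sketch of the two styles being contrasted; dataId is a placeholder and the exact calls are my reading of the thread, not the test code itself:

```python
# Existing style near line 631: build the component dataset type, then
# override the storage class at read time, on the get() call itself.
columns1 = self.butler.get(
    self.datasetType.makeComponentDatasetType("columns"),
    dataId,
    storageClass="ArrowColumnList",
)

# New style under review: override the parent storage class in the
# DatasetType first, then read the component of the overridden type.
overridden = self.datasetType.overrideStorageClass("ArrowAstropy")
columns2 = self.butler.get(overridden.makeComponentDatasetType("columns"), dataId)
```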
Suggested change:

```diff
-        # Check getting a component while overriding the storage class via
+        # Check getting a component while overriding the parent storage class via
```
This is an important distinction. What's happening here is that you are changing from DataFrame to ArrowAstropy and then using columns, which automatically ends up with a different definition, whereas the test above uses DataFrame.columns and then changes the storage class.
This code works already:

```python
columnsType = self.datasetType.makeComponentDatasetType("columns").overrideStorageClass("ArrowColumnList")
self.assertEqual(columns2, self.butler.get(columnsType))
```

but could you add that test as well to make it clear that the order of the component making / overriding matters?
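A hedged sketch of why the order matters; the comments are my reading of the discussion above, not verified behavior:

```python
# Component first, then override: the component dataset type is built from
# the original DataFrame parent, and only the component's storage class is
# swapped to ArrowColumnList afterwards.
a = self.datasetType.makeComponentDatasetType("columns").overrideStorageClass("ArrowColumnList")

# Override first, then component: the parent becomes ArrowAstropy, so
# "columns" picks up whatever definition that parent declares for it.
b = self.datasetType.overrideStorageClass("ArrowAstropy").makeComponentDatasetType("columns")

# Both reads should succeed, but a and b are not the same dataset type.
self.assertEqual(columns2, self.butler.get(a))
```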
```python
# Support conversion when the components have the same name and
# convertible parents, e.g. Parquet storage classes.
if self.component() is not None and self.component() == other.component():
    return self.makeCompositeDatasetType().is_compatible_with(other.makeCompositeDatasetType())
```
I don't think this is the right fix. In my testing, if I have two composite dataset types whose components are incompatible, this says that everything is fine. We have to test somewhere that the components are compatible.

The parquet test is failing for a different reason: _equal_ignoring_storage_class is not ignoring the parent storage class, and so this returns False before the actual storage class is checked. I think the fix here is to think harder about the parent storage class in this context. Maybe is_compatible_with has to check both parent compatibility and component compatibility, or maybe it doesn't care about the parent at all.
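A hypothetical illustration of the objection; the component name, the parent dataset types, and the incompatibility are invented for the example:

```python
# Suppose "stats" is a component whose own storage classes are NOT
# convertible between the two parents, even though the parents are.
this = parquet_type.makeComponentDatasetType("stats")  # hypothetical
other = arrow_type.makeComponentDatasetType("stats")   # hypothetical

# With the change as written, the component names match, so the check
# delegates to the (convertible) parents and reports compatibility,
# even though the component storage classes themselves cannot convert.
this.is_compatible_with(other)  # True under this fix, but should be False
```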
I was thinking that we could declare that two component dataset types are compatible if either of the following is true (see the sketch after this list):
- the component storage classes are directly compatible (this is the only supported case on main);
- the parent storage classes are compatible and the component names are the same (the condition added on this branch, though I think the other way the same get is written in the test implies it already works).
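A minimal sketch of that proposal, assuming StorageClass.can_convert has this meaning and eliding the name/dimension checks; this is my reading of the suggestion, not the merged code:

```python
def is_compatible_with(self, other):
    # Rule 1: the storage classes themselves convert directly
    # (the only case supported on main).
    if other.storageClass.can_convert(self.storageClass):
        return True
    # Rule 2: same component name on compatible parents
    # (the condition added on this branch).
    if self.component() is not None and self.component() == other.component():
        return self.makeCompositeDatasetType().is_compatible_with(
            other.makeCompositeDatasetType()
        )
    return False
```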
The specific situation here is that as long as is_compatible_with doesn't return False for ArrowAstropy.columns vs. DataFrame.columns, the way we call the Formatter already works, and we'd rather let the Formatter do the conversion anyway. So I'd rather not have to define explicit conversions between the two component storage classes in this case, since that wouldn't really reflect how the conversion works.
I am somewhat worried that this would lead us to declare two storage classes as convertible when a conversion doesn't actually exist (except in the formatter), but I don't see that coming up in practice in this case so I wasn't sure how much to worry about it.
Maybe it would be better if the new check here moved up to the place where we call is_compatible_with in Butler.get, so that my proposed new rule on component compatibility is not a general one, but rather a chance to let the formatter handle a request where the Butler cannot rule out a conversion working.
The relevant columns storage class definitions already register converters and are compatible (that's the reason why the test above the new test was working: the storage class override on the butler.get() was specifying the column storage class).

I think if we start saying that conversions for components don't have to work as long as the parents can convert, that might break the storageClass override option for components. I don't like the idea of not checking component compatibility at all.
I think is_compatible_with not checking the parent storage class, but requiring that the components are convertible, is good enough, isn't it? The test passes now if I remove the parent storage class check. We can assume that if you are comparing components you don't care about the parent.
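Behaviorally, the agreed rule looks something like this; dataframe_type and astropy_type are hypothetical stand-ins for the two parent dataset types in the test:

```python
# Two component dataset types with the same component name but
# different parents (DataFrame vs. ArrowAstropy).
df_columns = dataframe_type.makeComponentDatasetType("columns")
ap_columns = astropy_type.makeComponentDatasetType("columns")

# Compatibility is decided only by the component storage classes, which
# already register converters for the parquet classes; the parent
# storage classes are deliberately ignored.
assert df_columns.is_compatible_with(ap_columns)
```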
Commit: No longer check parent when checking compatibility. Datastore requires that components are compatible. Parent compatibility is not important when the component Python object is returned.
Looks good. I'm surprised but happy that the component storage classes we care about here are already compatible without any extra effort. If they weren't, then many of the tests that already existed for parquet would have failed, since that's how the existing component reads already work.
@parejkoj I was expecting to merge this once my Jenkins run completed.

Oops, sorry! We can revert if Jenkins fails.

Jenkins passed.
Checklist
- doc/changes: I'm not sure whether this is a "user-visible" change, and if so, I don't know where to put it in the docs.