-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method #256
Conversation
from collections.abc import Callable | ||
|
||
import pytest | ||
from test_utils import ( | ||
graph_fetcher, | ||
) | ||
from tpch_test_functions import ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to using these instead of having the duplicates lying around, since I was getting tired of updating the duplicates.
…ition/partition_child edge cases
… for hybrid and update remainign tpch queries
…h a hack to avoid correlate generation
lps_back_lines_impl, | ||
lps_back_lines_price_impl, | ||
lps_back_supplier_impl, | ||
lps_back_supplier_name_impl, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These examples were no longer valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great Effort Kian.
Well done. Code changes is doing what the comments are saying.
result = european_countries(name, n_custs=COUNT(customers)) | ||
pydough.to_df(result) | ||
result = european_countries(n=COUNT(customers)) | ||
pydough.to_df(result, columns={"name": "name", "n_custs": "n"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have name:name
? I don't se it used in the example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name
is already part of european_countries
since it includes all terms from nations
, as well as the new defined term (n
). The point of the dictionary is to say which columns from european_countries
to include, and what to name them, if the inclusion/names differ from the normal behavior. IN this case, we're saying we want two columns: name
(which refers to european_countries.name
) and n_custs
(which refers to european_countries.n
)
def standalone_string(self) -> str: | ||
return f"({self.calc_kwarg_strings(False)})" | ||
|
||
def to_string(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was to_string
deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved into the parent abstract class, due to common behavior
"rank_with_filters_c", | ||
lambda: pd.DataFrame( | ||
{ | ||
"size": [46, 47, 48, 49, 50], | ||
"name": [ | ||
"pname": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same idea here: the columns got reordered AND renamed.
"p1": [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5 + [5] * 5, | ||
"p2": [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5 + [5] * 5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here they got duplicated/renamed.
Co-authored-by: Hadia Ahmed <[email protected]>
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
||
@property | ||
@abstractmethod | ||
def inherited_downstreamed_terms(self) -> set[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one exists because of the multi_partition_access
tests. Just having ancestral_mapping
is insufficient if partition some data, compute an aggregate, partition on top of that, then step into the child data (should be able to access the first aggregate). The reason ancestral_mapping
doesn't work is that this is that the first aggregate is no longer part of the ancestry (but is still available because it got "flushed" when the first partition got partitioned again). This property just allows us to specify the names of columns that got flushed, and therefore are accessible (w/o worrying about how they are accessed).
return BackReferenceExpression( | ||
self, term_name, self.ancestral_mapping[term_name] | ||
) | ||
if term_name in self.inherited_downstreamed_terms: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it isn't in the ancestral mapping, but is in the inherited downstreamed terms, then we do some leap-frogging logic to figure out where it came from (because it could be nested a few times).
else: | ||
assert context.ancestor_context is not None | ||
context = context.ancestor_context | ||
return Reference(context, term_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entire point of the set is just so this expression can be returned. By the time we reach hybrid conversion, this expression will correspond to a valid reference that has already been down-streamed into the correct context.
@@ -461,42 +483,147 @@ def multi_partition_access_2(): | |||
# Identify transactions that are below the average number of shares for | |||
# transactions of the same combinations of (customer, stock, type), or | |||
# the same combination of (customer, stock), or the same customer. | |||
grps_a = PARTITION( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also renamed the variables in the test for clarity.
yearly_data = PARTITION( | ||
Orders.CALCULATE(year=YEAR(order_date)), name="orders", by=year | ||
).CALCULATE(n_orders=COUNT(orders)) | ||
return TPCH.CALCULATE(best_year=MAX(yearly_data.n_orders)) | ||
|
||
|
||
def multi_partition_access_1(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 6 "multi_partition_access" tests are edge cases for the behavior of ancestral mapping, and are the only tests where the inherited downstreamed terms set comes up.
cus_groups.ticks.typs.original_data.WHERE( | ||
(shares < cus_tick_typ_avg_shares) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a prime example of where the inherited downstreamed terms comes into play: due to the interplay of partition by & accessing the partitioned data, cus_tick_typ_avg_shares
is not in the ancestral mapping of original_data
. Therefore, it isn't resolvable at the QDAG stage unless we have the inherited downstreamed terms set to carry over the implied ancestor terms from cust_tick_typ_groups
into cust_tick_groups
, and therefore into cus_groups
and its descendants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed last 5 commits.
Overall code seems to be doing what comments are saying. Couldn't think of other test cases.
@@ -812,7 +812,7 @@ def translate_partition_child( | |||
) | |||
join_keys: list[tuple[HybridExpr, HybridExpr]] = [] | |||
assert node.subtree.agg_keys is not None | |||
for agg_key in node.subtree.agg_keys: | |||
for agg_key in sorted(node.subtree.agg_keys, key=str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not sorted anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the other changes caused a nondeterminism issue, which this corrects.
|
||
|
||
def bad_contains(): | ||
# Using `in` operator (calls __contains__) | ||
return Orders("discount" in order.details) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this still illegal code? Or has the test case been changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just altered the code slightly to something that is only illegal because of in
, as opposed to anything else.
return self.preceding_context.ancestral_mapping | ||
|
||
@property | ||
def inherited_downstreamed_terms(self) -> set[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's inherited_downstramed_terms
for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used in some strange edge cases involving PartitionBy
. See the multi_partition_access
test cases I referenced earlier in my comments on this PR.
Resolves #255. See issue for more details about the goals of the drastic overhaul. The propagation of aliases from parent to child, implicitly creating
BACK
references, is referred to as down-streaming. The vast majority of the changes are updates to documentaiton, notebooks, and unit tests to align with these new semantics. The collection equivalents of back-reference were deleted as they are no longer needed, and weren't fully supported to begin with. Also includes changes to QDAG nodes and hybrid conversion to account for the change in how terms are handled.