Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method #256

Merged
merged 41 commits into from
Feb 26, 2025

Conversation

knassre-bodo
Copy link
Contributor

@knassre-bodo knassre-bodo commented Feb 11, 2025

Resolves #255. See issue for more details about the goals of the drastic overhaul. The propagation of aliases from parent to child, implicitly creating BACK references, is referred to as down-streaming. The vast majority of the changes are updates to documentaiton, notebooks, and unit tests to align with these new semantics. The collection equivalents of back-reference were deleted as they are no longer needed, and weren't fully supported to begin with. Also includes changes to QDAG nodes and hybrid conversion to account for the change in how terms are handled.

from collections.abc import Callable

import pytest
from test_utils import (
graph_fetcher,
)
from tpch_test_functions import (
Copy link
Contributor Author

@knassre-bodo knassre-bodo Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to using these instead of having the duplicates lying around, since I was getting tired of updating the duplicates.

@knassre-bodo knassre-bodo changed the title Overhaul BACK and CALC to use downstreaming of aliases Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method Feb 13, 2025
Comment on lines -20 to -23
lps_back_lines_impl,
lps_back_lines_price_impl,
lps_back_supplier_impl,
lps_back_supplier_name_impl,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These examples were no longer valid

@knassre-bodo knassre-bodo marked this pull request as ready for review February 13, 2025 08:25
Copy link

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great Effort Kian.
Well done. Code changes is doing what the comments are saying.

result = european_countries(name, n_custs=COUNT(customers))
pydough.to_df(result)
result = european_countries(n=COUNT(customers))
pydough.to_df(result, columns={"name": "name", "n_custs": "n"})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have name:name? I don't se it used in the example

Copy link
Contributor Author

@knassre-bodo knassre-bodo Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name is already part of european_countries since it includes all terms from nations, as well as the new defined term (n). The point of the dictionary is to say which columns from european_countries to include, and what to name them, if the inclusion/names differ from the normal behavior. IN this case, we're saying we want two columns: name (which refers to european_countries.name) and n_custs (which refers to european_countries.n)

def standalone_string(self) -> str:
return f"({self.calc_kwarg_strings(False)})"

def to_string(self) -> str:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was to_string deleted?

Copy link
Contributor Author

@knassre-bodo knassre-bodo Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved into the parent abstract class, due to common behavior

"rank_with_filters_c",
lambda: pd.DataFrame(
{
"size": [46, 47, 48, 49, 50],
"name": [
"pname": [
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same idea here: the columns got reordered AND renamed.

Comment on lines +550 to +551
"p1": [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5 + [5] * 5,
"p2": [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5 + [5] * 5,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here they got duplicated/renamed.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB


@property
@abstractmethod
def inherited_downstreamed_terms(self) -> set[str]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one exists because of the multi_partition_access tests. Just having ancestral_mapping is insufficient if partition some data, compute an aggregate, partition on top of that, then step into the child data (should be able to access the first aggregate). The reason ancestral_mapping doesn't work is that this is that the first aggregate is no longer part of the ancestry (but is still available because it got "flushed" when the first partition got partitioned again). This property just allows us to specify the names of columns that got flushed, and therefore are accessible (w/o worrying about how they are accessed).

return BackReferenceExpression(
self, term_name, self.ancestral_mapping[term_name]
)
if term_name in self.inherited_downstreamed_terms:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it isn't in the ancestral mapping, but is in the inherited downstreamed terms, then we do some leap-frogging logic to figure out where it came from (because it could be nested a few times).

else:
assert context.ancestor_context is not None
context = context.ancestor_context
return Reference(context, term_name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire point of the set is just so this expression can be returned. By the time we reach hybrid conversion, this expression will correspond to a valid reference that has already been down-streamed into the correct context.

@@ -461,42 +483,147 @@ def multi_partition_access_2():
# Identify transactions that are below the average number of shares for
# transactions of the same combinations of (customer, stock, type), or
# the same combination of (customer, stock), or the same customer.
grps_a = PARTITION(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also renamed the variables in the test for clarity.

yearly_data = PARTITION(
Orders.CALCULATE(year=YEAR(order_date)), name="orders", by=year
).CALCULATE(n_orders=COUNT(orders))
return TPCH.CALCULATE(best_year=MAX(yearly_data.n_orders))


def multi_partition_access_1():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 6 "multi_partition_access" tests are edge cases for the behavior of ancestral mapping, and are the only tests where the inherited downstreamed terms set comes up.

Comment on lines +498 to +499
cus_groups.ticks.typs.original_data.WHERE(
(shares < cus_tick_typ_avg_shares)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a prime example of where the inherited downstreamed terms comes into play: due to the interplay of partition by & accessing the partitioned data, cus_tick_typ_avg_shares is not in the ancestral mapping of original_data. Therefore, it isn't resolvable at the QDAG stage unless we have the inherited downstreamed terms set to carry over the implied ancestor terms from cust_tick_typ_groups into cust_tick_groups, and therefore into cus_groups and its descendants.

Copy link

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed last 5 commits.
Overall code seems to be doing what comments are saying. Couldn't think of other test cases.

@@ -812,7 +812,7 @@ def translate_partition_child(
)
join_keys: list[tuple[HybridExpr, HybridExpr]] = []
assert node.subtree.agg_keys is not None
for agg_key in node.subtree.agg_keys:
for agg_key in sorted(node.subtree.agg_keys, key=str):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not sorted anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the other changes caused a nondeterminism issue, which this corrects.



def bad_contains():
# Using `in` operator (calls __contains__)
return Orders("discount" in order.details)
Copy link
Contributor

@vineetg3 vineetg3 Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this still illegal code? Or has the test case been changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just altered the code slightly to something that is only illegal because of in, as opposed to anything else.

return self.preceding_context.ancestral_mapping

@property
def inherited_downstreamed_terms(self) -> set[str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's inherited_downstramed_terms for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used in some strange edge cases involving PartitionBy. See the multi_partition_access test cases I referenced earlier in my comments on this PR.

@knassre-bodo knassre-bodo merged commit 1636bfa into main Feb 26, 2025
5 checks passed
@knassre-bodo knassre-bodo deleted the kian/back_overhaul branch February 26, 2025 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Overhaul the way back-references and CALC work to eliminate the need for BACK syntax
3 participants