Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method #256

knassre-bodo · 2025-02-11T02:07:50Z

Resolves #255. See the issue for more details about the goals of this drastic overhaul. The propagation of aliases from parent to child, implicitly creating BACK references, is referred to as down-streaming. The vast majority of the changes are updates to documentation, notebooks, and unit tests to align with these new semantics. The collection equivalents of back-references were deleted because they are no longer needed and were never fully supported to begin with. This PR also includes changes to QDAG nodes and hybrid conversion to account for the change in how terms are handled.
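To make the idea of down-streaming concrete, here is a minimal toy sketch. It is not PyDough code: the `Context` class, its `resolve` method, and the "nearest ancestor wins" rule are all assumptions made for illustration. It only shows how an alias defined in a parent can become visible in a descendant as an implicit back-reference, with the mapping recording how many levels back the alias lives.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Toy stand-in for a collection context (hypothetical, not a PyDough class)."""

    name: str
    parent: "Context | None" = None
    # Aliases defined directly in this context, e.g. by a CALCULATE.
    own_terms: dict[str, object] = field(default_factory=dict)

    def ancestral_mapping(self) -> dict[str, int]:
        """Map each down-streamed alias to how many levels back it is defined."""
        mapping: dict[str, int] = {}
        levels_back, ancestor = 1, self.parent
        while ancestor is not None:
            for alias in ancestor.own_terms:
                # Assumption: the nearest ancestor wins if a name is reused.
                mapping.setdefault(alias, levels_back)
            levels_back, ancestor = levels_back + 1, ancestor.parent
        return mapping

    def resolve(self, alias: str) -> str:
        """Describe how an alias would be resolved from this context."""
        if alias in self.own_terms:
            return f"Reference({alias})"
        mapping = self.ancestral_mapping()
        if alias in mapping:
            # Implicit back-reference: the user never writes BACK(n) explicitly.
            return f"BackReference({alias}, levels={mapping[alias]})"
        raise KeyError(f"Unrecognized term: {alias!r}")


nations = Context("nations", own_terms={"nation_name": "name"})
customers = Context("customers", parent=nations, own_terms={"acctbal": "acctbal"})
orders = Context("orders", parent=customers)

print(orders.resolve("nation_name"))  # BackReference(nation_name, levels=2)
print(orders.resolve("acctbal"))      # BackReference(acctbal, levels=1)
```

The point of the sketch is that the child never names its ancestor: the alias alone is enough, and the level count is derived from where the alias was defined.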

knassre-bodo · 2025-02-12T20:35:33Z

tests/test_qualification.py

 from collections.abc import Callable

 import pytest
 from test_utils import (
    graph_fetcher,
 )
+from tpch_test_functions import (


Switched to using these instead of having the duplicates lying around, since I was getting tired of updating the duplicates.

knassre-bodo · 2025-02-13T08:23:33Z

tests/test_exploration.py

-    lps_back_lines_impl,
-    lps_back_lines_price_impl,
-    lps_back_supplier_impl,
-    lps_back_supplier_name_impl,


These examples were no longer valid

hadia206

Great Effort Kian.
Well done. The code changes do what the comments say.

documentation/functions.md

documentation/usage.md

hadia206 · 2025-02-20T18:36:58Z

documentation/usage.md

-result = european_countries(name, n_custs=COUNT(customers))
-pydough.to_df(result)
+result = european_countries(n=COUNT(customers))
+pydough.to_df(result, columns={"name": "name", "n_custs": "n"})


Why do we have name:name? I don't see it used in the example.

name is already part of european_countries, since it includes all terms from nations as well as the newly defined term (n). The dictionary says which columns from european_countries to include and what to name them, when the inclusion or names differ from the default behavior. In this case, we're saying we want two columns: name (which refers to european_countries.name) and n_custs (which refers to european_countries.n).
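The mapping direction (output column name to term name) can be sketched with a small stand-alone helper. This is only an illustration of the semantics described above, not how `pydough.to_df` is implemented; `select_columns` is a hypothetical function and the row values are made-up placeholders.

```python
def select_columns(rows, columns=None):
    """Project and rename terms; `columns` maps output name -> term name.

    Hypothetical helper for illustration only, not PyDough's implementation.
    """
    if columns is None:
        # Default behavior in this sketch: keep every term under its own name.
        return [dict(row) for row in rows]
    return [{out: row[term] for out, term in columns.items()} for row in rows]


# Placeholder rows standing in for european_countries (values are made up).
european_countries = [
    {"name": "FRANCE", "key": 6, "n": 3},
    {"name": "GERMANY", "key": 7, "n": 5},
]

result = select_columns(
    european_countries, columns={"name": "name", "n_custs": "n"}
)
print(result)
# [{'name': 'FRANCE', 'n_custs': 3}, {'name': 'GERMANY', 'n_custs': 5}]
```

In this reading, `"name": "name"` is needed because listing any columns at all replaces the default selection, so a term that should survive unchanged still has to be named.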

pydough/conversion/relational_converter.py

hadia206 · 2025-02-20T18:47:35Z

pydough/qdag/collections/calculate.py

    def standalone_string(self) -> str:
-        return f"({self.calc_kwarg_strings(False)})"
-
-    def to_string(self) -> str:


Why was to_string deleted?

Moved into the parent abstract class, due to common behavior
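As a rough sketch of that kind of refactor (the class names, constructor arguments, and string formats below are hypothetical and are not taken from the PyDough source), a shared `to_string` can live in the abstract parent while each subclass only supplies its own `standalone_string`:

```python
from abc import ABC, abstractmethod


class ChildOperator(ABC):
    """Hypothetical abstract parent; names are illustrative only."""

    def __init__(self, preceding: str) -> None:
        self.preceding = preceding

    @abstractmethod
    def standalone_string(self) -> str:
        """String form of this node alone, without its predecessor."""

    def to_string(self) -> str:
        # Shared behavior: predecessor text followed by this node's own text.
        return f"{self.preceding}.{self.standalone_string()}"


class Calculate(ChildOperator):
    def __init__(self, preceding: str, kwargs: dict[str, str]) -> None:
        super().__init__(preceding)
        self.kwargs = kwargs

    def standalone_string(self) -> str:
        args = ", ".join(f"{k}={v}" for k, v in self.kwargs.items())
        return f"CALCULATE({args})"


class Where(ChildOperator):
    def __init__(self, preceding: str, condition: str) -> None:
        super().__init__(preceding)
        self.condition = condition

    def standalone_string(self) -> str:
        return f"WHERE({self.condition})"


print(Calculate("Nations", {"n": "COUNT(customers)"}).to_string())
print(Where("Nations", "region.name == 'EUROPE'").to_string())
```

Neither subclass defines `to_string`, which is the deduplication the reply describes.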

documentation/dsl.md

knassre-bodo · 2025-02-21T19:38:10Z

tests/test_pipeline.py

                "rank_with_filters_c",
                lambda: pd.DataFrame(
                    {
-                        "size": [46, 47, 48, 49, 50],
-                        "name": [
+                        "pname": [


Same idea here: the columns got reordered AND renamed.

knassre-bodo · 2025-02-21T19:38:27Z

tests/test_pipeline.py

+                        "p1": [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5 + [5] * 5,
+                        "p2": [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5 + [5] * 5,


And here they got duplicated/renamed.

review-notebook-app · 2025-02-24T18:30:30Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

knassre-bodo · 2025-02-24T19:01:22Z

pydough/qdag/collections/collection_qdag.py

+
+    @property
+    @abstractmethod
+    def inherited_downstreamed_terms(self) -> set[str]:


This one exists because of the multi_partition_access tests. Just having ancestral_mapping is insufficient if we partition some data, compute an aggregate, partition on top of that, then step into the child data (which should be able to access the first aggregate). The reason ancestral_mapping doesn't work is that the first aggregate is no longer part of the ancestry (but is still available because it got "flushed" when the first partition got partitioned again). This property just allows us to specify the names of columns that got flushed and are therefore accessible (w/o worrying about how they are accessed).

knassre-bodo · 2025-02-24T19:02:16Z

pydough/qdag/collections/partition_child.py

+            return BackReferenceExpression(
+                self, term_name, self.ancestral_mapping[term_name]
+            )
+        if term_name in self.inherited_downstreamed_terms:


If it isn't in the ancestral mapping, but is in the inherited downstreamed terms, then we do some leap-frogging logic to figure out where it came from (because it could be nested a few times).

knassre-bodo · 2025-02-24T19:15:02Z

pydough/qdag/collections/partition_child.py

+                else:
+                    assert context.ancestor_context is not None
+                    context = context.ancestor_context
+            return Reference(context, term_name)


The entire point of the set is just so this expression can be returned. By the time we reach hybrid conversion, this expression will correspond to a valid reference that has already been down-streamed into the correct context.
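The lookup order described in these comments can be sketched with a toy model. Everything here is hypothetical: `ToyContext`, its fields, and in particular the loop's stopping condition (walk up until a context defines the term itself) are assumptions, since the diff excerpt does not show that part of the real logic.

```python
from dataclasses import dataclass, field


@dataclass
class ToyContext:
    """Hypothetical stand-in for a QDAG collection node."""

    name: str
    ancestor_context: "ToyContext | None" = None
    own_terms: set[str] = field(default_factory=set)
    ancestral_mapping: dict[str, int] = field(default_factory=dict)
    inherited_downstreamed_terms: set[str] = field(default_factory=set)

    def get_term(self, term_name: str) -> tuple[str, str, str]:
        """Return (kind, context name, term name) for a resolved term."""
        if term_name in self.own_terms:
            return ("Reference", self.name, term_name)
        if term_name in self.ancestral_mapping:
            # Ordinary down-streamed alias: an implicit back-reference.
            return ("BackReference", self.name, term_name)
        if term_name in self.inherited_downstreamed_terms:
            # Leap-frog upward until some context defines the term itself.
            # (Assumed stopping rule; the real condition is not shown above.)
            context = self
            while term_name not in context.own_terms:
                assert context.ancestor_context is not None
                context = context.ancestor_context
            return ("Reference", context.name, term_name)
        raise KeyError(f"Unrecognized term: {term_name!r}")


# "first_agg" was flushed by re-partitioning, so the child's ancestral
# mapping does not list it, but the inherited set still names it.
first = ToyContext("first_partition", own_terms={"first_agg"})
second = ToyContext(
    "second_partition",
    ancestor_context=first,
    inherited_downstreamed_terms={"first_agg"},
)
child = ToyContext(
    "child_data",
    ancestor_context=second,
    inherited_downstreamed_terms={"first_agg"},
)

print(child.get_term("first_agg"))
# ('Reference', 'first_partition', 'first_agg')
```

The set carries only names, which matches the comment above: it says that a term is accessible without recording how it will be accessed.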

knassre-bodo · 2025-02-24T19:15:55Z

tests/simple_pydough_functions.py

@@ -461,42 +483,147 @@ def multi_partition_access_2():
    # Identify transactions that are below the average number of shares for
    # transactions of the same combinations of (customer, stock, type), or
    # the same combination of (customer, stock), or the same customer.
-    grps_a = PARTITION(


Also renamed the variables in the test for clarity.

knassre-bodo · 2025-02-24T19:16:13Z

tests/simple_pydough_functions.py

+    yearly_data = PARTITION(
+        Orders.CALCULATE(year=YEAR(order_date)), name="orders", by=year
+    ).CALCULATE(n_orders=COUNT(orders))
+    return TPCH.CALCULATE(best_year=MAX(yearly_data.n_orders))


 def multi_partition_access_1():


These 6 "multi_partition_access" tests are edge cases for the behavior of ancestral mapping, and are the only tests where the inherited downstreamed terms set comes up.

knassre-bodo · 2025-02-24T19:18:19Z

tests/simple_pydough_functions.py

+        cus_groups.ticks.typs.original_data.WHERE(
+            (shares < cus_tick_typ_avg_shares)


This is a prime example of where the inherited downstreamed terms come into play: due to the interplay of partition by & accessing the partitioned data, cus_tick_typ_avg_shares is not in the ancestral mapping of original_data. Therefore, it isn't resolvable at the QDAG stage unless we have the inherited downstreamed terms set to carry over the implied ancestor terms from cust_tick_typ_groups into cust_tick_groups, and therefore into cus_groups and its descendants.

hadia206

Reviewed last 5 commits.
Overall code seems to be doing what comments are saying. Couldn't think of other test cases.

demos/notebooks/2_pydough_operations.ipynb

hadia206 · 2025-02-25T18:35:17Z

pydough/conversion/relational_converter.py

@@ -812,7 +812,7 @@ def translate_partition_child(
        )
        join_keys: list[tuple[HybridExpr, HybridExpr]] = []
        assert node.subtree.agg_keys is not None
-        for agg_key in node.subtree.agg_keys:
+        for agg_key in sorted(node.subtree.agg_keys, key=str):


why not sorted anymore?

One of the other changes caused a nondeterminism issue, which this corrects.
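For context on why `sorted(..., key=str)` helps, here is a small stand-alone sketch. It assumes the keys are held in a hash-ordered container such as a `set`, which the diff excerpt does not confirm; `AggKey` is a hypothetical stand-in for the real key objects.

```python
class AggKey:
    """Hypothetical stand-in for an expression used as an aggregation key."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __str__(self) -> str:
        return self.name


# Objects without a custom __hash__ are hashed by identity, so the
# iteration order of this set can differ from one run to the next.
keys = {AggKey("year"), AggKey("customer"), AggKey("stock")}

# Sorting by the string form gives the same order on every run.
ordered = sorted(keys, key=str)
print([str(k) for k in ordered])  # ['customer', 'stock', 'year']
```

Sorting by `str` avoids having to define a total ordering on the key objects themselves, at the cost of relying on their string form being unique and stable.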

documentation/usage.md

vineetg3 · 2025-02-26T08:21:54Z

tests/bad_pydough_functions.py



 def bad_contains():
    # Using `in` operator (calls __contains__)
-    return Orders("discount" in order.details)


Isn't this still illegal code? Or has the test case been changed?

I just altered the code slightly to something that is only illegal because of in, as opposed to anything else.
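One general Python reason that expression-building DSLs tend to reject `in` is that the interpreter coerces whatever `__contains__` returns to a `bool`, so the DSL never gets to capture an expression object. The toy `Expr` class below is hypothetical and is not a PyDough class; whether this is the exact reason PyDough rejects `in` is an assumption.

```python
class Expr:
    """Toy DSL node (hypothetical) that builds expression strings."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __eq__(self, other: object) -> "Expr":
        # Comparison operators may return arbitrary objects.
        return Expr(f"({self.text} == {other!r})")

    def __contains__(self, item: object) -> "Expr":
        # The return value is converted to bool by the `in` operator.
        return Expr(f"CONTAINS({self.text}, {item!r})")


eq_result = Expr("discount") == 5
in_result = "discount" in Expr("details")

print(type(eq_result).__name__)  # Expr
print(in_result)                 # True
```

Because `in_result` is a plain `True` rather than an `Expr`, the expression is silently lost, which is why raising an error in that situation is the safer behavior for a DSL.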

vineetg3 · 2025-02-26T17:11:44Z

pydough/qdag/collections/augmenting_child_operator.py

+        return self.preceding_context.ancestral_mapping
+
+    @property
+    def inherited_downstreamed_terms(self) -> set[str]:


What's inherited_downstreamed_terms for?

This is used in some strange edge cases involving PartitionBy. See the multi_partition_access test cases I referenced earlier in my comments on this PR.

demos/notebooks/2_pydough_operations.ipynb

knassre-bodo added 8 commits February 10, 2025 21:06

started mass renaming and addition of the ancestral mapping

096c752

Fixing imports, allowing reuse-under-same-name uses

19315f8

Cleaning up some tests and error handling to allow non-renaming accesses

580315a

Adjusting how the ancestral mapping works

4cb488f

Working on purge of back reference collection

6c5ff8e

Begining massive purge rename of () to .CALCULATE()

0e71711

Continuing renaming purge

7100472

Continuing the back/calc rename purge

ae7dacc

knassre-bodo commented Feb 12, 2025

knassre-bodo added 8 commits February 12, 2025 16:54

Continued fixing exploration, updated all_terms handling, fixing part…

f32de87

…ition/partition_child edge cases

Hybrid handling for new BACK semantics, need to fix partition handing…

b5fa65e

… for hybrid and update remainign tpch queries

Working on hybrid partition cases, all queries working except q11 wit…

b80205d

…h a hack to avoid correlate generation

Fixing extreme edge case bug

42acc02

Resolving merge conflicts

aee583f

Fixing defog functions

4307846

Cleanup of correlate avoiding case [RUN CI]

2942171

updated core spec docs, need to finish updating notebooks to purge ol…

9aa294a

…d calc & back

knassre-bodo changed the title ~~Overhaul BACK and CALC to use downstreaming of aliases~~ Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method Feb 13, 2025

knassre-bodo commented Feb 13, 2025

knassre-bodo requested review from vineetg3, amullerbodo, gzbodo, aramirez-bodo and J-Solano-bodo-ai February 13, 2025 08:24

knassre-bodo marked this pull request as ready for review February 13, 2025 08:25

knassre-bodo added 5 commits February 13, 2025 03:25

[RUN CI]

1beb081

Added to_sql and to_df keyword argument for columns

ddd8d3b

Testing to_sql with columns arg

c2084d0

Added to_df tests

fee822f

Updating usage doc [RUN CI]

09372ab

knassre-bodo added 2 commits February 20, 2025 13:25

Getting remaining tests online [RUN CI]

8b0aff9

Merge remote-tracking branch 'origin/kian/back_overhaul' into kian/ba…

a8e604c

…ck_overhaul

hadia206 reviewed Feb 20, 2025

knassre-bodo commented Feb 21, 2025

knassre-bodo and others added 6 commits February 21, 2025 14:44

Further revisions

ac26ece

Update pydough/conversion/relational_converter.py

588501e

Co-authored-by: Hadia Ahmed <[email protected]>

Making agg join keys deterministically sorted

4b81f1b

Merge remote-tracking branch 'origin/kian/back_overhaul' into kian/ba…

68f614c

…ck_overhaul

Overhaul to how partition child handles backrefs to reconcile convolu…

2502a97

…ted ancestries

Added one more extreme edge case for the partition child behavior [RU…

764e134

…N CI]

Resolving conflicts [RUN CI]

8c155e5

knassre-bodo commented Feb 24, 2025

knassre-bodo requested a review from hadia206 February 24, 2025 21:01

hadia206 approved these changes Feb 25, 2025

vineetg3 reviewed Feb 25, 2025

documentation/usage.md

vineetg3 reviewed Feb 26, 2025

knassre-bodo requested a review from vineetg3 February 26, 2025 17:09

vineetg3 reviewed Feb 26, 2025

Revising notebooks [RUN CI]

2fe4d4d

vineetg3 reviewed Feb 26, 2025

demos/notebooks/2_pydough_operations.ipynb

demos/notebooks/2_pydough_operations.ipynb

Revisions [RUN CI]

8c70a13

knassre-bodo merged commit 1636bfa into main Feb 26, 2025
5 checks passed

knassre-bodo deleted the kian/back_overhaul branch February 26, 2025 20:56
