Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method #256

Merged
merged 41 commits into from
Feb 26, 2025
Changes from 40 commits
096c752
started mass renaming and addition of the ancestral mapping
knassre-bodo Feb 11, 2025
19315f8
Fixing imports, allowing reuse-under-same-name uses
knassre-bodo Feb 11, 2025
580315a
Cleaning up some tests and error handling to allow non-renaming accesses
knassre-bodo Feb 11, 2025
4cb488f
Adjusting how the ancestral mapping works
knassre-bodo Feb 11, 2025
6c5ff8e
Working on purge of back reference collection
knassre-bodo Feb 12, 2025
0e71711
Begining massive purge rename of () to .CALCULATE()
knassre-bodo Feb 12, 2025
7100472
Continuing renaming purge
knassre-bodo Feb 12, 2025
ae7dacc
Continuing the back/calc rename purge
knassre-bodo Feb 12, 2025
f32de87
Continued fixing exploration, updated all_terms handling, fixing part…
knassre-bodo Feb 12, 2025
b5fa65e
Hybrid handling for new BACK semantics, need to fix partition handing…
knassre-bodo Feb 12, 2025
b80205d
Working on hybrid partition cases, all queries working except q11 wit…
knassre-bodo Feb 13, 2025
42acc02
Fixing extreme edge case bug
knassre-bodo Feb 13, 2025
aee583f
Resolving merge conflicts
knassre-bodo Feb 13, 2025
4307846
Fixing defog functions
knassre-bodo Feb 13, 2025
2942171
Cleanup of correlate avoiding case [RUN CI]
knassre-bodo Feb 13, 2025
9aa294a
updated core spec docs, need to finish updating notebooks to purge ol…
knassre-bodo Feb 13, 2025
1beb081
[RUN CI]
knassre-bodo Feb 13, 2025
ddd8d3b
Added to_sql and to_df keyword argument for columns
knassre-bodo Feb 13, 2025
c2084d0
Testing to_sql with columns arg
knassre-bodo Feb 13, 2025
fee822f
Added to_df tests
knassre-bodo Feb 13, 2025
09372ab
Updating usage doc [RUN CI]
knassre-bodo Feb 13, 2025
9a867fb
Revising notebooks [RUN CI]
knassre-bodo Feb 13, 2025
b465a34
Fixing typo
knassre-bodo Feb 13, 2025
20bfd0b
more doc fixes
knassre-bodo Feb 13, 2025
12e85e9
Resolving massive conflits after pulling in big changes from main, ne…
knassre-bodo Feb 19, 2025
07493bb
Fixed most of the correlation bugs
knassre-bodo Feb 19, 2025
afb332b
Fixed partition as child bugs with backreferences as keys, need to de…
knassre-bodo Feb 19, 2025
a4f5ed8
Fixed partition as child bugs with backreferences as keys, need to de…
knassre-bodo Feb 19, 2025
c627001
Fixing remaining correlation issues [RUN CI]
knassre-bodo Feb 19, 2025
6fc42b3
Added Hadia's initial quick fixes
knassre-bodo Feb 19, 2025
8b0aff9
Getting remaining tests online [RUN CI]
knassre-bodo Feb 20, 2025
a8e604c
Merge remote-tracking branch 'origin/kian/back_overhaul' into kian/ba…
knassre-bodo Feb 20, 2025
ac26ece
Further revisions
knassre-bodo Feb 21, 2025
588501e
Update pydough/conversion/relational_converter.py
knassre-bodo Feb 21, 2025
4b81f1b
Making agg join keys deterministically sorted
knassre-bodo Feb 21, 2025
68f614c
Merge remote-tracking branch 'origin/kian/back_overhaul' into kian/ba…
knassre-bodo Feb 21, 2025
2502a97
Overhaul to how partition child handles backrefs to reconcile convolu…
knassre-bodo Feb 24, 2025
764e134
Added one more extreme edge case for the partition child behavior [RU…
knassre-bodo Feb 24, 2025
8c155e5
Resolving conflicts [RUN CI]
knassre-bodo Feb 24, 2025
2fe4d4d
Revising notebooks [RUN CI]
knassre-bodo Feb 26, 2025
8c70a13
Revisions [RUN CI]
knassre-bodo Feb 26, 2025
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -132,6 +132,7 @@ venv/
ENV/
env.bak/
venv.bak/
.vscode

# Spyder project settings
.spyderproject
Expand Down
2 changes: 1 addition & 1 deletion README.md
Expand Up @@ -18,7 +18,7 @@ Suppose I want to know for every person their name & the total income they've ma
The following PyDough snippet solves this problem:

```py
result = People(
result = People.CALCULATE(
Contributor Author

@knassre-bodo knassre-bodo Feb 13, 2025

We need to ensure all of these got updated, in all of our documentation, notebooks, tests, etc.

name,
net_income = SUM(jobs.income_earned) - SUM(schools.tuition_paid)
)
Expand Down
16 changes: 10 additions & 6 deletions demos/notebooks/1_introduction.ipynb
Expand Up @@ -101,7 +101,7 @@
"source": [
Contributor Author

@knassre-bodo knassre-bodo Feb 13, 2025

At least 1 reviewer should re-run all 5 notebooks to confirm they behave as expected.

"%%pydough\n",
"\n",
"nations(key, name)"
"nations.CALCULATE(nkey=key, nname=name)"
]
},
{
Expand All @@ -121,7 +121,7 @@
"source": [
"%%pydough\n",
"\n",
"nation_keys = nations(key, name)"
"nation_keys = nations.CALCULATE(nkey=key, nname=name)"
]
},
{
Expand Down Expand Up @@ -149,7 +149,7 @@
"source": [
"%%pydough\n",
"\n",
"lowest_customer_nations = nation_keys(key, name, cust_count=COUNT(customers)).TOP_K(2, by=cust_count.ASC())\n",
"lowest_customer_nations = nation_keys.CALCULATE(nkey, nname, cust_count=COUNT(customers)).TOP_K(2, by=cust_count.ASC())\n",
"lowest_customer_nations"
]
},
Expand Down Expand Up @@ -236,7 +236,9 @@
"id": "f52dfcfe-6e90-44b8-b9c4-7dc08a5b28ca",
"metadata": {},
"source": [
"Finally, while building a statement from smaller components is best practice in Pydough, you can always evaluate the entire expression all at once within a PyDough cell, such as this example that loads the all asian nations in the dataset."
"Finally, while building a statement from smaller components is best practice in Pydough, you can always evaluate the entire expression all at once within a PyDough cell, such as this example that loads the all Asian nations in the dataset.\n",
"\n",
"We can use the optional `columns` argument to `to_sql` or `to_df` to specify which columns to include, or even what they should be renamed as."
]
},
{
Expand All @@ -248,7 +250,9 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_df(nations.WHERE(region.name == \"ASIA\"))"
"asian_countries = nations.WHERE(region.name == \"ASIA\")\n",
"print(pydough.to_df(asian_countries, columns=[\"name\", \"key\"]))\n",
"pydough.to_df(asian_countries, columns={\"nation_name\": \"name\", \"id\": \"key\"})"
]
},
{
Expand Down Expand Up @@ -290,7 +294,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
"version": "3.12.6"
}
},
"nbformat": 4,
Expand Down
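The notebook change above passes `columns` to `to_df` either as a list (`["name", "key"]`) or as a dict (`{"nation_name": "name", "id": "key"}`). As a rough illustration of those two shapes, here is a plain-Python sketch. It assumes, based on the notebook text, that the dict maps output names to existing names. `select_columns` and the sample rows are hypothetical and are not part of PyDough.

```python
# Hypothetical helper illustrating the list-vs-dict shapes of a `columns`
# argument. This is NOT PyDough's implementation, only a sketch of the idea.
from collections.abc import Mapping


def select_columns(rows, columns):
    """Keep (and optionally rename) fields of each row.

    `columns` is either a list of existing names, or a dict that maps
    output name -> existing name (assumed direction, per the notebook text).
    """
    if isinstance(columns, Mapping):
        pairs = list(columns.items())
    else:
        pairs = [(name, name) for name in columns]
    # Build each output row in the order the columns were requested.
    return [{out: row[src] for out, src in pairs} for row in rows]


# Made-up sample rows standing in for the result of a query.
sample_nations = [
    {"key": 8, "name": "INDIA", "region": "ASIA"},
    {"key": 12, "name": "JAPAN", "region": "ASIA"},
]

as_list = select_columns(sample_nations, ["name", "key"])
as_dict = select_columns(sample_nations, {"nation_name": "name", "id": "key"})
print(as_list)
print(as_dict)
```

The list form only selects and orders fields, while the dict form selects and renames in one step, which is why a single argument can cover both uses.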
85 changes: 50 additions & 35 deletions demos/notebooks/2_pydough_operations.ipynb
Expand Up @@ -84,9 +84,9 @@
"id": "a25a2965-4f88-4626-b326-caf931fdba9c",
"metadata": {},
"source": [
"## Calc\n",
"## Calculate\n",
"\n",
"The next important operation is the `CALC` operation, which is used by \"calling\" a collection as a function."
"The next important operation is the `CALCULATE` operation, which takes in a variable number of positioning and/or keyword arguments."
]
},
{
Expand All @@ -98,7 +98,7 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_sql(nations(key))"
"print(pydough.to_sql(nations.CALCULATE(key, nation_name=name)))"
]
},
{
Expand All @@ -109,7 +109,10 @@
"Calc has a few purposes:\n",
"* Select which entries you want in the output.\n",
"* Define new fields by calling functions.\n",
"* Allow operations to be evaluated for each entry in the outermost collection's \"context\"."
"* Allow operations to be evaluated for each entry in the outermost collection's \"context\".\n",
"* Define aliases for terms that get down-streamed to descendants ([see here](#down-streaming)).\n",
"\n",
"The terms of the last `CALCULATE` in the PyDough logic are the terms that are included in the result (unless the `columns` argument of `to_sql` or `to_df` is used)."
]
},
{
Expand All @@ -121,15 +124,15 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_sql(nations(key + 1))"
"print(pydough.to_sql(nations.CALCULATE(adjusted_key = key + 1)))"
]
},
{
"cell_type": "markdown",
"id": "24031aa2-1df7-441d-b487-aa093b852504",
"metadata": {},
"source": [
"Here the context is the \"nations\" at the root of the graph. This means that for each entry within nations, we compute the result. This has important implications for when we get to more complex expressions. For example, if we want to know how many nations we have stored in each region, we can do via CALC."
"Here the context is the \"nations\" at the root of the graph. This means that for each entry within nations, we compute the result. This has important implications for when we get to more complex expressions. For example, if we want to know how many nations we have stored in each region, we can do via `CALCULATE`."
]
},
{
Expand All @@ -141,7 +144,7 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_df(regions(name, nation_count=COUNT(nations)))"
"pydough.to_df(regions.CALCULATE(name, nation_count=COUNT(nations)))"
]
},
{
Expand All @@ -151,7 +154,7 @@
"source": [
"Internally, this process evaluates `COUNT(nations)` grouped on each region and then joining the result with the original `regions` table. Importantly, this outputs a \"scalar\" value for each region.\n",
"\n",
"This shows a very important restriction of CALC, each final entry in a calc expression must be scalar with respect to a current context. For example, the expression `regions(region_name=name, nation_name=nations.name)` is not legal because region and nation is a one to many relationship, so there is not a single nation name for each region. \n",
"This shows a very important restriction of `CALCULATE`, each final entry in a calc expression must be scalar with respect to a current context. For example, the expression `regions(region_name=name, nation_name=nations.name)` is not legal because region and nation is a one to many relationship, so there is not a single nation name for each region. \n",
"\n",
"**The cell below will result in an error because it violates this restriction.**"
]
Expand All @@ -165,7 +168,7 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_df(regions(region_name=name, nation_name=nations.name))"
"pydough.to_df(regions.CALCULATE(region_name=name, nation_name=nations.name))"
]
},
{
Expand All @@ -185,7 +188,7 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_df(nations(nation_name=name, region_name=region.name))"
"pydough.to_df(nations.CALCULATE(nation_name=name, region_name=region.name))"
]
},
{
Expand Down Expand Up @@ -216,29 +219,39 @@
"%%pydough\n",
"\n",
"# Numeric operations\n",
"print(pydough.to_sql(nations(key + 1, key - 1, key * 1, key / 1)))\n",
"print(\"Q1\")\n",
"print(pydough.to_sql(nations.CALCULATE(key + 1, key - 1, key * 1, key / 1)))\n",
"\n",
"# Comparison operators\n",
"print(pydough.to_sql(nations(key == 0, key < 0, key != 0, key >= 5)))\n",
"print(\"\\nQ2\")\n",
"print(pydough.to_sql(nations.CALCULATE(key == 0, key < 0, key != 0, key >= 5)))\n",
"\n",
"# String Operations\n",
"print(pydough.to_sql(nations(LENGTH(name), UPPER(name), LOWER(name), STARTSWITH(name, \"A\"))))\n",
"print(\"\\nQ3\")\n",
"print(pydough.to_sql(nations.CALCULATE(LENGTH(name), UPPER(name), LOWER(name), STARTSWITH(name, \"A\"))))\n",
"\n",
"# Boolean operations\n",
"print(pydough.to_sql(nations((key != 1) & (LENGTH(name) > 5)))) # Boolean AND\n",
"print(pydough.to_sql(nations((key != 1) | (LENGTH(name) > 5)))) # Boolean OR\n",
"print(pydough.to_sql(nations(~(LENGTH(name) > 5)))) # Boolean NOT \n",
"print(pydough.to_sql(nations(ISIN(name, (\"KENYA\", \"JAPAN\"))))) # In\n",
"print(\"\\nQ4\")\n",
"print(pydough.to_sql(nations.CALCULATE((key != 1) & (LENGTH(name) > 5)))) # Boolean AND\n",
"print(\"\\nQ5\")\n",
"print(pydough.to_sql(nations.CALCULATE((key != 1) | (LENGTH(name) > 5)))) # Boolean OR\n",
"print(\"\\nQ6\")\n",
"print(pydough.to_sql(nations.CALCULATE(~(LENGTH(name) > 5)))) # Boolean NOT \n",
"print(\"\\nQ7\") \n",
"print(pydough.to_sql(nations.CALCULATE(ISIN(name, (\"KENYA\", \"JAPAN\"))))) # In\n",
"\n",
"# Datetime Operations\n",
"# Note: Since this is based on SQL lite the underlying date is a bit strange.\n",
"print(pydough.to_sql(lines(YEAR(ship_date), MONTH(ship_date), DAY(ship_date),HOUR(ship_date),MINUTE(ship_date),SECOND(ship_date))))\n",
"print(\"\\nQ8\")\n",
"print(pydough.to_sql(lines.CALCULATE(YEAR(ship_date), MONTH(ship_date), DAY(ship_date),HOUR(ship_date),MINUTE(ship_date),SECOND(ship_date))))\n",
"\n",
"# Aggregation operations\n",
"print(pydough.to_sql(TPCH(NDISTINCT(nations.comment), SUM(nations.key))))\n",
"print(\"\\nQ9\")\n",
"print(pydough.to_sql(TPCH.CALCULATE(NDISTINCT(nations.comment), SUM(nations.key))))\n",
"# Count can be used on a column for non-null entries or a collection\n",
"# for total entries.\n",
"print(pydough.to_sql(TPCH(COUNT(nations), COUNT(nations.comment))))"
"print(\"\\nQ10\")\n",
"print(pydough.to_sql(TPCH.CALCULATE(COUNT(nations), COUNT(nations.comment))))"
]
},
{
Expand All @@ -260,9 +273,11 @@
"id": "b70993e8-3cd2-4c45-87e3-8e68f67b92a0",
"metadata": {},
"source": [
"### BACK\n",
"### Down-Streaming\n",
"\n",
"Sometimes you need to load a value from a previous context to use at a later step in a PyDough statement. Any expression from an ancestor context that is placed in a `CALCULATE` is automatically made available to all descendants of that context. However, an error will occur if the name of the term defined in the ancestor collides with a name of a term or property of a descendant context, since PyDough will not know which one to use.\n",
"\n",
"Sometimes you need to load a value from a previous context to use at a later step in a PyDough statement. That can be done using the `BACK` operation. This step moves back `k` steps to find the name you are searching for. This is useful to avoid repeating computation."
"Notice how in the example below, `region_name` is defined in a `CALCULATE` within the context of `regions`, so the calculate within the context of `nations` also has access to `region_name` (interpreted as \"the name of the region that this nation belongs to\")."
]
},
{
Expand All @@ -274,15 +289,15 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_df(regions.nations(region_name=BACK(1).name, nation_name=name))"
"pydough.to_df(regions.CALCULATE(region_name=name).nations.CALCULATE(region_name, nation_name=name))"
]
},
{
"cell_type": "markdown",
"id": "6040a7c5-fc82-4e33-8b2b-a1b3ef394f71",
"metadata": {},
"source": [
"Here is a more complex example showing intermediate values. Here we will first compute `total_value` and then reuse it via `BACK`."
"Here is a more complex example showing intermediate values. Here we will first compute `total_value` and then reuse it downstream."
]
},
{
Expand All @@ -294,7 +309,7 @@
"source": [
"%%pydough\n",
"\n",
"nations_value = nations(name, total_value=SUM(suppliers.account_balance))\n",
"nations_value = nations.CALCULATE(nation_name=name, total_value=SUM(suppliers.account_balance))\n",
"pydough.to_df(nations_value)"
]
},
Expand All @@ -306,12 +321,12 @@
"outputs": [],
"source": [
"%%pydough\n",
"suppliers_value = nations_value.suppliers(\n",
"suppliers_value = nations_value.suppliers.CALCULATE(\n",
" key,\n",
" name,\n",
" nation_name=BACK(1).name,\n",
" nation_name,\n",
" account_balance=account_balance,\n",
" percentage_of_national_value=100 * account_balance / BACK(1).total_value\n",
" percentage_of_national_value=100 * account_balance / total_value\n",
")\n",
"top_suppliers = suppliers_value.TOP_K(20, by=percentage_of_national_value.DESC())\n",
"pydough.to_df(top_suppliers)"
Expand All @@ -324,7 +339,7 @@
"source": [
"## WHERE\n",
"\n",
"The WHERE operation by be used to filter unwanted entries in a context. For example, we can filter `nations` to only consider the `AMERICA` and `EUROPE` regions. A WHERE's context functions similarly to a calc except that it cannot be used to assign new properties. "
"The `WHERE` operation by be used to filter unwanted entries in a context. For example, we can filter `nations` to only consider the `AMERICA` and `EUROPE` regions. A WHERE's context functions similarly to a `CALCULATE` except that it cannot be used to assign new properties; it only contains a single positional argument: the predicate to filter on. "
]
},
{
Expand Down Expand Up @@ -428,10 +443,10 @@
"source": [
"%%pydough\n",
"\n",
"updated_nations = nations(key, name_length=LENGTH(name))\n",
"updated_nations = nations.CALCULATE(key, name_length=LENGTH(name))\n",
"grouped_nations = PARTITION(\n",
" updated_nations, name=\"n\", by=(name_length)\n",
")(\n",
").CALCULATE(\n",
" name_length,\n",
" nation_count=COUNT(n.key)\n",
")\n",
Expand All @@ -446,7 +461,7 @@
"A couple important usage details:\n",
"* The `name` argument specifies the name of the subcollection access from the partitions to the original unpartitioned data.\n",
"* `keys` can be either be a single expression or a tuple of them, but it can only be references to expressions that already exist in the context of the data (e.g. `name`, not `LOWER(name)` or `region.name`)\n",
"* `BACK` should be used to step back into the partition child without retaining the partitioning. An example is shown below where we select brass european parts but only with the minimum supply cost."
"* Terms defined from the context of the `PARTITION` can be down-streamed to its descendants. An example is shown below where we select brass parts of size 15, but only the ones whose supply is below the average of all such parts."
]
},
{
Expand All @@ -459,8 +474,8 @@
"%%pydough\n",
"\n",
"selected_parts = parts.WHERE(ENDSWITH(part_type, \"BRASS\") & (size == 15))\n",
"part_types = PARTITION(selected_parts, name=\"p\", by=part_type)(avg_price=AVG(p.retail_price))\n",
"output = part_types.p.WHERE(retail_price < BACK(1).avg_price)\n",
"part_types = PARTITION(selected_parts, name=\"p\", by=part_type).CALCULATE(avg_price=AVG(p.retail_price))\n",
"output = part_types.p.WHERE(retail_price < avg_price)\n",
"pydough.to_df(output)"
]
},
Expand Down Expand Up @@ -532,7 +547,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
"version": "3.12.6"
}
},
"nbformat": 4,
Expand Down
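The "Down-Streaming" cell in the notebook diff above says that terms placed in an ancestor's `CALCULATE` become visible to descendant contexts, and that a name collision with a descendant's own term or property is an error. The toy model below sketches that rule in plain Python. The function name, the use of `ValueError`, and the sample data are illustrative assumptions and do not reflect PyDough internals.

```python
# Toy model of alias down-streaming. Names and the error type are
# illustrative assumptions, not PyDough's actual implementation.

def visible_terms(ancestor_aliases, own_properties):
    """Return the terms a descendant context can reference.

    Raises ValueError when an ancestor alias has the same name as one of
    the descendant's own properties, since the reference would be ambiguous.
    """
    clashes = set(ancestor_aliases) & set(own_properties)
    if clashes:
        raise ValueError(f"ambiguous term(s): {sorted(clashes)}")
    return {**ancestor_aliases, **own_properties}


# Made-up sample records for one region and one of its nations.
region = {"name": "ASIA"}
nation = {"name": "JAPAN", "key": 12}

# Mirrors regions.CALCULATE(region_name=name): the alias `region_name`
# does not clash with the nation's own `name`.
aliases = {"region_name": region["name"]}
terms = visible_terms(aliases, nation)
row = {"region_name": terms["region_name"], "nation_name": terms["name"]}
print(row)

# Down-streaming the un-aliased `name` collides with the nation's `name`.
try:
    visible_terms({"name": region["name"]}, nation)
    collided = False
except ValueError:
    collided = True
print(collided)
```

This is also why the updated notebook cells rename terms such as `name` to `region_name` or `nation_name` before accessing a subcollection.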
6 changes: 3 additions & 3 deletions demos/notebooks/3_exploration.ipynb
Expand Up @@ -308,7 +308,7 @@
"\n",
"orders_1995 = customers.orders.WHERE(YEAR(order_date) == 1995)\n",
"\n",
"asian_countries_info = asian_countries(country_name=LOWER(name), total_orders=COUNT(orders_1995))\n",
"asian_countries_info = asian_countries.CALCULATE(country_name=LOWER(name), total_orders=COUNT(orders_1995))\n",
"\n",
"top_asian_countries = asian_countries_info.TOP_K(3, by=total_orders.DESC())\n",
"\n",
Expand Down Expand Up @@ -408,7 +408,7 @@
"source": [
"Here, we learn that `customers.orders` invokes a child of the current context (`nations.WHERE(region.name == 'ASIA')`) by accessing the `customers` subcollection, then accessing its `orders` collection, then filtering it on the conedition `YEAR(order_date) == 1995`. \n",
"\n",
"We also know that this resulting child is plural with regards to the context, meaning that `asian_countries(asian_countries.order_date)` would be illegal, but `asian_countries(MAX(asian_countries.order_date))` is legal.\n",
"We also know that this resulting child is plural with regards to the context, meaning that `asian_countries.CALCULATE(asian_countries.order_date)` would be illegal, but `asian_countries.CALCULATE(MAX(asian_countries.order_date))` is legal.\n",
"\n",
"More combinations of `pydough.explain` and `pydough.explain_terms` can be done to learn more about what each of these components does."
]
Expand Down Expand Up @@ -438,7 +438,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
"version": "3.12.6"
}
},
"nbformat": 4,
Expand Down