Adding a Jupyter Notebook guide to the PyDough documentation #253
Information added:
Introduced a new Jupyter Notebook guide for using PyDough effectively. Walkthrough of setting up the PyDough
Explained key PyDough features with code snippets and example use cases. Added step-by-step instructions for integrating PyDough with SQL-based workflows.
Fantastic job Aramis, this notebook is filled with very useful examples! My comments are largely about clarity / textual corrections, though there are also a few cases where the PyDough code could be rearranged.
@@ -0,0 +1,3391 @@ | |||
{ |
Can you make sure that this notebook file is linked & explained in all the places where the other 5 are?
- At the bottom of `1_introduction.ipynb`
- At the bottom of the `README.md` for the demos folder
The explanation should include descriptions that indicate why someone should spend time with the notebook; what do they gain from it?
Also, I recommend changing the name of the notebook slightly to align with its purpose, relative to the other 5 demo notebooks. What about `6_sql_to_pydough_examples.ipynb`? (We can keep iterating on the name if you aren't a fan of that one).
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%load_ext pydough.jupyter_extensions\n", |
Move this cell just below the first cell explaining the purpose of the notebook
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"regions(): Refers to the regions collection (similar to the regions table in SQL).\n" | ||
] | ||
}, |
This is a bit out of place?
"outputs": [ | ||
{ |
Some of the cells have already been run, and others have not. I recommend making it so all of the cells start out as not having been run.
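One way to get every cell back to a "never run" state is to blank the `outputs` and `execution_count` fields of each code cell in the notebook JSON. The sketch below does this with only the standard library; the helper name `clear_outputs` and the tiny in-memory notebook are made up for illustration, and `jupyter nbconvert --clear-output --inplace` is a ready-made alternative.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with every code cell reset to 'not run'."""
    cleaned = json.loads(json.dumps(notebook))  # simple deep copy of JSON-like data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop any stored results
            cell["execution_count"] = None  # serialized as null in the .ipynb file
    return cleaned


# Hypothetical two-cell notebook: one markdown cell, one code cell that was run.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
            "source": ["print('hi')"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = clear_outputs(example)
```

To apply this to a real file, load it with `json.load`, pass the dict through the helper, and write it back with `json.dump`.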
"filter_c= nations(key, name, comment)\n", | ||
"\n", | ||
"pydough.to_df(filter_c)\n" |
Two things here:
- What is `filter_c`? Can this name be changed to something that makes sense in the context of the question?
- Technically the calc is unnecessary; can just do `nations` as the equivalent of `SELECT *` (the code you wrote is technically not a `SELECT *` since it doesn't select the `region_key`).
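To make the `SELECT *` point concrete on the SQL side, here is a small sketch using an in-memory SQLite table. The column names follow the usual TPC-H naming (`n_nationkey`, `n_regionkey`, ...), which is an assumption about how the PyDough `key` / `region_key` properties map to the underlying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nation ("
    "n_nationkey INTEGER, n_name TEXT, n_regionkey INTEGER, n_comment TEXT)"
)
conn.execute("INSERT INTO nation VALUES (0, 'ALGERIA', 0, 'example comment')")


def column_names(query: str) -> list:
    """Return the output column names of a SELECT statement."""
    return [col[0] for col in conn.execute(query).description]


star_columns = column_names("SELECT * FROM nation")
picked_columns = column_names("SELECT n_nationkey, n_name, n_comment FROM nation")

# Columns returned by SELECT * but dropped by the explicit column list.
missing = set(star_columns) - set(picked_columns)
```

Here `missing` contains only `n_regionkey`, which is the column the explicit calc leaves out.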
"selected_customers_by_nation = nations(\n", | ||
" region_name=region.name, \n", | ||
" nation_name=name, \n", | ||
" max_order_value=MAX(customers.orders.total_price), \n", | ||
" min_order_value=MIN(customers.orders.total_price), \n", | ||
" order_value_difference=MAX(customers.orders.total_price) - MIN(customers.orders.total_price), \n", | ||
" total_orders=COUNT(customers.orders.total_price) \n", | ||
")\n", |
Can use variables to help out here (and don't need to do `.total_price` when doing `COUNT` if you want an equivalent of `COUNT(*)`):

    order_prices = customers.orders.total_price
    smallest_order_value = MIN(order_prices)
    largest_order_value = MAX(order_prices)
    selected_customers_by_nation = nations(
        region_name=region.name,
        nation_name=name,
        max_order_value=largest_order_value,
        min_order_value=smallest_order_value,
        order_value_difference=largest_order_value - smallest_order_value,
        total_orders=COUNT(customers.orders)
    )
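For readers coming from SQL, the distinction being drawn is the one between `COUNT(*)` (counts rows) and `COUNT(column)` (counts non-NULL values of that column). The sketch below shows only the SQL behaviour in SQLite, with a made-up table containing one NULL price; whether PyDough's `COUNT` on a property behaves identically is an assumption this example does not verify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (o_orderkey INTEGER, o_totalprice REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100.0), (2, None), (3, 250.0)],  # the second order has no price recorded
)

row_count, price_count, price_spread = conn.execute(
    "SELECT COUNT(*), "                              # every row
    "COUNT(o_totalprice), "                          # only rows with a non-NULL price
    "MAX(o_totalprice) - MIN(o_totalprice) "         # difference between extremes
    "FROM orders"
).fetchone()
```

When the column has no NULLs the two counts agree, so counting the collection itself is the clearer way to say "how many orders".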
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### 12. Nations with the Most Customers in a Specific Market Segment\n", |
This one doesn't need `PARTITION` either. Can rewrite as:

    customer_in_target_mktsegment = customers.WHERE(ISIN(mktsegment, ('MACHINERY', 'AUTOMOBILE')))
    nation_customer_count = nations(
        nation_name=name,
        customer_count=COUNT(customer_in_target_mktsegment)
    ).ORDER_BY(customer_count.DESC())
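As a rough SQL counterpart of aggregating directly per nation, the sketch below counts matching customers with a `LEFT JOIN` and a plain `GROUP BY`, so nations with no matching customers still appear with a count of 0. The tables and rows are invented for illustration, and the query is one plausible hand-written equivalent, not the SQL that PyDough actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nation (n_nationkey INTEGER PRIMARY KEY, n_name TEXT);
    CREATE TABLE customer (
        c_custkey INTEGER PRIMARY KEY, c_nationkey INTEGER, c_mktsegment TEXT
    );
    INSERT INTO nation VALUES (1, 'BRAZIL'), (2, 'CANADA'), (3, 'PERU');
    INSERT INTO customer VALUES
        (10, 1, 'MACHINERY'), (11, 1, 'AUTOMOBILE'), (12, 1, 'BUILDING'),
        (13, 2, 'MACHINERY'), (14, 3, 'HOUSEHOLD');
    """
)

query = """
SELECT n.n_name AS nation_name, COUNT(c.c_custkey) AS customer_count
FROM nation n
LEFT JOIN customer c
  ON c.c_nationkey = n.n_nationkey
 AND c.c_mktsegment IN ('MACHINERY', 'AUTOMOBILE')  -- filter before counting
GROUP BY n.n_nationkey, n.n_name
ORDER BY customer_count DESC, nation_name
"""
rows = conn.execute(query).fetchall()
```

Putting the segment filter in the `ON` clause rather than in `WHERE` is what keeps zero-count nations in the result.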
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### 14. Region with the Highest Percentage of High-Priority Orders\n", |
This one also doesn't need `PARTITION` since you are aggregating per-region:

    high_priority_orders = ISIN(nations.orders.order_priority, ('1-URGENT', '2-HIGH'))
    region_priority_summary = regions(
        region_name=name,
        high_priority_percentage=ROUND(
            (SUM(high_priority_orders) * 100) / COUNT(nations.orders), 2
        )
    ).ORDER_BY(high_priority_percentage.DESC())
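The `SUM` over a boolean condition works because a true/false value sums as 1/0. The sketch below checks that idea in SQLite, where a comparison evaluates to an integer 0 or 1, and compares it against the longer `CASE WHEN` form. The flattened `orders` table with a `region_name` column is a simplification invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (region_name TEXT, o_orderpriority TEXT);
    INSERT INTO orders VALUES
        ('ASIA', '1-URGENT'), ('ASIA', '2-HIGH'), ('ASIA', '5-LOW'), ('ASIA', '3-MEDIUM'),
        ('EUROPE', '1-URGENT'), ('EUROPE', '5-LOW'), ('EUROPE', '5-LOW'), ('EUROPE', '3-MEDIUM');
    """
)

query = """
SELECT
    region_name,
    SUM(o_orderpriority IN ('1-URGENT', '2-HIGH')) AS n_high_bool,
    SUM(CASE WHEN o_orderpriority IN ('1-URGENT', '2-HIGH') THEN 1 ELSE 0 END) AS n_high_case,
    ROUND(SUM(o_orderpriority IN ('1-URGENT', '2-HIGH')) * 100.0 / COUNT(*), 2)
        AS high_priority_percentage
FROM orders
GROUP BY region_name
ORDER BY high_priority_percentage DESC, region_name
"""
rows = conn.execute(query).fetchall()
```

The sketch multiplies by `100.0` so the division is done in floating point; with integer operands some SQL dialects would truncate the percentage.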
"### 15. Customers Who Have Never Placed Orders\n", | ||
"\n", | ||
"The next situation consists in identifying customers who have not placed any orders. The query retrieves the customer details, such as their customer key and name.\n", | ||
"\n", | ||
"```SQL\n", | ||
"SELECT c.c_custkey, c.c_name\n", | ||
"FROM customer c\n", | ||
"LEFT JOIN orders o ON c.c_custkey = o.o_custkey\n", | ||
"WHERE o.o_orderkey IS NULL;\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In Pydough we use the function HAS/HASNOT to resolve the is null statement. HAS returns True if at least one record of the sub-collection exists and HASNOT returns True if at least one record of the sub-collection does'nt exists. So the steps to follow are first filtering the customers who have not placed any orders using the WHERE clause combined with the HASNOT function. This identifies the customers who have no associated orders. Then, we select these customers by retrieving their unique customer_key and customer_name. " |
You can also explain this in terms of `EXISTS` & `NOT EXISTS` (which the planner will usually turn into left-joins, under the hood):

    SELECT c.c_custkey, c.c_name
    FROM customer c
    WHERE NOT EXISTS(
        SELECT *
        FROM ORDERS o
        WHERE c.c_custkey = o.o_custkey
    )

Under the hood, PyDough turns `HAS` and `HASNOT` calls into SEMI and ANTI joins respectively, which then get turned into `EXISTS` and `NOT EXISTS` by SQLGlot, depending on the dialect of SQL being used.
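A quick way to convince yourself that the `LEFT JOIN ... IS NULL` form and the `NOT EXISTS` form select the same customers is to run both against the same data. The sketch below does that in SQLite with three made-up customers, one of whom has no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER);
    INSERT INTO customer VALUES (1, 'Customer#1'), (2, 'Customer#2'), (3, 'Customer#3');
    INSERT INTO orders VALUES (100, 1), (101, 1), (102, 3);
    """
)

left_join_query = """
SELECT c.c_custkey, c.c_name
FROM customer c
LEFT JOIN orders o ON c.c_custkey = o.o_custkey
WHERE o.o_orderkey IS NULL
ORDER BY c.c_custkey
"""

not_exists_query = """
SELECT c.c_custkey, c.c_name
FROM customer c
WHERE NOT EXISTS (
    SELECT * FROM orders o WHERE c.c_custkey = o.o_custkey
)
ORDER BY c.c_custkey
"""

via_left_join = conn.execute(left_join_query).fetchall()
via_not_exists = conn.execute(not_exists_query).fetchall()
```

Both queries return only the customer with no orders, which is the anti-join behaviour described for `HASNOT`.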
"source": [ | ||
"%%pydough\n", | ||
"\n", | ||
"# Retrieve customers by nation, classifying them into active and inactive based on order history\n", |
The `PARTITION` and `NDISTINCT` are both unnecessary here. Can rewrite as:

    cust_info = customers(is_active=HAS(orders))
    nation_customer_summary = nations(
        nation_name = name,
        total_customers = COUNT(cust_info),
        active_customers = SUM(cust_info.is_active),
        inactive_customers = COUNT(cust_info) - SUM(cust_info.is_active),
    ).ORDER_BY(total_customers.DESC())

If you want an example where `NDISTINCT` is required, consider something like this:

    # Which 10 customers have purchased the most unique parts?
    customers(name, n_parts=NDISTINCT(orders.lines.part.key)).TOP_K(10, by=n_parts.DESC())

Here, we need `NDISTINCT` because if we used `COUNT`, we could count a single part more than once, since each customer can have multiple orders/lines, and different lines can share the same part.

If you want another example where `PARTITION` is required, consider something like this:

    # What is the largest amount spent by a single customer in a single year?
    # Include the amount, the customer's name, and the year.
    cust_year_groups = PARTITION(orders(year=YEAR(order_date)), name="o", by=(name, year))
    result = cust_year_groups(name, year, total=SUM(o.total_price)).TOP_K(1, by=total.DESC())
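The difference between counting and distinct-counting is easy to see in plain SQL. The sketch below uses a made-up, flattened `purchases` table (one row per line item) in SQLite and compares `COUNT` with `COUNT(DISTINCT ...)`; it illustrates the concept only and says nothing about the SQL PyDough emits for `NDISTINCT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (custkey INTEGER, partkey INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    # Customer 1 bought part 7 on two different line items, plus part 8.
    [(1, 7), (1, 7), (1, 8), (2, 9)],
)

rows = conn.execute(
    "SELECT custkey, COUNT(partkey) AS n_lines, COUNT(DISTINCT partkey) AS n_parts "
    "FROM purchases GROUP BY custkey ORDER BY custkey"
).fetchall()
```

Customer 1 has three line items but only two unique parts, which is why a plain count would overstate the answer to "most unique parts".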
"source": [ | ||
"This query filters customers based on specific conditions related to their account balance, order count, and geographical region. It checks if a customer’s account balance is negative, if they have made at least 5 orders, if their region is \"AMERICA,\" and if they are not from Brazil. The WHERE clause applies all these conditions using & (AND) to ensure that all must be true for a customer to be included in the results.\n", | ||
"\n", | ||
"PyDough does not yet support the AND, OR, NOT, IN expressions, as well as trying in-between comparisons like (1 < x < 5)" |
For now, it is better to say "PyDough does not support the AND, OR, NOT, IN expressions", since we don't have current plans to support them.
Agreed; it would require a much more aggressive form of Python "hacking" in order to support these syntaxes, since we could not rely on magic methods to do custom overloading of the operators.
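To illustrate why magic methods are not enough, here is a toy expression class (hypothetical, not PyDough's implementation). `&`, `|`, `~` and the comparison operators dispatch to overloadable methods, so they can build an expression tree. `and`, `or`, `not`, chained comparisons, and the result of `in` all go through Python's truth test (`__bool__`), which has to return a real `bool` and therefore cannot return a new expression.

```python
class Expr:
    """Toy expression node that records operations instead of evaluating them."""

    def __init__(self, text):
        self.text = text

    @staticmethod
    def _show(value):
        return value.text if isinstance(value, Expr) else repr(value)

    # Overloadable operators: each returns a new expression node.
    def __lt__(self, other):
        return Expr(f"({self.text} < {self._show(other)})")

    def __gt__(self, other):
        return Expr(f"({self.text} > {self._show(other)})")

    def __and__(self, other):
        return Expr(f"({self.text} AND {self._show(other)})")

    def __or__(self, other):
        return Expr(f"({self.text} OR {self._show(other)})")

    def __invert__(self):
        return Expr(f"(NOT {self.text})")

    def __contains__(self, item):
        # Python converts whatever this returns to bool, so the node is lost.
        return Expr(f"({self._show(item)} IN {self.text})")

    def __bool__(self):
        # `and`, `or`, `not`, and chained comparisons all end up here.
        raise TypeError("use &, | and ~ instead of and, or and not")


def raises_type_error(thunk):
    """Return True if calling thunk() raises TypeError."""
    try:
        thunk()
    except TypeError:
        return True
    return False


x = Expr("x")
built = ((x > 1) & (x < 5)).text   # works: & and the comparisons are overloadable
negated = (~(x > 1)).text          # works: ~ maps to __invert__

chained_fails = raises_type_error(lambda: 1 < x < 5)          # (1 < x) and (x < 5)
and_fails = raises_type_error(lambda: (x > 1) and (x < 5))
not_fails = raises_type_error(lambda: not (x > 1))
in_fails = raises_type_error(lambda: 3 in x)                  # result coerced to bool
```

This is why the notebook's filters are written with `&`, `|`, `~` and `ISIN(...)` rather than the Python keywords.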
Feedback Incorporation:
Enhanced query clarity and consistency across sections, including adding context to later queries (aligned with earlier examples).
Simplified query syntax, removing unnecessary comparisons like ==1 with HASNOT, and streamlined SUM(IFF(X, 1, 0)) to SUM(X) for better efficiency.
Removed redundant use of PARTITION by aggregating at the collection level where possible (e.g., per-nation and per-region).