Adding a Jupyter Notebook guide to the PyDough documentation #253

Open
wants to merge 1 commit into
base: main

aramirez-bodo

Information added:

Introduced a new Jupyter Notebook guide for using PyDough effectively.
Walkthrough of setting up the PyDough
Explained key PyDough features with code snippets and example use cases.
Added step-by-step instructions for integrating PyDough with SQL-based workflows.

Feedback Incorporation:
Enhanced query clarity and consistency across sections, including adding context to later queries (aligned with earlier examples).
Simplified query syntax, removing unnecessary comparisons like ==1 with HASNOT, and streamlined SUM(IFF(X, 1, 0)) to SUM(X) for better efficiency.
Removed redundant use of PARTITION by aggregating at the collection level where possible (e.g., per-nation and per-region).


@knassre-bodo knassre-bodo left a comment


Fantastic job Aramis, this notebook is filled with very useful examples! My comments are largely about clarity / textual corrections, though there are also a few cases where the PyDough code could be rearranged.

@@ -0,0 +1,3391 @@
{

Can you make sure that this notebook file is linked & explained in all the places where the other 5 are?

  • At the bottom of 1_introduction.ipynb
  • At the bottom of the README.md for the demos folder

The explanation should include descriptions that indicate why someone should spend time with the notebook; what do they gain from it?


Also, I recommend changing the name of the notebook slightly to align with its purpose, relative to the other 5 demo notebooks. What about 6_sql_to_pydough_examples.ipynb? (We can keep iterating on the name if you aren't a fan of that one).

"metadata": {},
"outputs": [],
"source": [
"%load_ext pydough.jupyter_extensions\n",

Move this cell just below the first cell explaining the purpose of the notebook

Comment on lines +279 to +285
{
"cell_type": "markdown",
"metadata": {},
"source": [
"regions(): Refers to the regions collection (similar to the regions table in SQL).\n"
]
},

This is a bit out of place?

Comment on lines +356 to +357
"outputs": [
{

Some of the cells have already been run, and others have not. I recommend making it so all of the cells start out as not having been run.
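One way to reset every cell to the "not run" state is to clear the outputs and execution counts in the notebook JSON. The sketch below uses only the standard library; `clear_outputs` and the tiny `example` notebook are hypothetical names for illustration, not part of this PR.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with every code cell reset to 'not run'."""
    # Deep copy through a JSON round trip so the input is left untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical two-cell notebook: one markdown cell, one code cell that was run.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hi\n"]}
            ],
            "source": ["print('hi')\n"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = clear_outputs(example)
```

The same effect is usually available from the command line with `jupyter nbconvert --clear-output --inplace <notebook>`.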

Comment on lines +274 to +276
"filter_c= nations(key, name, comment)\n",
"\n",
"pydough.to_df(filter_c)\n"

Two things here:

  • What is filter_c? Can this name be changed to something that makes sense in the context of the question?
  • Technically the calc is unnecessary; can just do nations as the equivalent of SELECT * (the code you wrote is technically not a SELECT * since it doesn't select the region_key)

Comment on lines +2340 to +2347
"selected_customers_by_nation = nations(\n",
" region_name=region.name, \n",
" nation_name=name, \n",
" max_order_value=MAX(customers.orders.total_price), \n",
" min_order_value=MIN(customers.orders.total_price), \n",
" order_value_difference=MAX(customers.orders.total_price) - MIN(customers.orders.total_price), \n",
" total_orders=COUNT(customers.orders.total_price) \n",
")\n",

Can use variables to help out here (and don't need to do .total_price when doing COUNT if you want an equivalent of COUNT(*)):

order_prices = customers.orders.total_price
smallest_order_value = MIN(order_prices)
largest_order_value = MAX(order_prices)
selected_customers_by_nation = nations(
    region_name=region.name,  
    nation_name=name,  
    max_order_value=largest_order_value,
    min_order_value=smallest_order_value,
    order_value_difference=largest_order_value - smallest_order_value,
    total_orders=COUNT(customers.orders)  
)

"cell_type": "markdown",
"metadata": {},
"source": [
"### 12. Nations with the Most Customers in a Specific Market Segment\n",

This one doesn't need PARTITION either. Can rewrite as:

customer_in_target_mktsegment = customers.WHERE(ISIN(mktsegment, ('MACHINERY', 'AUTOMOBILE')))
nation_customer_count = nations(
    nation_name=name, 
    customer_count=COUNT(customer_in_target_mktsegment)  
).ORDER_BY(customer_count.DESC())  

"cell_type": "markdown",
"metadata": {},
"source": [
"### 14. Region with the Highest Percentage of High-Priority Orders\n",

This one also doesn't need PARTITION since you are aggregating per-region:

high_priority_orders = ISIN(nations.orders.order_priority, ('1-URGENT', '2-HIGH'))
region_priority_summary = regions(
    region_name=name, 
    high_priority_percentage=ROUND(
        (SUM(high_priority_orders) * 100) / COUNT(nations.orders), 2
    ) 
).ORDER_BY(high_priority_percentage.DESC())
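For intuition on why summing the boolean directly gives the same answer as the `SUM(IFF(X, 1, 0))` form, here is a plain-Python analogue (not PyDough code). The `priorities` list is made-up sample data.

```python
# Hypothetical order priorities for one region.
priorities = [
    "1-URGENT", "3-MEDIUM", "2-HIGH", "5-LOW",
    "1-URGENT", "4-NOT SPECIFIED", "3-MEDIUM", "5-LOW",
]

# Boolean flag per order, analogous to the ISIN(...) expression.
is_high = [p in ("1-URGENT", "2-HIGH") for p in priorities]

# Explicit 1/0 mapping, analogous to SUM(IFF(X, 1, 0)).
sum_iff = sum(1 if flag else 0 for flag in is_high)

# Summing the booleans directly, analogous to SUM(X).
sum_bool = sum(is_high)

high_priority_percentage = round(sum_bool * 100 / len(priorities), 2)
```

Both sums count the same rows, so the simpler form loses nothing.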

Comment on lines +2735 to +2750
"### 15. Customers Who Have Never Placed Orders\n",
"\n",
"The next scenario identifies customers who have not placed any orders. The query retrieves customer details, namely the customer key and name.\n",
"\n",
"```SQL\n",
"SELECT c.c_custkey, c.c_name\n",
"FROM customer c\n",
"LEFT JOIN orders o ON c.c_custkey = o.o_custkey\n",
"WHERE o.o_orderkey IS NULL;\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In PyDough, we use the HAS and HASNOT functions to express the IS NULL check. HAS returns True if at least one record of the sub-collection exists, and HASNOT returns True if no record of the sub-collection exists. The steps are: first, filter the customers who have not placed any orders using the WHERE clause combined with the HASNOT function, which identifies the customers with no associated orders. Then, select these customers by retrieving their unique customer_key and customer_name. "

You can also explain this in terms of EXISTS & NOT EXISTS (which the planner will usually turn into left-joins, under the hood):

SELECT c.c_custkey, c.c_name
FROM customer c
WHERE NOT EXISTS(
  SELECT *
  FROM ORDERS o
  WHERE c.c_custkey = o.o_custkey
)

Under the hood, PyDough turns HAS and HASNOT calls into SEMI and ANTI joins respectively, which then get turned into EXISTS and NOT EXISTS by SQLGlot, depending on the dialect of SQL being used.
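To see that the LEFT JOIN / IS NULL form and the NOT EXISTS form select the same customers, here is a small self-contained check using Python's built-in `sqlite3`. The three customers and three orders are made-up sample rows, and this only exercises the SQL, not PyDough's own translation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
    """
)

# Anti-join written as LEFT JOIN + IS NULL (the form used in the notebook).
left_join_rows = conn.execute(
    """
    SELECT c.c_custkey, c.c_name
    FROM customer c
    LEFT JOIN orders o ON c.c_custkey = o.o_custkey
    WHERE o.o_orderkey IS NULL
    ORDER BY c.c_custkey
    """
).fetchall()

# The same anti-join written with NOT EXISTS.
not_exists_rows = conn.execute(
    """
    SELECT c.c_custkey, c.c_name
    FROM customer c
    WHERE NOT EXISTS (
        SELECT * FROM orders o WHERE c.c_custkey = o.o_custkey
    )
    ORDER BY c.c_custkey
    """
).fetchall()

conn.close()
```

Only the customer with no orders survives in both result sets.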

"source": [
"%%pydough\n",
"\n",
"# Retrieve customers by nation, classifying them into active and inactive based on order history\n",

The PARTITION and NDISTINCT are both unnecessary here. Can rewrite as:

cust_info = customers(is_active=HAS(orders))
nation_customer_summary = nations(
  nation_name = name,
  total_customers = COUNT(cust_info),
  active_customers = SUM(cust_info.is_active),
  inactive_customers = COUNT(cust_info) - SUM(cust_info.is_active),
).ORDER_BY(total_customers.DESC())

If you want an example where NDISTINCT is required, consider something like this:

# Which 10 customers have purchased the most unique parts?
customers(name, n_parts=NDISTINCT(orders.lines.part.key)).TOP_K(10, by=n_parts.DESC())

Here, we need NDISTINCT because COUNT could count a single part more than once: each customer can have multiple orders/lines, and different lines can share the same part.
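A plain-Python analogue of the difference (not PyDough code), using made-up line items for a single customer:

```python
# Hypothetical (order_key, part_key) line items for one customer.
lines = [
    (1, "P-100"),
    (1, "P-200"),
    (2, "P-100"),
    (3, "P-100"),
    (3, "P-300"),
]

part_keys = [part_key for _, part_key in lines]

# Analogous to COUNT: every line contributes, so repeated parts are re-counted.
count_parts = len(part_keys)

# Analogous to NDISTINCT: each part is counted once.
ndistinct_parts = len(set(part_keys))
```

Part `P-100` appears on three lines, which is exactly the over-counting that NDISTINCT avoids.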

If you want another example where PARTITION is required, consider something like this:

# What is the largest amount spent by a single customer in a single year?
# Include the amount, the customer's name, and the year.
cust_year_groups = PARTITION(orders(name=customer.name, year=YEAR(order_date)), name="o", by=(name, year))
result = cust_year_groups(name, year, total=SUM(o.total_price)).TOP_K(1, by=total.DESC())
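The grouping this question needs can be sketched in plain Python (again, an analogue rather than PyDough): the key is the (customer name, year) pair, which is not an existing collection, so the rows have to be partitioned on it. The `orders` rows below are made-up sample data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical orders: (customer name, order date, total price).
orders = [
    ("Alice", date(2023, 1, 5), 100.0),
    ("Alice", date(2023, 7, 9), 250.0),
    ("Alice", date(2024, 2, 1), 300.0),
    ("Bob", date(2023, 3, 3), 320.0),
    ("Bob", date(2024, 4, 4), 50.0),
]

# Partition by (name, year) and sum the order totals within each group.
totals = defaultdict(float)
for name, order_date, total_price in orders:
    totals[(name, order_date.year)] += total_price

# Keep the single largest group, i.e. top 1 by total, descending.
(best_name, best_year), best_total = max(totals.items(), key=lambda item: item[1])
```

Here the winner is a (customer, year) pair whose total comes from two orders, which no per-customer or per-order aggregation would surface.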

"source": [
"This query filters customers based on specific conditions related to their account balance, order count, and geographical region. It checks if a customer’s account balance is negative, if they have made at least 5 orders, if their region is \"AMERICA,\" and if they are not from Brazil. The WHERE clause applies all these conditions using & (AND) to ensure that all must be true for a customer to be included in the results.\n",
"\n",
"PyDough does not yet support the AND, OR, NOT, IN expressions, as well as trying in-between comparisons like (1 < x < 5)"

For now, it is better to say "PyDough does not support the AND, OR, NOT, IN expressions", since we don't have current plans to support them.


Agreed; it would require a much more aggressive form of Python "hacking" in order to support these syntaxes, since we could not rely on magic methods to do custom overloading of the operators.
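A minimal sketch of the limitation, using a hypothetical `Expr` class (this is not PyDough's actual implementation): `&`, `|`, and `~` dispatch to magic methods and can build an expression, while `and`, `or`, `not`, and chained comparisons go through `__bool__`, and `in` is resolved by the container and coerced to a plain bool.

```python
class Expr:
    """Hypothetical expression node that records operations as text."""

    def __init__(self, text: str) -> None:
        self.text = text

    @staticmethod
    def _wrap(value) -> "Expr":
        return value if isinstance(value, Expr) else Expr(repr(value))

    def __and__(self, other) -> "Expr":
        return Expr(f"({self.text} AND {Expr._wrap(other).text})")

    def __or__(self, other) -> "Expr":
        return Expr(f"({self.text} OR {Expr._wrap(other).text})")

    def __invert__(self) -> "Expr":
        return Expr(f"(NOT {self.text})")

    def __lt__(self, other) -> "Expr":
        return Expr(f"({self.text} < {Expr._wrap(other).text})")

    def __gt__(self, other) -> "Expr":
        return Expr(f"({self.text} > {Expr._wrap(other).text})")

    def __bool__(self) -> bool:
        # `and`, `or`, `not`, and chained comparisons all end up here.
        raise TypeError("use &, |, ~ instead of and, or, not")


x = Expr("x")

# Overloadable operators build a new expression.
combined = (x > 1) & (x < 5)
negated = ~(x < 5)

# `1 < x < 5` means `(1 < x) and (x < 5)`, so it needs bool() of an Expr.
try:
    _ = 1 < x < 5
    chained_error = None
except TypeError as err:
    chained_error = str(err)

# `in` is answered by the tuple itself and always yields a plain bool.
membership = x in (1, 2)
```

This is why the notebook's `&`, `|`, `~`, and `ISIN` forms are the ones to document.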

