Adding a Jupyter Notebook guide to the PyDough documentation #253

Open

aramirez-bodo wants to merge 1 commit into main from aramis/pydough_notebook

+3,391 −0

aramirez-bodo commented Feb 7, 2025

Information added:

Introduced a new Jupyter Notebook guide for using PyDough effectively.
Walkthrough of setting up the PyDough
Explained key PyDough features with code snippets and example use cases.
Added step-by-step instructions for integrating PyDough with SQL-based workflows.

Feedback Incorporation:
Enhanced query clarity and consistency across sections, including adding context to later queries (aligned with earlier examples).
Simplified query syntax, removing unnecessary comparisons like ==1 with HASNOT, and streamlined SUM(IFF(X, 1, 0)) to SUM(X) for better efficiency.
Removed redundant use of PARTITION by aggregating at the collection level where possible (e.g., per-nation and per-region).


          Adding a Jupyter Notebook guide to the PyDough documentation

619de66


aramirez-bodo assigned knassre-bodo

aramirez-bodo requested a review from knassre-bodo

February 7, 2025 21:38

aramirez-bodo unassigned knassre-bodo

knassre-bodo reviewed


Contributor

knassre-bodo left a comment

Fantastic job Aramis, this notebook is filled with very useful examples! My comments are largely about clarity / textual corrections, though there are also a few cases where the PyDough code could be rearranged.

demos/notebooks/6_pydough_TCPH_guide.ipynb

    
            @@ -0,0 +1,3391 @@
          
              {

Contributor

knassre-bodo Feb 7, 2025

Can you make sure that this notebook file is linked & explained in all the places where the other 5 are?

At the bottom of 1_introduction.ipynb
At the bottom of the README.md for the demos folder

The explanation should include descriptions that indicate why someone should spend time with the notebook; what do they gain from it?

Contributor

knassre-bodo Feb 7, 2025

Also, I recommend changing the name of the notebook slightly to align with its purpose, relative to the other 5 demo notebooks. What about 6_sql_to_pydough_examples.ipynb? (We can keep iterating on the name if you aren't a fan of that one).

demos/notebooks/6_pydough_TCPH_guide.ipynb

    
                 "metadata": {},

                 "outputs": [],

                 "source": [

                  "%load_ext pydough.jupyter_extensions\n",

Contributor

knassre-bodo Feb 7, 2025

Move this cell just below the first cell explaining the purpose of the notebook

demos/notebooks/6_pydough_TCPH_guide.ipynb

Comment on lines +279 to +285

    
                {

                 "cell_type": "markdown",

                 "metadata": {},

                 "source": [

                  "regions(): Refers to the regions collection (similar to the regions table in SQL).\n"

                 ]

                },

Contributor

knassre-bodo Feb 7, 2025

This is a bit out of place?

demos/notebooks/6_pydough_TCPH_guide.ipynb

Comment on lines +356 to +357

    
                 "outputs": [

                  {

Contributor

knassre-bodo Feb 7, 2025

Some of the cells have already been run, and others have not. I recommend making it so all of the cells start out as not having been run.
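
One way to reset every cell is to clear the stored outputs directly in the notebook JSON. The following is a minimal sketch, assuming the standard `.ipynb` layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); the `clear_outputs` helper and the in-memory `example` notebook are hypothetical names used only for illustration, not part of this PR.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Reset every code cell of a notebook dict to its un-run state."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []             # drop stored results
            cell["execution_count"] = None   # drop the In[n] counter
    return notebook


# Tiny in-memory notebook standing in for the real .ipynb file (assumption).
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hi\n"]}
            ],
            "source": ["print('hi')\n"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = clear_outputs(example)
print(json.dumps(cleaned["cells"][1]["outputs"]))
```

For a file on disk, the same function could be applied between `json.load` and `json.dump`; markdown cells are left untouched because they have no outputs.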

demos/notebooks/6_pydough_TCPH_guide.ipynb

Comment on lines +274 to +276

    
                  "filter_c= nations(key, name, comment)\n",

                  "\n",

                  "pydough.to_df(filter_c)\n"

Contributor

knassre-bodo Feb 7, 2025

Two things here:

What is filter_c? Can this name be changed to something that makes sense in the context of the question?
Technically the calc is unnecessary; can just do nations as the equivalent of SELECT * (the code you wrote is technically not a SELECT * since it doesn't select the region_key)

demos/notebooks/6_pydough_TCPH_guide.ipynb

Comment on lines +2340 to +2347

    
                  "selected_customers_by_nation = nations(\n",

                  "    region_name=region.name,  \n",

                  "    nation_name=name,  \n",

                  "    max_order_value=MAX(customers.orders.total_price), \n",

                  "    min_order_value=MIN(customers.orders.total_price),  \n",

                  "    order_value_difference=MAX(customers.orders.total_price) - MIN(customers.orders.total_price), \n",

                  "    total_orders=COUNT(customers.orders.total_price)  \n",

                  ")\n",

Contributor

knassre-bodo Feb 7, 2025

Can use variables to help out here (and don't need to do .total_price when doing COUNT if you want an equivalent of COUNT(*)):

order_prices = customers.orders.total_price
smallest_order_value = MIN(order_prices)
largest_order_value = MAX(order_prices)
selected_customers_by_nation = nations(
    region_name=region.name,  
    nation_name=name,  
    max_order_value=largest_order_value,
    min_order_value=smallest_order_value,
    order_value_difference=largest_order_value - smallest_order_value,
    total_orders=COUNT(customers.orders)  
)

demos/notebooks/6_pydough_TCPH_guide.ipynb

    
                 "cell_type": "markdown",

                 "metadata": {},

                 "source": [

                  "### 12. Nations with the Most Customers in a Specific Market Segment\n",

Contributor

knassre-bodo Feb 7, 2025

This one doesn't need PARTITION either. Can rewrite as:

customer_in_target_mktsegment = customers.WHERE(ISIN(mktsegment, ('MACHINERY', 'AUTOMOBILE')))
nation_customer_count = nations(
    nation_name=name, 
    customer_count=COUNT(customer_in_target_mktsegment)  
).ORDER_BY(customer_count.DESC())

demos/notebooks/6_pydough_TCPH_guide.ipynb

    
                 "cell_type": "markdown",

                 "metadata": {},

                 "source": [

                  "### 14. Region with the Highest Percentage of High-Priority Orders\n",

Contributor

knassre-bodo Feb 7, 2025

This one also doesn't need PARTITION since you are aggregating per-region:

high_priority_orders = ISIN(nations.orders.order_priority, ('1-URGENT', '2-HIGH'))
region_priority_summary = regions(
    region_name=name, 
    high_priority_percentage=ROUND(
        (SUM(high_priority_orders) * 100) / COUNT(nations.orders), 2
    ) 
).ORDER_BY(high_priority_percentage.DESC())
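
The `SUM(high_priority_orders)` call above sums a boolean condition directly rather than wrapping it in `IFF(X, 1, 0)`. A rough SQL-level illustration of why the two forms agree, using Python's built-in `sqlite3` module: the flattened `orders` table and its rows are made up for this sketch (they are not the TPC-H schema), `CASE WHEN` stands in for `IFF`, and summing a boolean directly relies on SQLite treating it as 0/1, which other dialects may not allow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, order_priority TEXT);
    INSERT INTO orders VALUES
        ('AMERICA', '1-URGENT'), ('AMERICA', '2-HIGH'),
        ('AMERICA', '3-MEDIUM'), ('AMERICA', '5-LOW'),
        ('ASIA', '1-URGENT'), ('ASIA', '4-NOT SPECIFIED'), ('ASIA', '5-LOW');
""")

# Percentage of high-priority orders per region; {flag} is the summed term.
template = """
    SELECT region, ROUND(SUM({flag}) * 100.0 / COUNT(*), 2) AS pct
    FROM orders
    GROUP BY region
    ORDER BY pct DESC
"""
is_high = "order_priority IN ('1-URGENT', '2-HIGH')"

# Form 1: sum the boolean condition directly.
direct = conn.execute(template.format(flag=is_high)).fetchall()

# Form 2: wrap the condition in an explicit 1/0 conditional first.
wrapped = conn.execute(
    template.format(flag=f"CASE WHEN {is_high} THEN 1 ELSE 0 END")
).fetchall()

print(direct == wrapped)
```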

demos/notebooks/6_pydough_TCPH_guide.ipynb

Comment on lines +2735 to +2750

    
                  "### 15. Customers Who Have Never Placed Orders\n",

                  "\n",

                  "The next situation consists in identifying customers who have not placed any orders. The query retrieves the customer details, such as their customer key and name.\n",

                  "\n",

                  "```SQL\n",

                  "SELECT c.c_custkey, c.c_name\n",

                  "FROM customer c\n",

                  "LEFT JOIN orders o ON c.c_custkey = o.o_custkey\n",

                  "WHERE o.o_orderkey IS NULL;\n"

                 ]

                },

                {

                 "cell_type": "markdown",

                 "metadata": {},

                 "source": [

                  "In PyDough we use the HAS/HASNOT functions in place of the IS NULL check. HAS returns True if at least one record of the sub-collection exists, and HASNOT returns True if no record of the sub-collection exists. So the first step is to filter for the customers who have not placed any orders, using the WHERE clause combined with the HASNOT function. Then we select these customers by retrieving their unique customer_key and customer_name. "

Contributor

knassre-bodo Feb 7, 2025

You can also explain this in terms of EXISTS & NOT EXISTS (which the planner will usually turn into left-joins, under the hood):

SELECT c.c_custkey, c.c_name
FROM customer c
WHERE NOT EXISTS(
  SELECT *
  FROM ORDERS o
  WHERE c.c_custkey = o.o_custkey
)

Under the hood, PyDough turns HAS and HASNOT calls into SEMI and ANTI joins respectively, which then get turned into EXISTS and NOT EXISTS by SQLGlot, depending on the dialect of SQL being used.
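
To check that the `LEFT JOIN ... IS NULL` form from the notebook and the `NOT EXISTS` form above select the same customers, here is a small sketch using Python's built-in `sqlite3` module. The three customers and three orders are invented for the example, and it only exercises SQLite; it says nothing about the SQL that PyDough itself generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER);
    INSERT INTO customer VALUES
        (1, 'Customer#1'), (2, 'Customer#2'), (3, 'Customer#3');
    -- Customer 2 has no orders.
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Anti-join written as a left join that keeps only unmatched customers.
left_join_sql = """
    SELECT c.c_custkey, c.c_name
    FROM customer c
    LEFT JOIN orders o ON c.c_custkey = o.o_custkey
    WHERE o.o_orderkey IS NULL
    ORDER BY c.c_custkey
"""

# The same anti-join written with a correlated NOT EXISTS subquery.
not_exists_sql = """
    SELECT c.c_custkey, c.c_name
    FROM customer c
    WHERE NOT EXISTS (
        SELECT * FROM orders o WHERE c.c_custkey = o.o_custkey
    )
    ORDER BY c.c_custkey
"""

left_join_rows = conn.execute(left_join_sql).fetchall()
not_exists_rows = conn.execute(not_exists_sql).fetchall()
print(left_join_rows == not_exists_rows)
```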

demos/notebooks/6_pydough_TCPH_guide.ipynb

    
                 "source": [

                  "%%pydough\n",

                  "\n",

                  "# Retrieve customers by nation, classifying them into active and inactive based on order history\n",

Contributor

knassre-bodo Feb 7, 2025

The PARTITION and NDISTINCT are both unnecessary here. Can rewrite as:

cust_info = customers(is_active=HAS(orders))
nation_customer_summary = nations(
  nation_name = name,
  total_customers = COUNT(cust_info),
  active_customers = SUM(cust_info.is_active),
  inactive_customers = COUNT(cust_info) - SUM(cust_info.is_active),
).ORDER_BY(total_customers.DESC())

If you want an example where NDISTINCT is required, consider something like this:

# Which 10 customers have purchased the most unique parts?
customers(name, n_parts=NDISTINCT(orders.lines.part.key)).TOP_K(10, by=n_parts.DESC())

Here, we need NDISTINCT because if we used COUNT, we could count a single part more than once, since each customer can have multiple orders/lines, but different lines can share the same part.
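
The difference between the two aggregations can be seen in a few lines of plain Python. This is only an illustration of counting versus distinct counting: the `lines` rows and the `part_counts` helper are hypothetical, and no PyDough call is involved.

```python
from collections import defaultdict

# Hypothetical line items: (customer, order key, part key).
lines = [
    ("alice", 1, "bolt"),
    ("alice", 1, "nut"),
    ("alice", 2, "bolt"),  # same part again, on a different order
    ("bob", 3, "gear"),
]


def part_counts(rows):
    """Return {customer: (COUNT-style total, NDISTINCT-style total)}."""
    seen = defaultdict(list)
    for customer, _order, part in rows:
        seen[customer].append(part)
    # len(parts) counts every line; len(set(parts)) counts each part once.
    return {c: (len(parts), len(set(parts))) for c, parts in seen.items()}


counts = part_counts(lines)
print(counts)
```

The plain count for a customer grows with every repeated purchase of a part, while the distinct count does not.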

If you want another example where PARTITION is required, consider something like this:

# What is the largest amount spent by a single customer in a single year?
# Include the amount, the customer's name, and the year.
cust_year_groups = PARTITION(orders(year=YEAR(order_date)), name="o", by=(name, year))
result = cust_year_groups(name, year, total=SUM(o.total_price)).TOP_K(1, by=total.DESC())
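
The grouping this query needs (one group per customer and year, then the single largest total) can be sketched in plain Python. The sample `orders` rows and the `largest_customer_year` helper are made up for illustration; this shows the logic of the question, not how PyDough evaluates `PARTITION`.

```python
from collections import defaultdict
from datetime import date

# Hypothetical orders: (customer name, order date, total price).
orders = [
    ("alice", date(1995, 3, 1), 100.0),
    ("alice", date(1995, 7, 9), 250.0),
    ("alice", date(1996, 1, 2), 300.0),
    ("bob", date(1995, 5, 5), 320.0),
]


def largest_customer_year(rows):
    """Group by (name, year), sum the prices, and return the top group."""
    totals = defaultdict(float)
    for name, order_date, total_price in rows:
        totals[(name, order_date.year)] += total_price
    (name, year), total = max(totals.items(), key=lambda item: item[1])
    return name, year, total


best = largest_customer_year(orders)
print(best)
```

No single order is the answer here: the winning group only emerges after summing within each (customer, year) pair, which is why a per-collection aggregation is not enough for this question.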

knassre-bodo requested review from hadia206 and vineetg3

February 8, 2025 22:37

vineetg3 reviewed


demos/notebooks/6_pydough_TCPH_guide.ipynb

    
                 "source": [

                  "This query filters customers based on specific conditions related to their account balance, order count, and geographical region. It checks if a customer’s account balance is negative, if they have made at least 5 orders, if their region is \"AMERICA,\" and if they are not from Brazil. The WHERE clause applies all these conditions using & (AND) to ensure that all must be true for a customer to be included in the results.\n",

                  "\n",

                  "PyDough does not yet support the AND, OR, NOT, IN expressions, as well as trying in-between comparisons like (1 < x < 5)"

Contributor

vineetg3 Feb 11, 2025

For now, it is better to say "PyDough does not support the AND, OR, NOT, IN expressions", since we don't have current plans to support them.

Contributor

knassre-bodo Feb 13, 2025

Agreed; it would require a much more aggressive form of Python "hacking" in order to support these syntaxes, since we could not rely on magic methods to do custom overloading of the operators.
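
The limitation comes from Python itself and can be shown with a toy class. In this sketch, `Expr` is a hypothetical stand-in for an expression-building object and is not PyDough's actual implementation; raising in `__bool__` is a choice made here so the failure is visible (without it, `a and b` would silently evaluate to `b`).

```python
class Expr:
    """Toy expression node that records operations as text."""

    def __init__(self, text):
        self.text = text

    # &, | and ~ dispatch to magic methods, so they can build new nodes.
    def __and__(self, other):
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self):
        return Expr(f"NOT {self.text}")

    def __lt__(self, other):
        return Expr(f"{self.text} < {other}")

    def __gt__(self, other):
        return Expr(f"{self.text} > {other}")

    # and / or / not only ever ask for a truth value; there is no hook
    # that lets a class return a new expression object instead.
    def __bool__(self):
        raise TypeError("Expr has no truth value; use &, | and ~")


def raises_type_error(thunk):
    """Return True if calling thunk() raises TypeError."""
    try:
        thunk()
    except TypeError:
        return True
    return False


a, b, x = Expr("a"), Expr("b"), Expr("x")

combined = (a & ~b).text            # overloadable operators work
ranged = ((1 < x) & (x < 5)).text   # explicit form of a range check

and_fails = raises_type_error(lambda: a and b)
not_fails = raises_type_error(lambda: not a)
# 1 < x < 5 expands to (1 < x) and (x < 5), so it hits __bool__ as well.
chain_fails = raises_type_error(lambda: 1 < x < 5)
print(combined, ranged, and_fails, not_fails, chain_fails)
```

The `in` operator has a similar restriction: it calls `__contains__` on the right-hand operand and converts the result to a plain `bool`.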

review-notebook-app bot commented Feb 24, 2025

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB
