Skip to content

Commit

Permalink
Merge branch 'kian/corrleated_backref' into kian/correlated_backref_2
Browse files Browse the repository at this point in the history
  • Loading branch information
knassre-bodo committed Feb 6, 2025
2 parents 8184a7e + 746eed5 commit 4818c59
Show file tree
Hide file tree
Showing 34 changed files with 931 additions and 57 deletions.
96 changes: 83 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,31 +2,101 @@

PyDough is an alternative DSL that can be used to solve analytical problems by phrasing questions in terms of a logical document model instead of translating to relational SQL logic.

## What Is PyDough

PyDough allows expressing analytical questions with hierarchical thinking, as seen in models such as [MongoDB](https://www.mongodb.com/docs/manual/data-modeling/), since that mental model is closer to human linguistics than a relational model.
Unlike MongoDB, PyDough only uses a logical document model for abstractly explaining & interacting with data, rather than a physical document model to store the data.
PyDough code can be written in and interleaved with Python code, and practices a lazy evaluation scheme that does not qualify or execute any logic until requested.
PyDough executes by translating its logic into SQL which it can directly executing in an arbitrary database.

Consider the following information represented by the tables in a database:
- There are people; each person has a name, ssn, birth date, records of jobs they have had, and records of schools they have attended.
- There are employment records; each job record has the ssn of the person being employed, the name of the company, and the total income they made from the job.
- There are education records; each education record has the ssn of the person attending the school, the name of the school, and the total tuition they paid to that school.

Suppose I want to know for every person their name & the total income they've made from all jobs minus the total tuition paid to all schools. However, I want to include people who have never had a job or never attended any schools, and I need to account for people who could have had multiple jobs or attended multiple schools.
The following PyDough snippet solves this problem:

```py
result = People(
name,
net_income = SUM(jobs.income_earned) - SUM(schools.tuition_paid)
)
pydough.to_df(result)
```

However, if answering the question with SQL, I would need to write the following less-intuitive SQL query:

```sql
SELECT
P.name AS name,
COALESCE(T1.total_income_earned, 0) - COALESCE(T2.total_tuition_paid, 0) AS net_income
FROM PEOPLE AS P
LEFT JOIN (
SELECT person_ssn, SUM(income_earned) AS total_income_earned
FROM EMPLOYMENT_RECORDS
GROUP BY person_ssn
) AS J
ON P.ssn = J.person_ssn
LEFT JOIN (
SELECT person_ssn, SUM(tuition_paid) AS total_tuition_paid
FROM EDUCATION_RECORDS
) AS S
ON P.ssn = S.person_ssn
```

Internally, PyDough solves the question by translating the much simpler logical document model logic into SQL, which can be directly executed on a database. Even if the same SQL is generated by PyDough as the example above, all a user needs to worry about is writing the much smaller PyDough code snippet in Python.

Currently, the main mechanism to execute PyDough code is via Jupyter notebooks with a special cell magic. See the usage guide and demo notebooks for more details.

## Why Build PyDough?

PyDough as a DSL has several benefits over other solutions, both for human use and LLM generation:
- ORMs still require understanding & writing SQL, including dealing directly with joins. If a human or AI is bad at writing SQL, they will be just as bad at writing ORM-based code. PyDough, on the other hand, abstracts away joins in favor of thinking about logical relationships between collections & sub-collections.
- The complex semantics of aggregation keys, different types of joins, and aggregating before vs after joining are all abstracted away by PyDough. These details require much deeper understanding of SQL semantics than most have time to learn how to do correctly, meaning that PyDough can have a lower learning curve to write correct code for complex questions.
- When a question is being asked, the PyDough code to answer it will look more similar to the text of the question than the SQL text would. This makes LLM generation of PyDough code simpler since there is a stronger correlation between a question asked and the PyDough code to answer it.
- Often, PyDough code will be significantly more compact than equivalent SQL text, and therefore easier for a human to verify for logical correctness.
- PyDough is portable between various database execution solutions, so you are not locked into one data storage solution while using PyDough.

## Learning About PyDough

Refer to these documents to learn how to use PyDough:

- [Spec for the PyDough DSL](documentation/dsl.md)
- [Spec for the PyDough metadata](documentation/metadata.md)
- [List of builtin PyDough functions](documentation/functions.md)
- [Usage guide for PyDough](documentation/usage.md)
- [Spec for the PyDough DSL](https://github.com/bodo-ai/PyDough/blob/main/documentation/dsl.md)
- [Spec for the PyDough metadata](https://github.com/bodo-ai/PyDough/blob/main/documentation/metadata.md)
- [List of builtin PyDough functions](https://github.com/bodo-ai/PyDough/blob/main/documentation/functions.md)
- [Usage guide for PyDough](https://github.com/bodo-ai/PyDough/blob/main/documentation/usage.md)

## Installing or Developing PyDough

PyDough releases are [available on PyPI](https://pypi.org/project/pydough/) and can be installed via pip:

```
pip install pydough
```

For local development, PyDough uses `uv` as a package manager.
Please refer to their docs for [installation](https://docs.astral.sh/uv/getting-started/).

## Developing PyDough
PyDough uses `uv` as a package manager. Please refer to their docs for
[installation](https://docs.astral.sh/uv/getting-started/). To run testing
commands after installing `uv`, run the following command:

To run testing commands after installing `uv`, run the following command:

```bash
uv run pytest <pytest_arguments>
```

If you want to skip tests that execute runtime results because they are slower, make sure to include `-m "not slow"` in the pytest arguments.
If you want to skip tests that execute runtime results because they are slower,
make sure to include `-m "not execute"` in the pytest arguments.

Note: That some tests may require an additional setup to run successfully.
Please refer to the TPC-H demo directory for more information on how to setup
a default database for testing.
Note: some tests may require an additional setup to run successfully.
The [demos](https://github.com/bodo-ai/PyDough/blob/main/demos/README.md) directory
contains more information on how to setup the TPC-H sqlite database. For
testing, the `tpch.db` file must be located in the `tests` directory.
Additionally, the [`setup_defog.sh`](https://github.com/bodo-ai/PyDough/blob/main/tests/setup_defog.sh)
script must be run so that the `defog.db` file is located in the `tests` directory.

## Running CI Tests

To run our CI tests on your PR, you must include the flag `[run CI]` in latest
commit message.

Expand All @@ -43,4 +113,4 @@ The full list of dependencies can be found in the `pyproject.toml` file.

The `demo` folder contains a series of example Jupyter Notebooks
that can be used to understand PyDough's capabilities. We recommend any new user start
with the [demo readme](demos/README.md) and then walk through the example Juypter notebooks.
with the [demo readme](https://github.com/bodo-ai/PyDough/blob/main/demos/README.md) and then walk through the example Juypter notebooks.
12 changes: 11 additions & 1 deletion demos/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,16 @@ PyDough simplicity will also be advantageous for LLM analytics generation.

The `notebooks` folder provides a series of Demos that one can engage with to understand PyDough's
capabilities and ultimately the experience of using PyDough. We strongly recommend starting with
[Introduction.ipynb](notebooks/Introduction.ipynb), as this notebook is intended to explain a high-level overview of the components
[1_introduction.ipynb](notebooks/1_introduction.ipynb), as this notebook is intended to explain a high-level overview of the components
of a PyDough notebook and what each step is doing. This notebook will then also explain how each
subsequent demo notebook is structured to provide insight into some aspect of PyDough.

> [!IMPORTANT]
> Before running any notebooks, you will need to run the script to set up the TPC-H database as a local sqlite database file. You must ensure the file is located in the root directory of PyDough and is named `tpch.db`. If using `bash`, this means running the following command from the root directory of PyDough: `bash demos/setup_tpch.sh tpch.db`
Once the introduction notebook is complete, you can explore the other notebooks:
- [2_pydough_operations.ipynb](notebooks/2_pydough_operations.ipynb) demonstrates all the core operations in PyDough
- [3_exploration.ipynb](notebooks/3_exploration.ipynb) shows how to use several user APIs for exploring PyDough.
- [4_tpch.ipynb](notebooks/4_tpch.ipynb) provides PyDough translations for most of the TPC-H benchmark queries.
- [5_what_if.ipynb](notebooks/5_what_if.ipynb) demonstrates how to do WHAT-IF analysis with PyDough.

Original file line number Diff line number Diff line change
Expand Up @@ -259,10 +259,10 @@
"# Additional Notebooks\n",
"\n",
"This notebook is intended as a simple introduction into pydough and it runs in a Jupyter notebook. To get a better understanding of PyDough's most impactful features, we have the following additional notebooks:\n",
"* [pydough_operations](./pydough_operations.ipynb): Provides a detailed overview of many of the core operations you can currently perform in PyDough, include some best practices and possible limitations.\n",
"* [basic_example](./basic_example.ipynb): Provides a walk-through of how you can leverage PyDough's contextless expressions to build the solution to an analytics business question from its components and ultimately iterate on that solution.\n",
"* [Exploration](./Exploration.ipynb): Explores our provided TPC-H metadata to help highlight some of the key metadata features and to teach users how to interact with the metadata.\n",
"* [tpch](./tpch.ipynb): Compares the SQL queries used in the TPC-H benchmarks to equivalent statements written in PyDough.\n"
"* [2_pydough_operations](./2_pydough_operations.ipynb): Provides a detailed overview of many of the core operations you can currently perform in PyDough, include some best practices and possible limitations.\n",
"* [3_exploration](./3_exploration.ipynb): Explores our provided TPC-H metadata to help highlight some of the key metadata features and to teach users how to interact with the metadata.\n",
"* [4_tpch](./4_tpch.ipynb): Compares the SQL queries used in the TPC-H benchmarks to equivalent statements written in PyDough.\n",
"* [5_what_if](./5_what_if.ipynb): Shows how to do WHAT-IF analysis with PyDough.\n"
]
},
{
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -232,7 +232,7 @@
"\n",
"# Datetime Operations\n",
"# Note: Since this is based on SQL lite the underlying date is a bit strange.\n",
"print(pydough.to_sql(lines(YEAR(ship_date), MONTH(ship_date), DAY(ship_date))))\n",
"print(pydough.to_sql(lines(YEAR(ship_date), MONTH(ship_date), DAY(ship_date),HOUR(ship_date),MINUTE(ship_date),SECOND(ship_date))))\n",
"\n",
"# Aggregation operations\n",
"print(pydough.to_sql(TPCH(NDISTINCT(nations.comment), SUM(nations.key))))\n",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
"id": "f1fd4015-5281-4f56-bf88-5db4a8de93e0",
"metadata": {},
"source": [
"# Introduction\n",
"# Exploration\n",
"\n",
"This notebook is intended to explain various APIs available in PyDough to explain and explore the PyDough metadata and PyDough logical operations."
]
Expand Down
File renamed without changes.
File renamed without changes.
57 changes: 55 additions & 2 deletions documentation/functions.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,9 @@ Below is the list of every function/operator currently supported in PyDough as a
* [YEAR](#year)
* [MONTH](#month)
* [DAY](#day)
* [HOUR](#hour)
* [MINUTE](#minute)
* [SECOND](#second)
- [Conditional Functions](#conditional-functions)
* [IFF](#iff)
* [ISIN](#isin)
Expand All @@ -36,6 +39,8 @@ Below is the list of every function/operator currently supported in PyDough as a
- [Numerical Functions](#numerical-functions)
* [ABS](#abs)
* [ROUND](#round)
* [POWER](#power)
* [SQRT](#sqrt)
- [Aggregation Functions](#aggregation-functions)
* [SUM](#sum)
* [AVG](#avg)
Expand All @@ -59,10 +64,10 @@ Below is each binary operator currently supported in PyDough.
<!-- TOC --><a name="arithmetic"></a>
### Arithmetic

Supported mathematical operations: addition (`+`), subtraction (`-`), multiplication (`*`), division (`/`).
Supported mathematical operations: addition (`+`), subtraction (`-`), multiplication (`*`), division (`/`), exponentiation (`**`).

```py
Lineitems(value = (extended_price * (1 - discount) + 1.0) / part.retail_price)
Lineitems(value = (extended_price * (1 - (discount ** 2)) + 1.0) / part.retail_price)
```

> [!WARNING]
Expand Down Expand Up @@ -255,6 +260,36 @@ Calling `DAY` on a date/timestamp extracts the day of the month it belongs to:
Orders(is_first_of_month = DAY(order_date) == 1)
```

<!-- TOC --><a name="hour"></a>
### HOUR

Calling `HOUR` on a date/timestamp extracts the hour it belongs to. The range of output
is from 0-23:

```py
Orders(is_12pm = HOUR(order_date) == 12)
```

<!-- TOC --><a name="minute"></a>
### MINUTE

Calling `MINUTE` on a date/timestamp extracts the minute. The range of output
is from 0-59:

```py
Orders(is_half_hour = MINUTE(order_date) == 30)
```

<!-- TOC --><a name="second"></a>
### SECOND

Calling `SECOND` on a date/timestamp extracts the second. The range of output
is from 0-59:

```py
Orders(is_lt_30_seconds = SECOND(order_date) < 30)
```

<!-- TOC --><a name="conditional-functions"></a>
## Conditional Functions

Expand Down Expand Up @@ -349,6 +384,24 @@ The `ROUND` function rounds its first argument to the precision of its second ar
Parts(rounded_price = ROUND(retail_price, 1))
```

<!-- TOC --><a name="power"></a>
### POWER

The `POWER` function exponentiates its first argument to the power of its second argument.

```py
Parts(powered_price = POWER(retail_price, 2))
```

<!-- TOC --><a name="sqrt"></a>
### SQRT

The `SQRT` function takes the square root of its input. It's equivalent to `POWER(x,0.5)`.

```py
Parts(sqrt_price = SQRT(retail_price))
```

<!-- TOC --><a name="aggregation-functions"></a>
## Aggregation Functions

Expand Down
59 changes: 56 additions & 3 deletions documentation/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,13 +17,14 @@ This document describes how to set up & interact with PyDough. For instructions
* [`pydough.explain_structure`](#pydoughexplain_structure)
* [`pydough.explain`](#pydoughexplain)
* [`pydough.explain_term`](#pydoughexplain_term)
- [Logging] (#logging)

<!-- TOC end -->

<!-- TOC --><a name="setting-up-in-jupyter-notebooks"></a>
## Setting Up in Jupyter Notebooks

Once you have uv set up, you can run the command `uv run jupyter lab` from the PyDough directory to boot up a Jupyter lab process that will have access to PyDough.
Once you have uv set up, you can run the command `uv run jupyter lab` from the PyDough directory to boot up a Jupyter lab process that will have access to PyDough. If you have installed PyDough via pip, you should be able to directly boot up a Jupyter lab process and it will have access to the PyDough module.

Once JupyterLab is running, you can either navigate to an existing notebook file or create a new one. In that notebook file, follow these steps in the notebook cells to work with PyDough:

Expand Down Expand Up @@ -265,6 +266,7 @@ The `to_df` API does all the same steps as the [`to_sql` API](#pydoughto_sql), b
- `metadata`: the PyDough knowledge graph to use for the conversion (if omitted, `pydough.active_session.metadata` is used instead).
- `config`: the PyDough configuration settings to use for the conversion (if omitted, `pydough.active_session.config` is used instead).
- `database`: the database context to use for the conversion (if omitted, `pydough.active_session.database` is used instead). The database context matters because it controls which SQL dialect is used for the translation.
- `display_sql`: displays the sql before executing in a logger.

Below is an example of using `pydough.to_df` and the output, attached to a sqlite database containing data for the TPC-H schema:

Expand Down Expand Up @@ -314,14 +316,14 @@ pydough.to_df(result)
</table>
</div>

See the [demo notebooks](../demos/notebooks/Introduction.ipynb) for more instances of how to use the `to_df` API.
See the [demo notebooks](../demos/notebooks/1_introduction.ipynb) for more instances of how to use the `to_df` API.

<!-- TOC --><a name="exploration-apis"></a>
## Exploration APIs

This sections describes various APIs you can use to explore PyDough code and figure out what each component is doing without having PyDough fully evaluate it.

See the [demo notebooks](../demos/notebooks/Exploration.ipynb) for more instances of how to use the exploration APIs.
See the [demo notebooks](../demos/notebooks/2_exploration.ipynb) for more instances of how to use the exploration APIs.

<!-- TOC --><a name="pydoughexplain_structure"></a>
### `pydough.explain_structure`
Expand Down Expand Up @@ -730,3 +732,54 @@ This term is singular with regards to the collection, meaning it can be placed i
For example, the following is valid:
TPCH.nations.WHERE(region.name == 'EUROPE')(AVG(customers.acctbal))
```
## Logging

Logging is enabled and set to INFO level by default. We can change the log level by setting the environment variable `PYDOUGH_LOG_LEVEL` to the standard levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.

A new `logger` object can be created using `get_logger`.
This function configures and returns a logger instance. It takes the following arguments:

- `name` : The logger's name, typically the module name (`__name__`).
- `default_level` : The default logging level if not set externally via environment variable `PYDOUGH_LOG_LEVEL`. Defaults to `logging.INFO`.
- `fmt` : An optional log message format compatible with Python's logging. The default format is `"%(asctime)s [%(levelname)s] %(name)s: %(message)s"`.
- `handlers` : An optional list of logging handlers to attach to the logger.

It returns a configured `logging.Logger` instance.
Here is an example of basic usage. We have not set the environment variable, hence the default level of logging is INFO.

```py
from pydough import get_logger
pyd_logger = get_logger(__name__)

logger.info("This is an info message.")
logger.error("This is an error message.")
```

We can also set the level of logging via a function argument. Note that if `PYDOUGH_LOG_LEVEL` is available, the default_level argument is overriden.

```python
# Import the function
from pydough import get_logger

# Get logger with a custom name and level
logger = get_logger(name="custom_logger", default_level=logging.DEBUG)

# Log messages
logger.debug("This is a debug message.")
logger.warning("This is a warning message.")
```
We can also attach other handlers in addition to the default handler(`logging.StreamHandler(sys.stdout)`), by sending a list of handlers.

```python
import logging
from pydough import get_logger

# Create a file handler
file_handler = logging.FileHandler("logfile.log")

# Get logger with custom file handler
logger = get_logger(handlers=[file_handler])

# Log messages
logger.info("This message will go to the console and the file.")
```
2 changes: 2 additions & 0 deletions pydough/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,11 +12,13 @@
"parse_json_metadata_from_file",
"to_df",
"to_sql",
"get_logger"
]

from .configs import PyDoughSession
from .evaluation import to_df, to_sql
from .exploration import explain, explain_structure, explain_term
from .logger import get_logger
from .metadata import parse_json_metadata_from_file
from .unqualified import display_raw, init_pydough_context

Expand Down
Loading

0 comments on commit 4818c59

Please sign in to comment.