Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add time-centric example notebook (#678)
1 parent 4e7199b commit 03becc8
Showing 4 changed files with 363 additions and 11 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,9 @@ | ||
# Examples | ||
# Examples | ||
|
||
|
||
```{toctree} | ||
:hidden: | ||
:maxdepth: 2 | ||
time_centric | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,343 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "5a20a51f", | ||
"metadata": { | ||
"id": "5a20a51f" | ||
}, | ||
"source": [ | ||
"# Time-centric Calculations\n", | ||
"\n", | ||
"Kaskada was built to process and perform temporal calculations on event streams,\n", | ||
"with real-time analytics and machine learning in mind. It is not exclusively for\n", | ||
"real-time applications, but Kaskada excels at time-centric computations and\n", | ||
"aggregations on event-based data.\n", | ||
"\n", | ||
"For example, let's say you're building a user analytics dashboard at an\n", | ||
"ecommerce retailer. You have event streams showing all actions the user has\n", | ||
"taken, and you'd like to include in the dashboard:\n", | ||
"* the total number of events the user has ever generated\n", | ||
"* the total number of purchases the user has made\n", | ||
"* the total revenue from the user\n", | ||
"* the number of purchases made by the user today\n", | ||
"* the total revenue from the user today\n", | ||
"* the number of events the user has generated in the past hour\n", | ||
"\n", | ||
"Because the calculations needed here are a mix of hourly, daily, and over all of\n", | ||
"history, more than one type of event aggregation needs to happen. Table-centric\n", | ||
"tools like those based on SQL would require multiple JOINs and window functions,\n", | ||
"which would be spread over multiple queries or CTEs. \n", | ||
"\n", | ||
"Kaskada was designed for these types of time-centric calculations, so we can do\n", | ||
"each of the calculations in the list in one line:\n", | ||
"\n", | ||
"```python\n", | ||
"record({\n", | ||
" \"event_count_total\": DemoEvents.count(),\n", | ||
" \"purchases_total_count\": DemoEvents.filter(DemoEvents.col(\"event_name\").eq(\"purchase\")).count(),\n", | ||
" \"revenue_total\": DemoEvents.col(\"revenue\").sum(),\n", | ||
" \"purchases_daily\": DemoEvents.filter(DemoEvents.col(\"event_name\").eq(\"purchase\")).count(window=Daily()),\n", | ||
" \"revenue_daily\": DemoEvents.col(\"revenue\").sum(window=Daily()),\n", | ||
" \"event_count_hourly\": DemoEvents.count(window=Hourly()),\n", | ||
"})\n", | ||
"```\n", | ||
"\n", | ||
"```{warning}\n", | ||
"The previous example demonstrates the use of `Daily()` and `Hourly()` windowing which aren't yet part of the new Python library.\n", | ||
"```\n", | ||
"\n", | ||
"Of course, a few more lines of code are needed to put these calculations to work,\n", | ||
"but these six lines are all that is needed to specify the calculations\n", | ||
"themselves. Each line may specify:\n", | ||
"* the name of a calculation (e.g. `event_count_total`)\n", | ||
"* the input data to start with (e.g. `DemoEvents`)\n", | ||
"* selecting event fields (e.g. `DemoEvents.col(\"revenue\")`)\n", | ||
"* function calls (e.g. `count()`)\n", | ||
"* event filtering (e.g. `filter(DemoEvents.col(\"event_name\").eq(\"purchase\"))`)\n", | ||
"* time windows to calculate over (e.g. `window=Daily()`)\n", | ||
"\n", | ||
"...with consecutive steps chained together in a familiar way.\n", | ||
"\n", | ||
"Because Kaskada was built for time-centric calculations on event-based data, a\n", | ||
"calculation we might describe as \"total number of purchase events for the user\"\n", | ||
"can be defined in Kaskada in roughly the same number of terms as the verbal\n", | ||
"description itself.\n", | ||
"\n", | ||
"Continue through the demo below to find out how it works.\n", | ||
"\n", | ||
"See [the Kaskada documentation](../guide/index) for lots more information." | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "BJ2EE9mSGtGB", | ||
"metadata": { | ||
"id": "BJ2EE9mSGtGB" | ||
}, | ||
"source": [ | ||
"## Kaskada Client Setup\n", | ||
"\n", | ||
"```\n", | ||
"%pip install kaskada>=0.6.0-a.1\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "37db47ba", | ||
"metadata": { | ||
"tags": [ | ||
"hide-output" | ||
] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import kaskada as kd\n", | ||
"kd.init_session()" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "5b838eef", | ||
"metadata": {}, | ||
"source": [ | ||
"## Example dataset\n", | ||
"\n", | ||
"For this demo, we'll use a very small example data set, which, for simplicity and portability of this demo notebook, we'll read from a string.\n", | ||
"\n", | ||
"```{note}\n", | ||
"For simplicity, instead of a CSV file or other file format we read and then parse data from a CSV string.\n", | ||
"You can load your own event data from many common sources, including Pandas DataFrames and Parquet files.\n", | ||
"See {py:mod}`kaskada.sources` for more information on the available sources.\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "ba4bb6b6", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# For demo simplicity, instead of a CSV file, we read and then parse data from a\n", | ||
"# CSV string. Kaskadaa\n", | ||
"event_data_string = '''\n", | ||
" event_id,event_at,entity_id,event_name,revenue\n", | ||
" ev_00001,2022-01-01 22:01:00,user_001,login,0\n", | ||
" ev_00002,2022-01-01 22:05:00,user_001,view_item,0\n", | ||
" ev_00003,2022-01-01 22:20:00,user_001,view_item,0\n", | ||
" ev_00004,2022-01-01 23:10:00,user_001,view_item,0\n", | ||
" ev_00005,2022-01-01 23:20:00,user_001,view_item,0\n", | ||
" ev_00006,2022-01-01 23:40:00,user_001,purchase,12.50\n", | ||
" ev_00007,2022-01-01 23:45:00,user_001,view_item,0\n", | ||
" ev_00008,2022-01-01 23:59:00,user_001,view_item,0\n", | ||
" ev_00009,2022-01-02 05:30:00,user_001,login,0\n", | ||
" ev_00010,2022-01-02 05:35:00,user_001,view_item,0\n", | ||
" ev_00011,2022-01-02 05:45:00,user_001,view_item,0\n", | ||
" ev_00012,2022-01-02 06:10:00,user_001,view_item,0\n", | ||
" ev_00013,2022-01-02 06:15:00,user_001,view_item,0\n", | ||
" ev_00014,2022-01-02 06:25:00,user_001,purchase,25\n", | ||
" ev_00015,2022-01-02 06:30:00,user_001,view_item,0\n", | ||
" ev_00016,2022-01-02 06:31:00,user_001,purchase,5.75\n", | ||
" ev_00017,2022-01-02 07:01:00,user_001,view_item,0\n", | ||
" ev_00018,2022-01-01 22:17:00,user_002,view_item,0\n", | ||
" ev_00019,2022-01-01 22:18:00,user_002,view_item,0\n", | ||
" ev_00020,2022-01-01 22:20:00,user_002,view_item,0\n", | ||
"'''\n", | ||
"\n", | ||
"events = kd.sources.CsvString(event_data_string,\n", | ||
" time_column_name='event_at',\n", | ||
" key_column_name = 'entity_id')\n", | ||
"\n", | ||
"# Inspect the event data\n", | ||
"events.preview()" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "568d1272", | ||
"metadata": {}, | ||
"source": [ | ||
"## Define queries and calculations" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "c2c5a298", | ||
"metadata": {}, | ||
"source": [ | ||
"Kaskada query language is parsed by the `fenl` extension. Query calculations are\n", | ||
"defined in a code blocks starting with `%%fenl`.\n", | ||
"\n", | ||
"See [the `fenl`\n", | ||
"documentation](https://kaskada-ai.github.io/docs-site/kaskada/main/fenl/fenl-quick-start.html)\n", | ||
"for more information.\n", | ||
"\n", | ||
"Let's do a simple query for events for a specific entity ID.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "bce22e47", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"events.filter(events.col(\"entity_id\").eq(\"user_002\")).preview()" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "6b5f2725", | ||
"metadata": {}, | ||
"source": [ | ||
"\n", | ||
"Beyond querying for events, Kaskada has a powerful syntax for defining\n", | ||
"calculations on events, temporally across history.\n", | ||
"\n", | ||
"The six calculations discussed at the top of this demo notebook are below.\n", | ||
"\n", | ||
"(Note that some functions return `NaN` if no events for that user have occurred\n", | ||
"within the time window.)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "3ad6d596", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"purchases = events.filter(events.col(\"event_name\").eq(\"purchase\"))\n", | ||
"\n", | ||
"features = kd.record({\n", | ||
" \"event_count_total\": events.count(),\n", | ||
" #\"event_count_hourly\": events.count(window=Hourly()),\n", | ||
" \"purchases_total_count\": purchases.count(),\n", | ||
" #\"purchases_today\": purchases.count(window=Since(Daily()),\n", | ||
" #\"revenue_today\": events.col(\"revenue\").sum(window=Since(Daily())),\n", | ||
" \"revenue_total\": events.col(\"revenue\").sum(),\n", | ||
"})\n", | ||
"features.preview()" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "1c315938", | ||
"metadata": {}, | ||
"source": [ | ||
"## At Any Time\n", | ||
"\n", | ||
"A key feature of Kaskada's time-centric design is the ability to query for\n", | ||
"calculation values at any point in time. Traditional query languages (e.g. SQL)\n", | ||
"can only return data that already exists---if we want to return a row of\n", | ||
"computed/aggregated data, we have to compute the row first, then return it. As a\n", | ||
"specific example, suppose we have SQL queries that produce daily aggregations\n", | ||
"over event data, and now we want to have the same aggregations on an hourly\n", | ||
"basis. In SQL, we would need to write new queries for hourly aggregations; the\n", | ||
"queries would look very similar to the daily ones, but they would still be\n", | ||
"different queries.\n", | ||
"\n", | ||
"With Kaskada, we can define the calculations once, and then specify the points\n", | ||
"in time at which we want to know the calculation values when we query them.\n", | ||
"\n", | ||
"In the examples so far, we have used `preview()` to get a DataFrame containing\n", | ||
"some of the rows from the Timestreams we've defined. By default, this produces\n", | ||
"a _history_ containing all the times the result changed. This is useful for\n", | ||
"using past values to create training examples.\n", | ||
"\n", | ||
"We can also execute the query for the values at a specific point in time." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "082e174d", | ||
"metadata": { | ||
"tags": [ | ||
"hide-output" | ||
] | ||
}, | ||
"source": [ | ||
"```\n", | ||
"features.preview(at=\"2022-01-01 22:00\")\n", | ||
"``````" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "5a44c5f7", | ||
"metadata": {}, | ||
"source": [ | ||
"You can also compose a query that produces values at specific points in time.\n", | ||
"\n", | ||
"```\n", | ||
"features.when(hourly())\n", | ||
"```\n", | ||
"\n", | ||
"Regardless of the time cadence of the calculations themselves, the query output\n", | ||
"can contain rows for whatever time points you specify. You can define a set of\n", | ||
"daily calculations and then get hourly updates during the day. Or, you can\n", | ||
"publish the definitions of some features in a Python module and different users\n", | ||
"can query those same calculations for hourly, daily, and monthly\n", | ||
"values---without editing the calculation definitions themselves.\n", | ||
"\n", | ||
"## Adding more calculations to the query\n", | ||
"\n", | ||
"We can add two new calculations, also in one line each, representing:\n", | ||
"* the time of the user's first event\n", | ||
"* the time of the user's last event\n" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "2ba09e77-0fdf-43f4-960b-50a126262ec7", | ||
"metadata": { | ||
"id": "2ba09e77-0fdf-43f4-960b-50a126262ec7" | ||
}, | ||
"source": [ | ||
"This is only a small sample of possible Kaskada queries and capabilities. See\n", | ||
"everything that's possible with [Timestreams](../reference/timestream/index.md)." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"colab": { | ||
"collapsed_sections": [ | ||
"6924ca3e-28b3-4f93-b0cf-5f8afddc11d8", | ||
"936700a9-e042-401c-9156-7bb18652e109", | ||
"08f5921d-36dc-41d1-a2a6-ae800b7a11de" | ||
], | ||
"private_outputs": true, | ||
"provenance": [] | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.4" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
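The three non-windowed calculations in the notebook above (`event_count_total`, `purchases_total_count`, `revenue_total`) can be cross-checked without Kaskada. The following is a minimal standard-library sketch, not Kaskada's implementation: the `running_totals` helper and the shortened `CSV_TEXT` sample are hypothetical names introduced here for illustration. It also reports `0` where Kaskada may report a null or `NaN` for an entity with no matching events.

```python
import csv
import io
from collections import defaultdict

# Shortened subset of the notebook's CSV data (assumption: five rows are
# enough to illustrate the running aggregates).
CSV_TEXT = """event_id,event_at,entity_id,event_name,revenue
ev_00001,2022-01-01 22:01:00,user_001,login,0
ev_00006,2022-01-01 23:40:00,user_001,purchase,12.50
ev_00014,2022-01-02 06:25:00,user_001,purchase,25
ev_00016,2022-01-02 06:31:00,user_001,purchase,5.75
ev_00018,2022-01-01 22:17:00,user_002,view_item,0
"""


def running_totals(csv_text):
    """Return one row per event carrying per-entity running aggregates."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # "YYYY-MM-DD HH:MM:SS" timestamps sort chronologically as plain strings.
    rows.sort(key=lambda r: r["event_at"])

    state = defaultdict(lambda: {"events": 0, "purchases": 0, "revenue": 0.0})
    history = []
    for row in rows:
        totals = state[row["entity_id"]]
        totals["events"] += 1
        if row["event_name"] == "purchase":
            totals["purchases"] += 1
        totals["revenue"] += float(row["revenue"])
        history.append({
            "time": row["event_at"],
            "entity_id": row["entity_id"],
            "event_count_total": totals["events"],
            "purchases_total_count": totals["purchases"],
            "revenue_total": totals["revenue"],
        })
    return history


history = running_totals(CSV_TEXT)
# The last row seen for each entity holds its most recent values.
latest = {row["entity_id"]: row for row in history}
```

Here `history` plays the role of the change history that `preview()` returns by default, and `latest` corresponds to asking for the values at the end of the data. The hand-written state tracking is what the one-line Kaskada definitions replace.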
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,15 +1,16 @@ | ||
# Sources | ||
|
||
```{eval-rst} | ||
.. currentmodule:: kaskada.sources | ||
.. autosummary:: | ||
:toctree: apidocs/sources | ||
.. automodule:: kaskada.sources | ||
Source | ||
CsvString | ||
JsonlString | ||
Pandas | ||
Parquet | ||
PyList | ||
.. autosummary:: | ||
:toctree: apidocs/sources | ||
Source | ||
CsvString | ||
JsonlString | ||
Pandas | ||
Parquet | ||
PyList | ||
``` |