Add time-centric example notebook (#678)
bjchambers authored Aug 21, 2023
1 parent 4e7199b commit 03becc8
Showing 4 changed files with 363 additions and 11 deletions.
10 changes: 9 additions & 1 deletion python/docs/source/examples/index.md
@@ -1 +1,9 @@
# Examples
# Examples


```{toctree}
:hidden:
:maxdepth: 2
time_centric
```
343 changes: 343 additions & 0 deletions python/docs/source/examples/time_centric.ipynb
@@ -0,0 +1,343 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "5a20a51f",
"metadata": {
"id": "5a20a51f"
},
"source": [
"# Time-centric Calculations\n",
"\n",
"Kaskada was built to process and perform temporal calculations on event streams,\n",
"with real-time analytics and machine learning in mind. It is not exclusively for\n",
"real-time applications, but Kaskada excels at time-centric computations and\n",
"aggregations on event-based data.\n",
"\n",
"For example, let's say you're building a user analytics dashboard at an\n",
"ecommerce retailer. You have event streams showing all actions the user has\n",
"taken, and you'd like to include in the dashboard:\n",
"* the total number of events the user has ever generated\n",
"* the total number of purchases the user has made\n",
"* the total revenue from the user\n",
"* the number of purchases made by the user today\n",
"* the total revenue from the user today\n",
"* the number of events the user has generated in the past hour\n",
"\n",
"Because the calculations needed here are a mix of hourly, daily, and over all of\n",
"history, more than one type of event aggregation needs to happen. Table-centric\n",
"tools like those based on SQL would require multiple JOINs and window functions,\n",
"which would be spread over multiple queries or CTEs. \n",
"\n",
"Kaskada was designed for these types of time-centric calculations, so we can do\n",
"each of the calculations in the list in one line:\n",
"\n",
"```python\n",
"record({\n",
" \"event_count_total\": DemoEvents.count(),\n",
" \"purchases_total_count\": DemoEvents.filter(DemoEvents.col(\"event_name\").eq(\"purchase\")).count(),\n",
" \"revenue_total\": DemoEvents.col(\"revenue\").sum(),\n",
" \"purchases_daily\": DemoEvents.filter(DemoEvents.col(\"event_name\").eq(\"purchase\")).count(window=Daily()),\n",
" \"revenue_daily\": DemoEvents.col(\"revenue\").sum(window=Daily()),\n",
" \"event_count_hourly\": DemoEvents.count(window=Hourly()),\n",
"})\n",
"```\n",
"\n",
"```{warning}\n",
"The previous example demonstrates the use of `Daily()` and `Hourly()` windowing which aren't yet part of the new Python library.\n",
"```\n",
"\n",
"Of course, a few more lines of code are needed to put these calculations to work,\n",
"but these six lines are all that is needed to specify the calculations\n",
"themselves. Each line may specify:\n",
"* the name of a calculation (e.g. `event_count_total`)\n",
"* the input data to start with (e.g. `DemoEvents`)\n",
"* selecting event fields (e.g. `DemoEvents.col(\"revenue\")`)\n",
"* function calls (e.g. `count()`)\n",
"* event filtering (e.g. `filter(DemoEvents.col(\"event_name\").eq(\"purchase\"))`)\n",
"* time windows to calculate over (e.g. `window=Daily()`)\n",
"\n",
"...with consecutive steps chained together in a familiar way.\n",
"\n",
"Because Kaskada was built for time-centric calculations on event-based data, a\n",
"calculation we might describe as \"total number of purchase events for the user\"\n",
"can be defined in Kaskada in roughly the same number of terms as the verbal\n",
"description itself.\n",
"\n",
"Continue through the demo below to find out how it works.\n",
"\n",
"See [the Kaskada documentation](../guide/index) for lots more information."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "BJ2EE9mSGtGB",
"metadata": {
"id": "BJ2EE9mSGtGB"
},
"source": [
"## Kaskada Client Setup\n",
"\n",
"```\n",
"%pip install kaskada>=0.6.0-a.1\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37db47ba",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [],
"source": [
"import kaskada as kd\n",
"kd.init_session()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "5b838eef",
"metadata": {},
"source": [
"## Example dataset\n",
"\n",
"For this demo, we'll use a very small example data set, which, for simplicity and portability of this demo notebook, we'll read from a string.\n",
"\n",
"```{note}\n",
"For simplicity, instead of a CSV file or other file format we read and then parse data from a CSV string.\n",
"You can load your own event data from many common sources, including Pandas DataFrames and Parquet files.\n",
"See {py:mod}`kaskada.sources` for more information on the available sources.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba4bb6b6",
"metadata": {},
"outputs": [],
"source": [
"# For demo simplicity, instead of a CSV file, we read and then parse data from a\n",
"# CSV string. Kaskadaa\n",
"event_data_string = '''\n",
" event_id,event_at,entity_id,event_name,revenue\n",
" ev_00001,2022-01-01 22:01:00,user_001,login,0\n",
" ev_00002,2022-01-01 22:05:00,user_001,view_item,0\n",
" ev_00003,2022-01-01 22:20:00,user_001,view_item,0\n",
" ev_00004,2022-01-01 23:10:00,user_001,view_item,0\n",
" ev_00005,2022-01-01 23:20:00,user_001,view_item,0\n",
" ev_00006,2022-01-01 23:40:00,user_001,purchase,12.50\n",
" ev_00007,2022-01-01 23:45:00,user_001,view_item,0\n",
" ev_00008,2022-01-01 23:59:00,user_001,view_item,0\n",
" ev_00009,2022-01-02 05:30:00,user_001,login,0\n",
" ev_00010,2022-01-02 05:35:00,user_001,view_item,0\n",
" ev_00011,2022-01-02 05:45:00,user_001,view_item,0\n",
" ev_00012,2022-01-02 06:10:00,user_001,view_item,0\n",
" ev_00013,2022-01-02 06:15:00,user_001,view_item,0\n",
" ev_00014,2022-01-02 06:25:00,user_001,purchase,25\n",
" ev_00015,2022-01-02 06:30:00,user_001,view_item,0\n",
" ev_00016,2022-01-02 06:31:00,user_001,purchase,5.75\n",
" ev_00017,2022-01-02 07:01:00,user_001,view_item,0\n",
" ev_00018,2022-01-01 22:17:00,user_002,view_item,0\n",
" ev_00019,2022-01-01 22:18:00,user_002,view_item,0\n",
" ev_00020,2022-01-01 22:20:00,user_002,view_item,0\n",
"'''\n",
"\n",
"events = kd.sources.CsvString(event_data_string,\n",
" time_column_name='event_at',\n",
" key_column_name = 'entity_id')\n",
"\n",
"# Inspect the event data\n",
"events.preview()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "568d1272",
"metadata": {},
"source": [
"## Define queries and calculations"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c2c5a298",
"metadata": {},
"source": [
"Kaskada query language is parsed by the `fenl` extension. Query calculations are\n",
"defined in a code blocks starting with `%%fenl`.\n",
"\n",
"See [the `fenl`\n",
"documentation](https://kaskada-ai.github.io/docs-site/kaskada/main/fenl/fenl-quick-start.html)\n",
"for more information.\n",
"\n",
"Let's do a simple query for events for a specific entity ID.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bce22e47",
"metadata": {},
"outputs": [],
"source": [
"events.filter(events.col(\"entity_id\").eq(\"user_002\")).preview()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6b5f2725",
"metadata": {},
"source": [
"\n",
"Beyond querying for events, Kaskada has a powerful syntax for defining\n",
"calculations on events, temporally across history.\n",
"\n",
"The six calculations discussed at the top of this demo notebook are below.\n",
"\n",
"(Note that some functions return `NaN` if no events for that user have occurred\n",
"within the time window.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3ad6d596",
"metadata": {},
"outputs": [],
"source": [
"purchases = events.filter(events.col(\"event_name\").eq(\"purchase\"))\n",
"\n",
"features = kd.record({\n",
" \"event_count_total\": events.count(),\n",
" #\"event_count_hourly\": events.count(window=Hourly()),\n",
" \"purchases_total_count\": purchases.count(),\n",
" #\"purchases_today\": purchases.count(window=Since(Daily()),\n",
" #\"revenue_today\": events.col(\"revenue\").sum(window=Since(Daily())),\n",
" \"revenue_total\": events.col(\"revenue\").sum(),\n",
"})\n",
"features.preview()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1c315938",
"metadata": {},
"source": [
"## At Any Time\n",
"\n",
"A key feature of Kaskada's time-centric design is the ability to query for\n",
"calculation values at any point in time. Traditional query languages (e.g. SQL)\n",
"can only return data that already exists---if we want to return a row of\n",
"computed/aggregated data, we have to compute the row first, then return it. As a\n",
"specific example, suppose we have SQL queries that produce daily aggregations\n",
"over event data, and now we want to have the same aggregations on an hourly\n",
"basis. In SQL, we would need to write new queries for hourly aggregations; the\n",
"queries would look very similar to the daily ones, but they would still be\n",
"different queries.\n",
"\n",
"With Kaskada, we can define the calculations once, and then specify the points\n",
"in time at which we want to know the calculation values when we query them.\n",
"\n",
"In the examples so far, we have used `preview()` to get a DataFrame containing\n",
"some of the rows from the Timestreams we've defined. By default, this produces\n",
"a _history_ containing all the times the result changed. This is useful for\n",
"using past values to create training examples.\n",
"\n",
"We can also execute the query for the values at a specific point in time."
]
},
{
"cell_type": "markdown",
"id": "082e174d",
"metadata": {
"tags": [
"hide-output"
]
},
"source": [
"```\n",
"features.preview(at=\"2022-01-01 22:00\")\n",
"``````"
]
},
{
"cell_type": "markdown",
"id": "5a44c5f7",
"metadata": {},
"source": [
"You can also compose a query that produces values at specific points in time.\n",
"\n",
"```\n",
"features.when(hourly())\n",
"```\n",
"\n",
"Regardless of the time cadence of the calculations themselves, the query output\n",
"can contain rows for whatever time points you specify. You can define a set of\n",
"daily calculations and then get hourly updates during the day. Or, you can\n",
"publish the definitions of some features in a Python module and different users\n",
"can query those same calculations for hourly, daily, and monthly\n",
"values---without editing the calculation definitions themselves.\n",
"\n",
"## Adding more calculations to the query\n",
"\n",
"We can add two new calculations, also in one line each, representing:\n",
"* the time of the user's first event\n",
"* the time of the user's last event\n"
]
},
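{
"attachments": {},
"cell_type": "markdown",
"id": "7f2a91c3",
"metadata": {},
"source": [
"Below is a minimal sketch of those two calculations. It assumes that `first()` and\n",
"`last()` aggregations are available on Timestreams in the new Python library; see\n",
"the [Timestream reference](../reference/timestream/index.md) for the exact names.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e3b72d4",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: `first()` and `last()` are assumed aggregation methods here.\n",
"first_last = kd.record({\n",
"    \"first_event_at\": events.col(\"event_at\").first(),\n",
"    \"last_event_at\": events.col(\"event_at\").last(),\n",
"})\n",
"first_last.preview()"
]
},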
{
"attachments": {},
"cell_type": "markdown",
"id": "2ba09e77-0fdf-43f4-960b-50a126262ec7",
"metadata": {
"id": "2ba09e77-0fdf-43f4-960b-50a126262ec7"
},
"source": [
"This is only a small sample of possible Kaskada queries and capabilities. See\n",
"everything that's possible with [Timestreams](../reference/timestream/index.md)."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"6924ca3e-28b3-4f93-b0cf-5f8afddc11d8",
"936700a9-e042-401c-9156-7bb18652e109",
"08f5921d-36dc-41d1-a2a6-ae800b7a11de"
],
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
19 changes: 10 additions & 9 deletions python/docs/source/reference/sources.md
@@ -1,15 +1,16 @@
# Sources

```{eval-rst}
.. currentmodule:: kaskada.sources
.. autosummary::
:toctree: apidocs/sources
.. automodule:: kaskada.sources
Source
CsvString
JsonlString
Pandas
Parquet
PyList
.. autosummary::
:toctree: apidocs/sources
Source
CsvString
JsonlString
Pandas
Parquet
PyList
```
2 changes: 1 addition & 1 deletion python/noxfile.py
@@ -135,7 +135,7 @@ def docs_build(session: nox.Session) -> None:
@nox.session(python=python_versions[0])
def docs(session: nox.Session) -> None:
"""Build and serve the documentation with live reloading on file changes."""
args = ["--open-browser", "docs/source", "docs/_build", "-j", "auto", "--ignore", "docs/source/apidocs/**", "--ignore", "*/api/*"]
args = ["--open-browser", "docs/source", "docs/_build", "-j", "auto", "--ignore", "docs/source/reference/apidocs/**", "--ignore", "*/api/*"]
install(session, groups=["typecheck", "docs"])

build_dir = Path("docs", "_build")
