docs: Stub more doc content (#668)
Pulling in stuff from existing docs and past drafts.
bjchambers authored Aug 17, 2023
1 parent f6f597a commit 1fb920f
Showing 16 changed files with 353 additions and 26 deletions.
1 change: 0 additions & 1 deletion python/docs/source/concepts.md

This file was deleted.

5 changes: 5 additions & 0 deletions python/docs/source/conf.py
@@ -39,6 +39,10 @@
"use_issues_button": True,
"repository_branch": "main",
"path_to_docs": "kaskada/docs/source",
"announcement": (
"This describes the next version of Kaskada. "
"It is currently available as an alpha release."
),
"icon_links": [
{
"name": "GitHub",
@@ -54,6 +58,7 @@
],
"primary_sidebar_end": ["indices.html"],
"show_toc_level": 2,
"show_nav_level": 2,
}

templates_path = ["_templates"]
3 changes: 3 additions & 0 deletions python/docs/source/guide/aggregation.md
@@ -0,0 +1,3 @@
# Aggregation

## Windowing
108 changes: 108 additions & 0 deletions python/docs/source/guide/data_types.md
@@ -0,0 +1,108 @@
# Data Types

Kaskada operates on typed Timestreams.
Similar to how every Pandas `Series` has an associated `dtype`, every Kaskada `Timestream` has an associated type.
The set of supported types is based on the types supported by [Apache Arrow](https://arrow.apache.org/).

Each `Timestream` contains points of the corresponding type.
We'll often say that the "type" of a `Timestream` is the type of the values it contains.

Kaskada's type system describes several kinds of values.
Scalar types correspond to simple values, such as the string `"hello"` or the integer `57`.
A Timestream of a scalar type contains values of that type, or `null`.
Composite types are created from other types.
For instance, records may be created using scalar and other composite types as fields.
An expression producing a record type is a stream that produces a value of the given record type or `null`.

## Scalar Types

Scalar types include booleans, numbers, strings, timestamps, durations and calendar intervals.

:::{list-table} Scalar Types
:widths: 1, 3
:header-rows: 1

- * Types
  * Description
- * `bool`
  * Booleans represent true or false.

    Examples: `true`, `false`.
- * `u8`, `u16`, `u32`, `u64`
  * Unsigned integer numbers of the specified bit width.

    Examples: `0`, `1`, `1000`.
- * `i8`, `i16`, `i32`, `i64`
  * Signed integer numbers of the specified bit width.

    Examples: `0`, `1`, `-100`.
- * `f32`, `f64`
  * Floating point numbers of the specified bit width.

    Examples: `0`, `1`, `-100`, `1000`, `0.0`, `-1.0`, `-100837.631`.
- * `str`
  * Unicode strings.

    Examples: `"hello"`, `"hi 'bob'"`.
- * `timestamp_s`, `timestamp_ms`, `timestamp_us`, `timestamp_ns`
  * Points in time relative to the Unix Epoch (00:00:00 UTC on January 1, 1970).
    Time unit may be seconds (s), milliseconds (ms), microseconds (us) or nanoseconds (ns).

    Examples: `1639595174 as timestamp_s`.
- * `duration_s`, `duration_ms`, `duration_us`, `duration_ns`
  * A duration of a fixed amount of a specific time unit.
    Time unit may be seconds (s), milliseconds (ms), microseconds (us) or nanoseconds (ns).

    Examples: `-100 as duration_ms`.
- * `interval_days`, `interval_months`
  * A calendar interval corresponding to the given amount of the corresponding time.
    The length of an interval depends on the point in time it is added to.
    For instance, adding 1 `interval_months` to a timestamp will shift to the same day of the next month.

    Examples: `1 as interval_days`, `-100 as interval_months`.
:::
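
The difference between a timestamp, a fixed duration and a calendar interval can be illustrated with Python's standard `datetime` module. This is only an analogy and not how Kaskada implements these types. The `add_months` helper is hypothetical, and it deliberately does not handle the case where the target month is too short for the day of the month, since the table does not say how that case is treated.

```python
from datetime import datetime, timedelta, timezone


def add_months(ts: datetime, months: int) -> datetime:
    """Shift `ts` to the same day of a later (or earlier) month.

    Hypothetical helper for illustration only. `datetime.replace` raises
    ValueError if the target month has no such day (e.g. Jan 31 + 1 month).
    """
    month_index = ts.year * 12 + (ts.month - 1) + months
    year, month = divmod(month_index, 12)
    return ts.replace(year=year, month=month + 1)


# Analogous to `1639595174 as timestamp_s`: seconds since the Unix epoch.
ts = datetime.fromtimestamp(1639595174, tz=timezone.utc)

# A duration is a fixed amount of time, wherever it is applied.
# Analogous to adding `-100 as duration_ms`.
earlier = ts + timedelta(milliseconds=-100)

# A calendar interval has a length that depends on the starting point:
# one month from Jan 15, 2021 spans 31 days, one month from Feb 15 spans 28.
jan = datetime(2021, 1, 15, tzinfo=timezone.utc)
feb = add_months(jan, 1)
mar = add_months(feb, 1)
print(ts.isoformat(), (feb - jan).days, (mar - feb).days)
```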

## Record Types

Records combine one or more values of potentially different types into a single value.
Record types are unnamed: any two record types with the same set of field names and value types are considered equal.
Field names must start with a letter.

For example, `{name: str, age: u32 }` is a record type with two fields and `{name: 'Ben', age: 33 }` is a corresponding value.

NOTE: Record types may be nested.

## Type Coercion
Kaskada implicitly coerces numeric types when different kinds of numbers are combined.
For example, adding a 64-bit signed integer value to a 32-bit floating point value produces a 64-bit floating point value.

Type coercion will never produce an integer overflow or reduction in numeric precision.
If needed, such conversions must be explicitly specified using `as`.

The coercion rules can be summarized as follows:

1. Unsigned integers can be widened: `u8` → `u16` → `u32` → `u64`.
2. Integers can be widened: `i8` → `i16` → `i32` → `i64`.
3. Floating point numbers can be widened: `f16` → `f32` → `f64`.
4. Unsigned integers can be promoted to the next wider signed integer: `u8` → `i16`, `u16` → `i32`, `u32` → `i64`.
5. All numbers may be converted to `f64`.
6. Strings may be implicitly converted to timestamps by attempting to parse them as RFC3339 values.
   The timestamp will be `null` for strings that don't successfully parse.

As a consequence of these rules, when an operation is applied to two different numeric types, the result may be a third type to which both can be coerced.
The type promotion table shows the type resulting from a binary operation on each pair of numeric types.

| | `u8` | `u16` | `u32` | `u64` | `i8` | `i16` | `i32` | `i64` | `f16` | `f32` | `f64` |
| --------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| **`u8`** | `u8` | `u16` | `u32` | `u64` | `i16` | `i16` | `i32` | `i64` | `f16` | `f32` | `f64` |
| **`u16`** | `u16` | `u16` | `u32` | `u64` | `i32` | `i32` | `i32` | `i64` | `f16` | `f32` | `f64` |
| **`u32`** | `u32` | `u32` | `u32` | `u64` | `i64` | `i64` | `i64` | `i64` | `f32` | `f32` | `f64` |
| **`u64`** | `u64` | `u64` | `u64` | `u64` | `f64` | `f64` | `f64` | `f64` | `f64` | `f64` | `f64` |
| **`i8`** | `i16` | `i32` | `i64` | `f64` | `i8` | `i16` | `i32` | `i64` | `f16` | `f32` | `f64` |
| **`i16`** | `i16` | `i32` | `i64` | `f64` | `i16` | `i16` | `i32` | `i64` | `f16` | `f32` | `f64` |
| **`i32`** | `i32` | `i32` | `i64` | `f64` | `i32` | `i32` | `i32` | `i64` | `f16` | `f32` | `f64` |
| **`i64`** | `i64` | `i64` | `i64` | `f64` | `i64` | `i64` | `i64` | `i64` | `f16` | `f32` | `f64` |
| **`f16`** | `f16` | `f16` | `f16` | `f16` | `f16` | `f16` | `f16` | `f16` | `f16` | `f32` | `f64` |
| **`f32`** | `f32` | `f32` | `f32` | `f32` | `f32` | `f32` | `f32` | `f32` | `f32` | `f32` | `f64` |
| **`f64`** | `f64` | `f64` | `f64` | `f64` | `f64` | `f64` | `f64` | `f64` | `f64` | `f64` | `f64` |
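
The integer portion of the table can be derived mechanically from rules 1, 2, 4 and 5. The following is a minimal sketch of that derivation, not Kaskada's implementation. The function name `promote_integers` is made up for this example, and floating point operands are intentionally left out.

```python
INTEGER_TYPES = {f"{kind}{bits}" for kind in "ui" for bits in (8, 16, 32, 64)}


def promote_integers(a: str, b: str) -> str:
    """Return the result type for a binary operation on two integer types.

    Illustrative sketch only; covers the `u*` and `i*` types.
    """
    for name in (a, b):
        if name not in INTEGER_TYPES:
            raise ValueError(f"not an integer type: {name!r}")
    kind_a, bits_a = a[0], int(a[1:])
    kind_b, bits_b = b[0], int(b[1:])

    if kind_a == kind_b:
        # Rules 1 and 2: widen within the same family.
        return f"{kind_a}{max(bits_a, bits_b)}"

    # Rule 4: the unsigned operand needs a signed type twice as wide.
    unsigned_bits = bits_a if kind_a == "u" else bits_b
    signed_bits = bits_b if kind_a == "u" else bits_a
    needed = max(unsigned_bits * 2, signed_bits)
    # Rule 5: no signed integer can hold every u64, so fall back to f64.
    return f"i{needed}" if needed <= 64 else "f64"


print(promote_integers("u8", "i8"))
print(promote_integers("u16", "i8"))
print(promote_integers("u64", "i32"))
```

The results for these three calls match the corresponding cells of the table (`i16`, `i32` and `f64`).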
61 changes: 61 additions & 0 deletions python/docs/source/guide/entities.md
@@ -0,0 +1,61 @@
# Entities and Grouping

Entities organize data for use in feature engineering.
They describe the particular objects that a prediction will be made for.
The result of a feature computation is a _feature vector_ for each entity at various points in time.

## What is an Entity?
Entities represent the categories or "nouns" associated with the data.
They can generally be thought of as any category of object related to the events being processed.
For example, when manipulating purchase events, there may be entities for the customers, vendors and items being purchased.
Each purchase event may be related to a customer, a vendor, and one or more items.

If something can be given a name or other unique identifier, it can likely be used as an entity.
In a relational database, an entity would be anything that is identified by the same key in a set of tables.

## What is an Entity Key?
An entity kind is a category of objects, for example customer or vendor.
An entity key identifies a unique instance of that category -- a `customer_id` or a `vendor_id`.

One may think of an entity as a table containing instances -- or rows -- of that type of entity.
The entity key would be the primary key of that table.

The following table shows some example entities and an instance of each.
Many of the example instances are not suitable as entity keys, for the same reason you wouldn't use them as primary keys.
For example, using `Vancouver` to identify cities would lead to ambiguity between Vancouver in British Columbia and Vancouver in Washington State.
In these cases, you'd likely use some other identifier for instances.
Others, such as the airport code, may work well as keys.

:::{list-table} Example Entities and corresponding keys.
:header-rows: 1

* - Example Entity
  - Example Entity Instance
* - Houses
  - 1600 Pennsylvania Avenue
* - Airports
  - SEA
* - Customers
  - John Doe
* - Cities
  - Vancouver
* - States
  - Washington
:::

## Entities and Aggregation

Many, if not all, Kaskada queries involve aggregating events to produce values.
Entities provide an implicit grouping for the aggregation.
The expression `sum(Purchases.amount)` is an aggregation that returns the sum of purchases made _by each entity_.
This is helpful since the _feature vector_ for an entity will depend only on events related to that entity.

```{todo}
Example of grouped streams and aggregation
```
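
Until the example above is filled in, the idea of implicit per-entity grouping can be sketched in plain Python. This does not use the Kaskada API. The function `running_sum_by_entity` and the event tuples are invented for illustration, and the sketch only mimics what an aggregation such as `sum(Purchases.amount)` produces conceptually.

```python
from collections import defaultdict


def running_sum_by_entity(events):
    """Compute a running sum of `amount` separately for each entity key.

    `events` is an iterable of (time, entity_key, amount) tuples.
    Returns one (time, entity_key, sum_so_far) tuple per input event.
    """
    totals = defaultdict(int)
    results = []
    # Process events in time order, as a timestream would.
    for time, key, amount in sorted(events, key=lambda event: event[0]):
        totals[key] += amount
        # The output for an entity depends only on that entity's events.
        results.append((time, key, totals[key]))
    return results


purchases = [
    (1, "A", 5),
    (2, "B", 24),
    (3, "A", 17),
]
for row in running_sum_by_entity(purchases):
    print(row)
```

Entity `A` reaches a sum of 22 at time 3 while entity `B` stays at 24: each entity's result is unaffected by the other entity's events.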

## Joining

Joining with the same entity happens automatically.
Joining with other entities (and even other kinds of entities) is done using `lookup`.
See [Joins](joins.md) for more information.
@@ -1,4 +1,4 @@
# Introduction
# User Guide

Understanding and reacting to the world in real-time requires understanding what is happening _now_ in the context of what happened in the past.
You need the ability to understand if what just happened is unusual, how it relates to what happened previously, and how it relates to other things that are happening at the same time.
@@ -12,14 +12,10 @@ Use time-travel to compute training examples from historic data and understand h

## What are "Timestreams"?

A [Timestream](../reference/timestream/index) describes how a value changes over time. In the same way that SQL
queries transform tables and graph queries transform nodes and edges,
Kaskada queries transform Timestreams.
A [Timestream](timestreams) describes how a value changes over time.
In the same way that SQL queries transform tables and graph queries transform nodes and edges, Kaskada queries transform Timestreams.

In comparison to a timeseries which often contains simple values (e.g., numeric
observations) defined at fixed, periodic times (i.e., every minute), a Timestream
contains any kind of data (records or collections as well as primitives) and may
be defined at arbitrary times corresponding to when the events occur.
In comparison to a timeseries which often contains simple values (e.g., numeric observations) defined at fixed, periodic times (e.g., every minute), a Timestream contains any kind of data (records or collections as well as primitives) and may be defined at arbitrary times corresponding to when the events occur.

## Getting Started with Timestreams

@@ -35,4 +31,17 @@ data = t.sources.Parquet.from_file(
key = "user")
# Get the count of events associated with each user over time, as a dataframe.
data.count().run().to_pandas()
```

```{toctree}
:hidden:
:maxdepth: 2
installation
timestreams
data_types
entities
aggregation
joins
sources
```
25 changes: 25 additions & 0 deletions python/docs/source/guide/installation.md
@@ -0,0 +1,25 @@
# Installation

To install Kaskada, you need to be using Python >= 3.8.
We suggest using 3.11 or newer, since that provides more precise error locations.

```{code-block} bash
:caption: Installing Kaskada
pip install "kaskada>=0.6.0-a.0"
```

```{warning}
This version of Kaskada is currently a pre-release, as indicated by the `-a.0` suffix.
It will not be installed by default if you `pip install kaskada`.
You need to either use `pip install --pre kaskada` or request an explicit version, as shown in the example.
```

```{admonition} Pip and pip3 and permissions
:class: tip
Depending on your Python installation and configuration, you may have `pip3` instead of `pip` available in your terminal.
If you do have `pip3`, replace `pip` with `pip3` in your command, e.g., `pip3 install kaskada`.
If you get a permission error when running the `pip` command, you may need to run as an administrator using `sudo pip install kaskada`.
If you don't have administrator access (e.g., in Google Colab or other hosted environments), you may use `pip`'s `--user` flag to install the package in your user directory.
```
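
After installing, it can help to confirm which interpreter and package version are actually in use. The following is a small sketch using only the Python standard library. The helper names are made up for this example, and it assumes the distribution is published under the name `kaskada`, as in the `pip` command above.

```python
import sys
from importlib import metadata
from typing import Optional, Tuple


def python_is_supported(minimum: Tuple[int, int] = (3, 8)) -> bool:
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python supported:", python_is_supported())
print("kaskada version:", installed_version("kaskada"))
```

If the second line prints `None`, the package was probably installed for a different interpreter than the one running the script.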
18 changes: 18 additions & 0 deletions python/docs/source/guide/joins.md
@@ -0,0 +1,18 @@
# Joins


## Domains and Implicit Joins

It is sometimes useful to consider the _domain_ of an expression.
This corresponds to the points in time and entities associated with the points in the expression.
For discrete timestreams, this corresponds to the points at which those values occur.
For continuous timestreams, this corresponds to the points at which the value changes.

Whenever expressions with two or more different domains are used in the same expression, they are implicitly joined.
The join is an outer join that contains an event if any of the input domains contains an event.
For any input that is continuous, the join is `as of` the time of the output, taking the latest value from that input.
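
The behavior described above can be sketched in plain Python for a single entity. This is a conceptual illustration, not the Kaskada API. The function `implicit_join` is hypothetical, both inputs are treated as continuous, and each input is represented as a dictionary from time to value.

```python
def implicit_join(left, right):
    """Outer-join two continuous streams for a single entity.

    `left` and `right` map a time to the value that starts at that time.
    The output has one row for every time present in either input, and
    each row takes the latest value of each input as of that time.
    """
    rows = []
    latest_left = None
    latest_right = None
    # The output domain is the union of the two input domains.
    for time in sorted(set(left) | set(right)):
        if time in left:
            latest_left = left[time]
        if time in right:
            latest_right = right[time]
        rows.append((time, latest_left, latest_right))
    return rows


balance = {1: 10, 3: 30}
status = {2: "active"}
for row in implicit_join(balance, status):
    print(row)
```

At time 2 only `status` has an event, so the row carries forward the latest `balance` value (10). At time 1, `status` has no value yet and is `None`.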


## Implicit Joins

## Explicit Lookups
1 change: 1 addition & 0 deletions python/docs/source/guide/sources.md
@@ -0,0 +1 @@
# Sources
3 changes: 3 additions & 0 deletions python/docs/source/guide/timestreams.md
@@ -0,0 +1,3 @@
# Timestreams

## Continuity
13 changes: 4 additions & 9 deletions python/docs/source/index.md
@@ -86,18 +86,13 @@ Compute temporal joins at the correct times, without risk of leakage.

```{toctree}
:hidden:
:maxdepth: 3
why
tour
quickstart
concepts
examples/index
```

```{toctree}
:caption: User Guide
:hidden:
:maxdepth: 1
guide/introduction
guide/index
```

```{toctree}
12 changes: 6 additions & 6 deletions python/docs/source/quickstart.md
@@ -20,12 +20,12 @@ kd.init_session()
content = "\n".join(
[
"time,key,m,n",
"1996-12-19T16:39:57-08:00,A,5,10",
"1996-12-19T16:39:58-08:00,B,24,3",
"1996-12-19T16:39:59-08:00,A,17,6",
"1996-12-19T16:40:00-08:00,A,,9",
"1996-12-19T16:40:01-08:00,A,12,",
"1996-12-19T16:40:02-08:00,A,,",
"1996-12-19T16:39:57,A,5,10",
"1996-12-19T16:39:58,B,24,3",
"1996-12-19T16:39:59,A,17,6",
"1996-12-19T16:40:00,A,,9",
"1996-12-19T16:40:01,A,12,",
"1996-12-19T16:40:02,A,,",
]
)
source = kd.sources.CsvString(content, time_column_name="time", key_column_name="key")
1 change: 0 additions & 1 deletion python/docs/source/reference/timestream/index.md
@@ -10,7 +10,6 @@

```{toctree}
:hidden:
:maxdepth: 3
aggregation
arithmetic