Skip to content

Commit

Permalink
feat: Rename Expr to Timestream; stub docs (#603)
Browse files Browse the repository at this point in the history
Main change is renaming `Expr` to `Timestream`.

Other changes include:
- Adding methods (and tests) for basic math and comparison operators.
- Getting `nox -s docs` working for some rudimentary documentation
(WIP).
- Fleshing out docs for existing methods.
  • Loading branch information
bjchambers authored Aug 4, 2023
1 parent 962c15d commit 2bb67c1
Show file tree
Hide file tree
Showing 28 changed files with 1,008 additions and 505 deletions.
10 changes: 0 additions & 10 deletions crates/sparrow-session/src/expr.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,16 +12,6 @@ impl Expr {
_ => None,
}
}

pub fn equivalent(&self, other: &Expr) -> bool {
// This isn't quite correct -- we should lock everything and then compare.
// But, this is a temporary hack for the Python builder.
self.0.value() == other.0.value()
&& self.0.is_new() == other.0.is_new()
&& self.0.value_type() == other.0.value_type()
&& self.0.grouping() == other.0.grouping()
&& self.0.time_domain() == other.0.time_domain()
}
}

pub enum Literal {
Expand Down
3 changes: 1 addition & 2 deletions sparrow-py/.flake8
Original file line number Diff line number Diff line change
@@ -1,8 +1,7 @@
[flake8]
select = B,B9,C,D,DAR,E,F,N,W
select = B,B9,C,E,F,N,W
ignore = E203,E501,W503
max-line-length = 100
max-complexity = 10
docstring-convention = numpy
rst-roles = class,const,func,meth,mod,ref
rst-directives = deprecated
15 changes: 11 additions & 4 deletions sparrow-py/README.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,16 @@
# Python Library for Kaskada
# Kaskada Timestreams

`sparrow-py` provides a Python library for building and executing Kaskada queries using an embedded Sparrow library.
It uses [Pyo3][pyo3] to generate Python wrappers for the Rust code, and then provides more Python-friendly implementations on top of that.
<!-- start elevator-pitch -->
Kaskada's `timestreams` library makes it easy to work with structured event-based data.
Define temporal queries on event-based data loaded from Python, using Pandas or PyArrow and push new data in as it occurs.
Or, execute the queries directly on events in your data lake and/or as they arrive on a stream.

[pyo3]: https://github.com/PyO3/pyo3
With Kaskada you can unleash the value of real-time, temporal queries without the complexity of "big" infrastructure components like a distributed stream or stream processing system.

Under the hood, `timestreams` is an efficient temporal query engine built in Rust.
It is built on Apache Arrow, using the same columnar execution strategy that makes ...

<!-- end elevator-pitch -->

## Install Python

Expand Down
57 changes: 57 additions & 0 deletions sparrow-py/docs/conf.py
Original file line number Diff line number Diff line change
@@ -1,11 +1,68 @@
"""Sphinx configuration."""
from typing import Any
from typing import Dict

project = "sparrow-py"
author = "Kaskada Contributors"
copyright = "2023, Kaskada Contributors"
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.napoleon",
"sphinx.ext.intersphinx",
"sphinx.ext.todo",
"myst_parser",
]
autodoc_typehints = "description"
html_theme = "furo"
html_title = "Kaskada"
language = "en"

intersphinx_mapping: Dict[str, Any] = {
'python': ('http://docs.python.org/3', None),
'pandas': ('http://pandas.pydata.org/docs', None),
'pyarrow': ('https://arrow.apache.org/docs', None),
}

html_theme_options: Dict[str, Any] = {
"footer_icons": [
{
"name": "GitHub",
"url": "https://github.com/kaskada-ai/kaskada",
"html": """
<svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 16 16">
<path fill-rule="evenodd" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z"></path>
</svg>
""",
"class": "",
},
],
"source_repository": "https://github.com/kaskada-ai/kaskada/",
"source_branch": "main",
"source_directory": "sparrow-py/docs/",
}

# Options for Todos
todo_include_todos = True

# Options for Myst (markdown)
# https://myst-parser.readthedocs.io/en/v0.17.1/syntax/optional.html
myst_enable_extensions = [
"colon_fence",
"deflist",
"smartquotes",
"replacements",
]
myst_heading_anchors = 3

# Suggested options from Furo theme
# -- Options for autodoc ----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#configuration

# Automatically extract typehints when specified and place them in
# descriptions of the relevant function/method.
autodoc_typehints = "description"

# Don't show class signature with the class' name.
autodoc_class_signature = "separated"

autodoc_type_aliases = { 'Arg': 'sparrow_py.Arg' }
50 changes: 48 additions & 2 deletions sparrow-py/docs/index.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,51 @@
```{include} ../README.md
---
end-before: <!-- github-only -->
hide-toc: true
---

# Kaskada Timestreams

```{include} ../README.md
:start-after: <!-- start elevator-pitch -->
:end-before: <!-- end elevator-pitch -->
```

## Getting Started with Timestreams

Getting started with Timestreams is as simple as `pip` installing the Python library, loading some data and running a query.

```python
import timestreams as t

# Read data from a Parquet file.
data = t.sources.Parquet.from_file(
"path_to_file.parquet",
time = "time",
key = "user")
# Get the count of events associated with each user over time, as a dataframe.
data.count().run().to_pandas()
```

## What are "Timestreams"?
A [Timestream](reference/timestream) describes how a value changes over time. In the same way that SQL
queries transform tables and graph queries transform nodes and edges,
Kaskada queries transform Timestreams.

In comparison to a timeseries which often contains simple values (e.g., numeric
observations) defined at fixed, periodic times (i.e., every minute), a Timestream
contains any kind of data (records or collections as well as primitives) and may
be defined at arbitrary times corresponding to when the events occur.

```{toctree}
:hidden:
quickstart
```

```{toctree}
:caption: Reference
:hidden:
reference/timestream
reference/sources
reference/execution
```
6 changes: 6 additions & 0 deletions sparrow-py/docs/quickstart.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
# Quick Start

```{todo}
Write the quick start.
```
11 changes: 11 additions & 0 deletions sparrow-py/docs/reference/execution.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# Execution

There are several ways to execute a Timestream.
The {py:meth}`run method <sparrow_py.Timestream.run>` executes the Timestream over the current data, and produces a {py:class}`Result <sparrow_py.Result>`.

## Result
```{eval-rst}
.. autoclass:: sparrow_py.Result
:exclude-members: __init__
:members:
```
6 changes: 6 additions & 0 deletions sparrow-py/docs/reference/sources.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
# Sources

```{eval-rst}
.. automodule:: sparrow_py.sources
:members:
```
17 changes: 17 additions & 0 deletions sparrow-py/docs/reference/timestream.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Timestream

```{todo}
- [ ] Expand the `Arg` type alias in timestreams accordingly.
```

```{eval-rst}
.. autoclass:: sparrow_py.Timestream
:exclude-members: __init__
:members:
:special-members:
```

```{eval-rst}
.. autoclass:: sparrow_py.Arg
:members:
```
11 changes: 7 additions & 4 deletions sparrow-py/noxfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,17 +49,20 @@ def check_lint(session: Session) -> None:
"darglint",
"flake8",
"flake8-bugbear",
"flake8-docstrings",
"flake8-rst-docstrings",
"isort",
"pep8-naming",
"pydocstyle",
"pyupgrade",
)
session.run
session.run("black", "--check", *args)
session.run("darglint", "pysrc")
session.run("flake8", *args)
session.run("isort", "--filter-files", "--check-only", *args)

# Only do darglint and pydocstyle on pysrc (source)
session.run("darglint", "pysrc")
session.run("pydocstyle", "--convention=numpy", "pysrc")
# No way to run this as a check.
# session.run("pyupgrade", "--py38-plus")

Expand Down Expand Up @@ -97,7 +100,7 @@ def mypy(session: Session) -> None:
# However, there is a possibility it slows things down, by making mypy
# run twice -- once to determine what types need to be installed, then once
# to check things with those stubs.
session.run("mypy", "--install-types", *args)
session.run("mypy", "--install-types", "--non-interactive", *args)
if not session.posargs:
session.run("mypy", f"--python-executable={sys.executable}", "noxfile.py")

Expand Down Expand Up @@ -172,7 +175,7 @@ def docs(session: Session) -> None:
"""Build and serve the documentation with live reloading on file changes."""
args = session.posargs or ["--open-browser", "docs", "docs/_build"]
install_self(session)
session.install("sphinx", "sphinx-autobuild", "furo", "myst-parser")
session.install("sphinx", "sphinx-autobuild", "furo", "myst-parser", "pandas", "pyarrow", "sphinx-autodoc-typehints")

build_dir = Path("docs", "_build")
if build_dir.exists():
Expand Down
23 changes: 4 additions & 19 deletions sparrow-py/poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion sparrow-py/pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -15,14 +15,14 @@ coverage = {extras = ["toml"], version = ">=6.2"}
darglint = ">=1.8.1"
flake8 = ">=4.0.1"
flake8-bugbear = ">=21.9.2"
flake8-docstrings = ">=1.6.0"
flake8-rst-docstrings = ">=0.2.5"
furo = ">=2021.11.12"
isort = ">=5.10.1"
mypy = ">=0.930"
pandas-stubs = "^2.0.2"
pep8-naming = ">=0.12.1"
pyarrow = "^12.0.1"
pydocstyle = "^6.3.0"
pytest = ">=6.2.5"
pyupgrade = ">=2.29.1"
safety = ">=1.10.3"
Expand Down
19 changes: 5 additions & 14 deletions sparrow-py/pysrc/sparrow_py/__init__.py
Original file line number Diff line number Diff line change
@@ -1,29 +1,20 @@
"""Kaskada query builder and local executon engine."""
from typing import Dict
from typing import List
from typing import Union

from . import sources
from ._execution import ExecutionOptions
from ._expr import Expr
from ._result import Result
from ._session import init_session
from ._timestream import Arg
from ._timestream import Timestream
from ._timestream import record
from ._windows import SinceWindow
from ._windows import SlidingWindow
from ._windows import Window


def record(fields: Dict[str, Expr]) -> Expr:
"""Create a record from the given keyword arguments."""
import itertools

args: List[Union[str, "Expr"]] = list(itertools.chain(*fields.items()))
return Expr.call("record", *args)


__all__ = [
"Arg",
"ExecutionOptions",
"Expr",
"Timestream",
"init_session",
"record",
"Result",
Expand Down
1 change: 0 additions & 1 deletion sparrow-py/pysrc/sparrow_py/_execution.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,3 @@ class ExecutionOptions:

row_limit: Optional[int] = None
max_batch_size: Optional[int] = None

Loading

0 comments on commit 2bb67c1

Please sign in to comment.