docs: Use pydata theme and setup ablog (#732)
bjchambers authored Sep 2, 2023
1 parent 178c4a1 commit 5cbbbff
Showing 17 changed files with 945 additions and 255 deletions.
15 changes: 11 additions & 4 deletions .github/workflows/ci_python.yml
@@ -88,7 +88,6 @@ jobs:
- uses: actions/setup-python@v4
with:
python-version: |
3.8
3.9
3.10
3.11
@@ -117,7 +116,7 @@ jobs:
# This installs the kaskada package using the wheel.
# This ensures that we don't accidentally install the version from pypi.
run: |
for V in 3.8 3.9 3.10 3.11; do
for V in 3.9 3.10 3.11; do
echo "::group::Install for Python $V"
poetry env use $V
poetry env info
@@ -133,14 +132,22 @@
echo "::endgroup::"
deactivate
done
- name: Setup QT
# Needed by sphinx-social-cards.
# https://github.com/2bndy5/sphinx-social-cards/blob/main/.github/workflows/build.yml#L54
run: |
sudo apt-get install -y libgl1-mesa-dev libxkbcommon-x11-0
echo "QT_QPA_PLATFORM=offscreen" >> "$GITHUB_ENV"
- name: Build docs
# ablog doesn't currently indicate whether it supports parallel reads,
# leading to a warning.
# when possible, add `"-j", "auto",` to do parallel builds (and in nox).
run: |
sudo apt install -y libegl1
poetry env use 3.11
source $(poetry env info --path)/bin/activate
poetry install --with=docs
pip install ${WHEEL} --force-reinstall
sphinx-build docs/source docs/_build -j auto -W
sphinx-build docs/source docs/_build -W # -j auto
deactivate
- name: Upload docs
uses: actions/upload-pages-artifact@v2
6 changes: 2 additions & 4 deletions .github/workflows/release_python.yml
@@ -94,7 +94,6 @@ jobs:
- uses: actions/setup-python@v4
with:
python-version: |
3.8
3.9
3.10
3.11
@@ -123,7 +122,7 @@ jobs:
run: |
WHEEL="dist/kaskada-${{ needs.version.outputs.version }}-cp38-abi3-${{ matrix.wheel_suffix }}.whl"
echo "WHEEL:${WHEEL}"
for V in 3.8 3.9 3.10 3.11; do
for V in 3.9 3.10 3.11; do
echo "::group::Install for Python $V"
poetry env use $V
source $(poetry env info --path)/bin/activate
@@ -219,7 +218,6 @@ jobs:
- uses: actions/setup-python@v4
with:
python-version: |
3.8
3.9
3.10
3.11
@@ -249,7 +247,7 @@ jobs:
run: |
WHEEL="dist/kaskada-${{ needs.version.outputs.version }}-cp38-abi3-manylinux_2_28_${{ matrix.target }}.whl"
echo "WHEEL:${WHEEL}"
for V in 3.8 3.9 3.10 3.11; do
for V in 3.9 3.10 3.11; do
echo "::group::Install for Python $V"
poetry env use $V
poetry env info
46 changes: 46 additions & 0 deletions python/docs/source/_layouts/default.yml
@@ -0,0 +1,46 @@
layers:
# the base layer for the background
- background:
color: "#26364a"
image: >-
#% if page.meta.card_image -%#
'{{ page.meta.card_image }}'
#%- elif layout.background_image -%#
'{{ layout.background_image }}'
#%- endif %#
# the layer for the logo image
- size: { width: 300, height: 83 }
offset: { x: 60, y: 60 }
icon:
image: "_static/kaskada-negative.svg"
# the layer for the page's title
- size: { width: 920, height: 300 }
offset: { x: 60, y: 180 }
typography:
content: >-
#% if page.meta.title -%#
'{{ page.meta.title }}'
#%- elif page.title -%#
'{{ page.title }}'
#%- endif %#
line:
# height: 0.85
amount: 3
font:
weight: 500
color: white
# the layer for the site's (or page's) description
- offset: { x: 60, y: 480 }
size: { width: 1080, height: 90 }
typography:
content: >-
#% if page.meta and page.meta.description -%#
'{{ page.meta.description }}'
#%- else -%#
'{{ config.site_description }}'
#%- endif %#
line:
height: 0.87
amount: 2
align: start bottom
color: white
10 changes: 10 additions & 0 deletions python/docs/source/blog/index.md
@@ -0,0 +1,10 @@
# Blog

```{eval-rst}
.. postlist::
:list-style: circle
:format: {title}
:excerpts:
:sort:
:expand: Read more ...
```
76 changes: 76 additions & 0 deletions python/docs/source/blog/posts/2023-03-28-announcing-kaskada-oss.md
@@ -0,0 +1,76 @@
---
blogpost: true
author: ben
date: 2023-03-28
tags: releases
excerpt: 1
description: From Startup to Open Source Project
---

# Announcing Kaskada OSS

Today, we’re announcing the open-source release of Kaskada – a modern event-processing engine.

# How it began: Simplifying ML

Kaskada technology has evolved a lot since we began developing it three years ago. Initially, we were laser-focused on the machine learning (ML) space. We saw many companies working on different approaches to the same ML problems -- managing computed feature values (what is now called a feature store), applying existing algorithms to train a model from those values, and serving that model by applying it to computed feature values. We saw a different problem.

With our background in the data processing space we identified a critical gap -- no one was looking at the process of going from raw, event-based data to computed feature values. This meant that users had to choose – use SQL and treat the events as a table, losing important information in the process, or use lower-level data pipeline APIs and worry about all the details. Our experience working on data processing systems at Google and as part of Apache Beam led us to create a compute engine designed for the needs of feature engineering — we called it a feature engine.

We are extremely proud of where Kaskada technology is today. Unlike a feature store, it focuses on computing the features a user describes using a simple, declarative language. Unlike existing data processing systems, it delivers on the needs of machine learning – expressing sophisticated, temporal features without leakage, working with raw events without pre-processing, and scalability that just works for training and serving.

The unique characteristics of Kaskada make it ideal for the time-based event processing required for accurate, real-time machine learning. While we see that ML will always be a great use case for Kaskada, we’ve realized it can be used for so much more.

# Modern, Open-Source Event Processing

When [DataStax acquired Kaskada](https://www.datastax.com/press-release/datastax-acquires-machine-learning-company-kaskada-to-unlock-real-time-ai) a few months ago, we began the process of open-sourcing the core Kaskada technology. In the conversations that followed, we realized that the capabilities of Kaskada that make it ideal for real-time ML – easy to use, high-performance columnar computations over event-based data – also make it great for general event processing. These features include:

1. **Rich, Temporal Operations**: The ability to easily express computations over time beyond windowed aggregations. For instance, when computing training data it was often necessary to compute values at a point in time in the past and combine those with a label value computed at a later point in time. This led to a powerful set of operations for working with time.
2. **Events all the way down**: The ability to run a query both to get all results over time and just the final results. This means that Kaskada operates directly on the events – turning a sequence of events into a sequence of changes, which may be observed directly or materialized to a table. By treating everything as events, the temporal operations are always available and you never need to think about the difference between streams and tables, nor do you need to use different APIs for each.
3. **Modern and easy to use**: Kaskada is built in Rust and uses Apache Arrow for high-performance, columnar computations. It consists of a single binary which makes for easy local and cloud deployments.
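
To make the first point concrete, here is a minimal, stdlib-only Python sketch of the leakage-free pattern it describes – a feature computed as of a point in time, paired with a label observed later. The helper names are hypothetical illustrations, not Kaskada's API:

```python
from bisect import bisect_right

# Events for one entity: (timestamp, amount), sorted by timestamp.
events = [(1, 10.0), (3, 5.0), (7, 2.0), (9, 8.0)]

def feature_asof(events, t):
    """Sum of amounts observed at or before time t (a point-in-time feature)."""
    idx = bisect_right([ts for ts, _ in events], t)
    return sum(amount for _, amount in events[:idx])

def label_after(events, t, horizon):
    """Label: did any event occur in the window (t, t + horizon]?"""
    return any(t < ts <= t + horizon for ts, _ in events)

# The feature at t=5 uses only events up to t=5 (no leakage);
# the label looks ahead into (5, 10].
example = (feature_asof(events, 5), label_after(events, 5, 5))
print(example)  # (15.0, True)
```

A temporal engine generalizes this: rather than hand-writing the bookkeeping per feature, the point-in-time semantics are built into every operation.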


This led to the decision to open source Kaskada as a modern, open-source event-processing language and native engine. Machine learning is still a great use case of Kaskada, but we didn’t want the feature engine label to constrain community creativity and innovation. It’s all available today in the [GitHub repository](https://github.com/kaskada-ai/kaskada) under the Apache 2.0 License.

# Why use Kaskada?

Kaskada is for you if…

1. **You want to compute the results of your query over time.**
Operating over time all the way down means that Kaskada makes it easy to compute the result of any query over time.

2. **You want to express temporal computations without writing pages of SQL.**
Kaskada provides a declarative language for event processing. Because of its focus on temporal computations and composability, queries are much easier to write and shorter than comparable SQL.

3. **You want to process events today without setting up other tools.**
The columnar event-processing engine within Kaskada scales to X million events/second running on a single machine. This lets you get started and iterate quickly without becoming an expert in cluster management or big-data tools.


# What’s coming next?

Our first goal was getting the project released. Now that it is, we are excited to see where the project goes!

Some improvements on our mind are shown below. We look forward to hearing your thoughts on what would help you process events.

1. **Increase extensibility and participate in the larger open-source community.**
- Introduce extension points for I/O connectors and contribute connectors for a larger set of supported formats.
- Expose a logical execution plan after the language constructs have been compiled away, so that other executors may be developed using the same parsing and type-checking rules.
- Introduce extension points for custom schema catalogs, allowing Kaskada queries to be compiled against existing data catalogs.

2. **Align query capabilities with more general, event-processing use cases.**
    - Ability to create composite events from patterns of existing events and subsequently process those composite events (complex event processing, or “CEP”).
- Improvements to the declarative language to reduce surprises, make it more familiar to new users, and make it even easier to express temporal computations over events.

3. **Continue to improve local performance and usability.**
- Make it possible to use the engine more easily in a variety of ways – via a command line REPL, via an API, etc.
- Improve performance and latency of real-time and partitioned execution within the native engine.

# How can I contribute?

Give it a try – [download one of the releases](https://github.com/kaskada-ai/kaskada/releases) and run some computations on your event data. Let us know how it works for you, and what you’d like to see improved!

We’d love to hear what you think - please comment or ask on our [Kaskada GitHub discussions page](https://github.com/kaskada-ai/kaskada/discussions).

Help spread the word – Star and Follow the project on GitHub!

Please file issues, start discussions or join us on GitHub to chat about the project or event-processing in general.
85 changes: 85 additions & 0 deletions python/docs/source/blog/posts/2023-08-25-new-kaskada.md
@@ -0,0 +1,85 @@
---
blogpost: true
date: 2023-08-25
author: ryan
tags: releases
excerpt: 2
description: Embedded in Python for accessible Real-Time AI
---

# Introducing the New Kaskada

We started Kaskada with the goal of simplifying the real-time AI/ML lifecycle, and in the past year AI has exploded in usefulness and accessibility. Generative models and Large Language Models (LLMs) have revolutionized how we approach AI. Their accessibility and incredible capabilities have made AI more valuable than it has ever been and democratized the practice of AI.

Still, a challenge remains: building and managing real-time AI applications.

## The Challenge of using Real-Time Data in AI Applications

Real-time data for AI Applications has always been surrounded by an array of challenges. For example:

1. **Infrastructure Hurdles**: Accessing real-time data often means struggling to acquire data and deploying complex infrastructure, requiring significant time and expertise to get right.

2. **Cumbersome Tools**: Traditional tools for streaming data are bulky, with steep learning curves and complex JVM-based setups.

3. **Analysis Disconnect**: AI models thrive on historical data, but the tools designed for bulk historical analysis are often worlds apart from those made for real-time or streaming data processing.

4. **Challenges of Time-Travel**: AI applications frequently require a unique kind of historical analysis – one that can time-travel through your data. Expressing such analyses is challenging with conventional analytic tools that weren’t designed with time in mind.

These challenges have made it difficult for all but the largest companies with the deepest development budgets to deliver on the promise of real-time AI, and these are the challenges we built Kaskada to solve.

## Welcome to the New Kaskada

We originally built Kaskada as a managed service. Earlier this year, we [released Kaskada as an open-source, self-managed service](./2023-03-28-announcing-kaskada-oss.md), simplifying data onboarding and allowing Kaskada to be deployed anywhere.

Today, we take the next step in improving Kaskada’s usability by providing its core compute engine as an embedded Python library. Because Kaskada is written in Rust, we’re able to leverage the excellent [PyO3](https://pyo3.rs/) project to compile Python-native bindings for our compute engine and support Python-defined UDFs. Additionally, Kaskada is built using [Apache Arrow](https://arrow.apache.org/), which allows zero-copy data transfers between Kaskada and other Python libraries such as [Pandas](https://pandas.pydata.org/), so you can operate on your data in place.

We’re also changing how you query Kaskada by implementing our query DSL as Python functions. This change makes it easier to get started by eliminating the learning curve of a new language and improving integration with code editors, syntax highlighters, and AI coding assistants.

The result is an easy-to-use, Python-native library with all the efficiency and performance of our low-level Rust implementation, fully integrated with the rich Python ecosystem of AI/ML tools, visualization libraries, and more.

## Features for Real-Time AI Applications

Real-Time AI is easier today than it's ever been:

* Foundation models built by OpenAI, Facebook and others can be used as a starting point, allowing sophisticated applications to be built with a fraction of the data that would otherwise be necessary.
* Services such as OpenAI eliminate the need to manage complex infrastructure.
* Platforms like HuggingFace have made it easier than ever to share and collaborate on open LLMs.

The New Kaskada complements these resources, making it easier than ever to utilize real-time data by providing several key components:

### 1. Real-time Aggregation

In a world where data is continuously flowing, being able to efficiently precompute model inputs is invaluable. With Kaskada's real-time aggregation, you can effortlessly:

- Connect with multiple data streams using our robust data connectors.
- Transform data on-the-go, ensuring that the model receives the most relevant inputs.
- Perform complex aggregations to derive meaningful insights from streams of data, making sure your AI models always have the most pertinent information.
- Pause and resume aggregations in the event of process termination.

The result? Faster decision-making, timely insights, and AI models that are always a step ahead.
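
The core idea of incremental, per-entity aggregation can be sketched in a few lines of stdlib Python. This is a simplified illustration of the pattern, not Kaskada's engine or API:

```python
from collections import defaultdict

class RunningMean:
    """A mean maintained incrementally, updated one event at a time."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        self.count += 1
        self.total += value

    @property
    def value(self):
        return self.total / self.count if self.count else 0.0

# One aggregate per entity key, updated as events stream in;
# no batch recomputation is ever needed.
means = defaultdict(RunningMean)
for user, amount in [("a", 10.0), ("b", 4.0), ("a", 20.0)]:
    means[user].add(amount)

print(means["a"].value)  # 15.0
```

A real engine adds windowing, persistence for pause/resume, and columnar execution on top of this basic shape.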

### 2. Event Detection

Real-time event detection can mean the difference between catching an anomaly and letting it slip through the cracks. The New Kaskada’s event detection system is designed to:

- Expressively describe complex cross-event and cross-entity conditions to use as triggers.
- Identify important activities and patterns as they occur, ensuring nothing goes unnoticed.
- Trigger proactive AI behaviors, allowing for immediate actions or notifications based on the detected events.

From spotting fraudulent activities to identifying high-priority user behaviors, Kaskada ensures that important activities are always on your radar.
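
As a toy illustration of a cross-event trigger (again stdlib Python, not Kaskada's API), consider firing when an entity produces three failed logins within a sliding 60-second window:

```python
from collections import defaultdict, deque

WINDOW = 60.0   # seconds
THRESHOLD = 3   # failures within the window that fire the trigger

recent = defaultdict(deque)  # entity -> timestamps of recent failures

def on_failed_login(entity, ts):
    """Return True when `entity` reaches THRESHOLD failures within WINDOW."""
    q = recent[entity]
    q.append(ts)
    # Drop failures that have aged out of the sliding window.
    while q and ts - q[0] > WINDOW:
        q.popleft()
    return len(q) >= THRESHOLD

fired = [on_failed_login("user1", t) for t in (0.0, 10.0, 30.0, 200.0)]
print(fired)  # [False, False, True, False]
```

A declarative engine lets you state the condition once and evaluate it across all entities and event sources, instead of maintaining this state by hand.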

### 3. History Replay

Past data holds the keys to effective future decisions. With Kaskada's history replay, you can:

- Backtest AI models by revisiting historical data points.
- Fine-tune models using per-example time travel, ensuring your models are always optimized based on past and present data.
- Use point-in-time joins to seamlessly merge data from different data sources at a single point in history, unlocking deeper insights and more accurate predictions.
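
A point-in-time (as-of) join of this kind can be sketched with stdlib Python – for each left-side event, attach the latest right-side value known at that moment. This is an illustrative toy, not Kaskada's implementation:

```python
from bisect import bisect_right

def asof_join(left, right):
    """For each (ts, value) in `left`, attach the latest `right` value at or before ts."""
    right_ts = [ts for ts, _ in right]
    out = []
    for ts, lval in left:
        i = bisect_right(right_ts, ts)
        rval = right[i - 1][1] if i else None  # None when no right value exists yet
        out.append((ts, lval, rval))
    return out

purchases = [(5, "book"), (12, "game")]
prices    = [(1, 9.99), (10, 12.50)]  # price updates over time
print(asof_join(purchases, prices))
# [(5, 'book', 9.99), (12, 'game', 12.5)]
```

The join never looks into the future: the purchase at time 5 sees the price from time 1, not the later update at time 10 – exactly the property that keeps backtests honest.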

Kaskada ties together the modern real-time AI stack, providing a data foundation for developing and operating AI applications.

## Join the Community

We believe in the transformative power of real-time AI and the possibilities it holds. We believe that real-time data will allow AI to go beyond question-answering to provide proactive, intelligent applications. We want to hear what excites you about real-time and generative AI - [Join our Slack community](https://kaskada.io/community/) and share your use cases, insights and experiences with the New Kaskada.

*"Real-Time AI without the fuss."* Embrace the future with Kaskada.