Commit

0.1.2
marsupialtail committed Oct 29, 2022
1 parent d6311b9 commit 4988cb1
Showing 31 changed files with 437 additions and 190 deletions.
2 changes: 1 addition & 1 deletion docs/docs/cloud.md
@@ -1,6 +1,6 @@
# Setting up Quokka for EC2

To use Quokka for EC2, you need to (at minimum) have an AWS account with permissions to launch instances and create new security groups. You will probably run into issues since everybody's AWS setup is a little bit different, so please email: [email protected] or [Discord](https://discord.gg/YKbK2TVk).
To use Quokka for EC2, you need to (at minimum) have an AWS account with permissions to launch instances and create new security groups. You will definitely run into issues since everybody's AWS setup is a little bit different, so please email: [email protected] or [Discord](https://discord.gg/6ujVV9HAg3).

Quokka requires a security group that allows inbound and outbound connections to ports 5005 (Flight), 6379 (Ray) and 6800 (Redis) from IP addresses within the cluster. For simplicity, you can just enable all inbound and outbound connections from all IP addresses. The easiest way to make this is to manually create an instance on EC2 through the dashboard, e.g. t2.micro, and manually add rules to the security group EC2 assigns that instance. Then you can either copy that security group to a new group, or keep using that modified security group for Quokka. There must be an automated way to do this in the AWS CLI, but I am too lazy to figure it out. If you want to tell me how to do it, I'll post the steps here and buy you a coffee.
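
One way to automate the security group setup is sketched below with boto3 rather than the AWS CLI. This is a minimal sketch under stated assumptions, not something Quokka ships: the helper names, the group name `quokka-cluster`, and the VPC id argument are all hypothetical. It opens the three ports listed above to members of the same security group only. A newly created VPC security group already allows all outbound traffic by default, so only ingress rules are added.

```python
"""Sketch: ingress rules for the ports named in the text (5005, 6379, 6800)."""

# Ports and service labels taken from the paragraph above.
QUOKKA_PORTS = {5005: "Flight", 6379: "Ray", 6800: "Redis"}


def build_ingress_rules(group_id):
    """Return IpPermissions entries that allow TCP traffic on each Quokka
    port from members of the same security group (i.e. within the cluster)."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [
                {"GroupId": group_id, "Description": f"Quokka {service}"}
            ],
        }
        for port, service in sorted(QUOKKA_PORTS.items())
    ]


def create_quokka_security_group(vpc_id, name="quokka-cluster"):
    """Create the group and attach the rules. Requires boto3 and configured
    AWS credentials; `name` and `vpc_id` are placeholders (assumptions)."""
    import boto3  # imported lazily so build_ingress_rules works without boto3

    ec2 = boto3.client("ec2")
    group_id = ec2.create_security_group(
        GroupName=name,
        Description="Intra-cluster traffic for Quokka",
        VpcId=vpc_id,
    )["GroupId"]
    ec2.authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=build_ingress_rules(group_id)
    )
    return group_id
```

Calling `create_quokka_security_group("vpc-...")` with your own VPC id would return the new group's id, which can then be passed wherever a security group is expected. Restricting the source to the group itself is tighter than opening the ports to all IP addresses, at the cost of requiring every cluster node to be in that group.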

40 changes: 40 additions & 0 deletions docs/docs/different.md
@@ -0,0 +1,40 @@
# How is Quokka different from ...?

## Spark

First I have to say Matei is somewhat of a God, and Spark's design choices are simply ingenious in many cases. Most of its ingenuity is not apparent until you try to design your own system to beat its performance, which I had the good fortune of stumbling upon doing.

Now that I have paid homage to my forebears, let me say that Quokka and Spark are very similar in terms of what they do, but there are some important differences. Spark's core abstraction is a collection of data partitions. You operate on those data partitions in stages. One stage must complete before the next one starts. Quokka's core abstraction is a stream of data partitions. You can consume a data partition as soon as it's produced. As a result, multiple "stages" can be overlapped and pipelined in Quokka, leading to higher performance.
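
The difference in ordering can be illustrated with plain Python generators. This is a toy sketch, not Quokka or Spark code: the function names are made up for illustration, and since it runs on a single thread it shows only when each partition becomes available to the next step, not real parallel overlap.

```python
def read_partitions(log):
    """Produce three small partitions, recording when each one is read."""
    for i in range(3):
        log.append(f"read {i}")
        yield list(range(i * 3, i * 3 + 3))


def double(partition):
    """A stand-in for a downstream operator."""
    return [x * 2 for x in partition]


def run_staged(log):
    """Stage-at-a-time: every partition is read before any is processed."""
    partitions = list(read_partitions(log))  # barrier: stage 1 must finish
    results = []
    for i, partition in enumerate(partitions):
        log.append(f"map {i}")
        results.append(double(partition))
    return results


def run_streaming(log):
    """Streaming: each partition is processed as soon as it is produced."""
    results = []
    for i, partition in enumerate(read_partitions(log)):
        log.append(f"map {i}")
        results.append(double(partition))
    return results
```

Both functions return the same results. The staged log reads all three partitions before the first `map`, while the streaming log alternates `read` and `map`, which is what lets a real streaming engine overlap the two on separate workers.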

![Quokka Stream](tpch-parquet.svg)

Quokka's DataStream API resembles Spark's DataFrame API, but is not feature-complete yet. Importantly, Quokka doesn't yet support SQL input, though it will in the near future. Like Spark, Quokka's API is lazy. Like Spark, Quokka has a logical plan optimizer, though it is truly a baby compared to the gorilla-sized Spark Catalyst Optimizer.

Quokka is written in Python completely on top of Ray, and integrates with Ray Data. I am collaborating with the Ray Data team. If you **run complicated Python UDFs, have a SQL-ish pipeline that doesn't fit Spark well (e.g. time series/feature engineering workloads), or already use Ray**, Quokka might be worth keeping on your radar.

Quokka is not fault tolerant, though it will be by the end of 2022. This is how I intend to be collecting my PhD, so you can be pretty darn sure it will happen.

Finally, Quokka is written by one Stanford PhD student, while Spark has billions of dollars behind it. Obviously Quokka in its current state doesn't seek to displace Spark.

Eventually, Quokka aims to be synergistic to Spark by supporting workloads the SparkSQL engine doesn't do too well, like time series or feature backfilling, on the same data lake based on open-source formats like Parquet. Quokka can do these a lot more efficiently due to its streaming-execution model and Python-based flexibility.

## Modin/Dask

Quokka is a lot faster, or aims to be. I don't have benchmark numbers here, though I have found these systems to be slower than Spark.

On the other hand, Quokka does not aim to support things like machine learning training (Dask), or dataframe pivots (Modin). Quokka also doesn't seek to religiously obey the Pandas API, whose eager execution model I think is incompatible with performance in modern systems. Dr. Petersohn will say Quokka then doesn't offer a "dataframe" API. I agree -- that's not Quokka's goal.

## Pandas/Polars/DuckDB

You should be using these solutions if you have less than 100GB of data. Pandas is the starter pack for data scientists, but I really encourage people to check out Polars, which is a Rust/Arrow-based implementation with pretty much the same API that's **A LOT FASTER**. I sponsor Polars on GitHub, and maybe you should too. Of course if all you want to do is SQL, then DuckDB can be a good choice.

Quokka is heavily integrated with Polars. Indeed in Quokka, if you attempt to read a data source with less than 10MB of data, it will be materialized directly as a Polars DataFrame because that's probably what you want to do anyway. Quokka's core abstraction is simply a stream of Polars DataFrames.


## Ray Data/DaFt/PetaStorm

Recently there have been several attempts to bring data lake computing to unstructured datasets like images or natural language. Most prominent are probably DaFt by Eventual AI and PetaStorm by Uber. They define their own extension types for unstructured data, and try to make executing machine learning models in data pipelines efficient.

Although you can certainly use Quokka to do what those libraries do, Quokka does not focus on this application. Instead Quokka seeks to integrate with those libraries by handling the upstream structured data ETL, like joining feature tables to observations tables etc.

Of course, if your architecture is such that you are using a separate inference server with its own compute resources to conduct the machine learning, and all you have to do in your data pipeline is make RPC calls, then Quokka can definitely fulfill your needs for "unstructured ETL". Quokka just doesn't prioritize executing these deep learning functions natively inside your data pipeline.
10 changes: 3 additions & 7 deletions docs/docs/index.md
@@ -16,13 +16,9 @@ What's even better than being cheap and fast is the fact that since Quokka is Py

Another great advantage is that a streaming data paradigm is more in line with how data arrives in the real world, making it easy to bridge your data application to production, or conduct time-series backfilling on your historical data.

You develop with Quokka locally, and deploy to cloud (currently AWS) with a single line of code change. Quokka is specifically designed for the following workloads.
You develop with Quokka locally, and deploy to cloud (currently AWS) with a single line of code change. Quokka is currently designed to target **SQLish data engineering workloads on data lakes.** You can try Quokka if you want to speed up some Spark data jobs, or if you want to implement "stateful Python UDFs" in your SQL pipeline (e.g. forward computing some feature based on historical data), which is kind of a nightmare in Spark. Quokka can also typically achieve much better performance than Spark on pure SQL workloads when input data comes from cloud storage, in Parquet or CSV format.

1. **SQLish data engineering workloads on data lake.** You can try Quokka if you want to speed up some Spark data jobs, or if you want to implement "stateful Python UDFs" in your SQL pipeline, which is kind of a nightmare in Spark. (e.g. forward computing some feature based on historical data) Quokka can also typically achieve much better performance than Spark on pure SQL workloads when input data comes from cloud storage, especially if the data is in CSV format.

**The drawback is Quokka currently does not support SQL interface, so you are stuck with a dataframe-like DataStream API.** However SQL optimizations such as predicate pushdown and early projection are implemented.

2. (support forthcoming) **ML engineering pipelines on large unstructured data datasets.** Since Quokka is Python-native, it interfaces perfectly with the Python machine learning ecosystem. **No more JVM troubles.** Unlike Spark, Quokka also will let you precisely control the placement of your stateful operators on machines, preventing GPU out-of-memory and improving performance by reducing contention. Support for these workloads are still in the works. If you are interested, please drop me a note: [email protected] or [Discord](https://discord.gg/YKbK2TVk).
**The drawback is Quokka currently does not support a SQL interface, so you are stuck with a dataframe-like DataStream API.** However, SQL optimizations such as predicate pushdown and early projection are implemented. I also plan to support Delta Lake and Apache Iceberg in the near future. If you are interested, please drop me a note: [email protected] or [Discord](https://discord.gg/6ujVV9HAg3).

## Roadmap

@@ -32,4 +28,4 @@ You develop with Quokka locally, and deploy to cloud (currently AWS) with a sing
4. **Time Series Package.** Quokka will support point-in-time joins and asof joins natively by Q4 2022. This will be useful for feature backtesting, etc.
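
For readers unfamiliar with the term, an asof join attaches to each row the most recent matching row from another table at or before its timestamp. The sketch below is a plain-Python illustration of that idea only, not Quokka's planned API: the function name and the tuple layout are assumptions made for the example.

```python
from bisect import bisect_right


def asof_join(observations, features):
    """Backward asof join.

    observations: list of (timestamp, label) tuples, in any order.
    features: list of (timestamp, value) tuples, sorted by timestamp.
    Each observation gets the latest feature value whose timestamp is
    <= the observation's timestamp, or None if no such feature exists.
    """
    feature_times = [ts for ts, _ in features]
    joined = []
    for ts, label in observations:
        # Index of the last feature with timestamp <= ts, or -1 if none.
        idx = bisect_right(feature_times, ts) - 1
        value = features[idx][1] if idx >= 0 else None
        joined.append((ts, label, value))
    return joined
```

Because only features at or before the observation time are used, no future information leaks into the joined row, which is the property feature backtesting relies on.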

## Contact
If you are interested in trying out Quokka or hit any problems (any problems at all), please contact me at [email protected] or [Discord](https://discord.gg/YKbK2TVk). I will try my best to make Quokka work for you.
If you are interested in trying out Quokka or hit any problems (any problems at all), please contact me at [email protected] or [Discord](https://discord.gg/6ujVV9HAg3). I will try my best to make Quokka work for you.
14 changes: 3 additions & 11 deletions docs/docs/install.md
@@ -1,25 +1,17 @@
# Installation

If you plan on trying out Quokka for whatever reason, I'd love to hear from you. Please send an email to [email protected] or join the [Discord](https://discord.gg/YKbK2TVk).
If you plan on trying out Quokka for whatever reason, I'd love to hear from you. Please send an email to [email protected] or join the [Discord](https://discord.gg/6ujVV9HAg3).

Quokka can be installed as a pip package:
~~~bash
pip3 install pyquokka
~~~

**However it needs the latest version of Redis (at least 7.0)**, which you can get by running the following:
~~~bash
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg

echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list

sudo apt-get update
sudo apt-get install redis
~~~
**Please note that Quokka has problems on Mac M1 laptops. It is tested to work on x86 Ubuntu environments.**

If you only plan on running Quokka locally, you are done. Here is a [10 min lesson](simple.md) on how it works.

If you are planning on reading files from S3, you need to install the awscli and you have your credentials set up.
If you are planning on reading files from S3, you need to install the awscli and have your credentials set up.

If you plan on using Quokka for cloud by launching EC2 clusters, there's a bit more setup that needs to be done. Currently Quokka only provides support for AWS. Quokka provides a utility library under `pyquokka.utils` which allows you to manage clusters and connect to them. It assumes that awscli is configured locally and you have a keypair and a security group with the proper configurations. To set these things up, you can follow the [AWS guide](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html).

2 changes: 1 addition & 1 deletion docs/docs/tpch-parquet.svg
2 changes: 2 additions & 0 deletions docs/mkdocs.yml
@@ -6,12 +6,14 @@ nav:
- Cartoons: started.md
- Installation: install.md
- Setting Up Cloud Cluster: cloud.md
- How is Quokka different from ...?: different.md
- Tutorials:
- DataStream API: simple.md
- TaskGraph API: tutorial.md
- Dataframe API reference:
- QuokkaContext: quokka_context.md
- DataStream: datastream.md


theme: readthedocs

4 changes: 4 additions & 0 deletions docs/site/404.html
@@ -49,6 +49,10 @@
<li class="toctree-l1"><a class="reference internal" href="/cloud/">Setting Up Cloud Cluster</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="/different/">How is Quokka different from ...?</a>
</li>
</ul>
<p class="caption"><span class="caption-text">Tutorials</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="/simple/">DataStream API</a>
4 changes: 4 additions & 0 deletions docs/site/api/index.html
@@ -56,6 +56,10 @@
<li class="toctree-l1"><a class="reference internal" href="../cloud/">Setting Up Cloud Cluster</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../different/">How is Quokka different from ...?</a>
</li>
</ul>
<p class="caption"><span class="caption-text">Tutorials</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../simple/">DataStream API</a>
10 changes: 7 additions & 3 deletions docs/site/cloud/index.html
@@ -58,6 +58,10 @@
</ul>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../different/">How is Quokka different from ...?</a>
</li>
</ul>
<p class="caption"><span class="caption-text">Tutorials</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../simple/">DataStream API</a>
@@ -97,7 +101,7 @@
<div class="section" itemprop="articleBody">

<h1 id="setting-up-quokka-for-ec2">Setting up Quokka for EC2</h1>
<p>To use Quokka for EC2, you need to (at minimum) have an AWS account with permissions to launch instances and create new security groups. You will probably run into issues since everybody's AWS setup is a little bit different, so please email: [email protected] or <a href="https://discord.gg/YKbK2TVk">Discord</a>. </p>
<p>To use Quokka for EC2, you need to (at minimum) have an AWS account with permissions to launch instances and create new security groups. You will definitely run into issues since everybody's AWS setup is a little bit different, so please email: [email protected] or <a href="https://discord.gg/6ujVV9HAg3">Discord</a>. </p>
<p>Quokka requires a security group that allows inbound and outbound connections to ports 5005 (Flight), 6379 (Ray) and 6800 (Redis) from IP addresses within the cluster. For simplicity, you can just enable all inbound and outbound connections from all IP addresses. The easiest way to make this is to manually create an instance on EC2 through the dashboard, e.g. t2.micro, and manually add rules to the security group EC2 assigns that instance. Then you can either copy that security group to a new group, or keep using that modified security group for Quokka. There must be an automated way to do this in the AWS CLI, but I am too lazy to figure it out. If you want to tell me how to do it, I'll post the steps here and buy you a coffee.</p>
<p>You also need to generate a pem key pair. The easiest way to do this, again, is to start a t2.micro in the console and using the dashboard. Save the pem key somewhere and write down the absolute path.</p>
<p>After you have the security group and you can use the <code>QuokkaClusterManager</code> in <code>pyquokka.utils</code> to spin up a cluster. The code to do this:</p>
@@ -119,7 +123,7 @@ <h1 id="setting-up-quokka-for-ec2">Setting up Quokka for EC2</h1>
</div><footer>
<div class="rst-footer-buttons" role="navigation" aria-label="Footer Navigation">
<a href="../install/" class="btn btn-neutral float-left" title="Installation"><span class="icon icon-circle-arrow-left"></span> Previous</a>
<a href="../simple/" class="btn btn-neutral float-right" title="DataStream API">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../different/" class="btn btn-neutral float-right" title="How is Quokka different from ...?">Next <span class="icon icon-circle-arrow-right"></span></a>
</div>

<hr/>
@@ -149,7 +153,7 @@ <h1 id="setting-up-quokka-for-ec2">Setting up Quokka for EC2</h1>
<span><a href="../install/" style="color: #fcfcfc">&laquo; Previous</a></span>


<span><a href="../simple/" style="color: #fcfcfc">Next &raquo;</a></span>
<span><a href="../different/" style="color: #fcfcfc">Next &raquo;</a></span>

</span>
</div>
4 changes: 4 additions & 0 deletions docs/site/datastream/index.html
@@ -56,6 +56,10 @@
<li class="toctree-l1"><a class="reference internal" href="../cloud/">Setting Up Cloud Cluster</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../different/">How is Quokka different from ...?</a>
</li>
</ul>
<p class="caption"><span class="caption-text">Tutorials</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../simple/">DataStream API</a>