A demo of CrateDB's real-time analytics capabilities.
The solution consists of three major layers running in the AWS Cloud:
Purpose: Simulate and send temperature data events into the system.

- AWS EC2: Hosts a lightweight data generator script that loads temperature readings from a source dataset. These readings are serialized into JSON events.
- Event Stream (Amazon MSK / Kafka): The producer publishes these temperature events to a Kafka topic. Kafka provides scalable, fault-tolerant buffering between producers and consumers.

Flow: EC2 → Kafka (MSK)
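To make this layer concrete, here is a minimal sketch of such a producer. It assumes the kafka-python client; the broker address is a placeholder, and the actual producer.py in this repository may be structured differently:

```python
# Minimal sketch: publish one temperature reading as a JSON event to Kafka (MSK).
# Assumes the kafka-python client; bootstrap servers are placeholders.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9092"],  # placeholder MSK broker
    value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
)

# One JSON document per temperature reading (see the example document further below).
reading = {
    "timestamp": time.time_ns(),
    "temperature": 295.62,
    "latitude": 40.9,
    "longitude": 31.1,
}

producer.send("dev-1", value=reading)  # topic name used by this demo
producer.flush()
```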
Purpose: Listen for new temperature events, process them, and write batches into CrateDB.

- AWS Lambda: Subscribes to the Kafka topic and triggers upon new events. Processes messages and groups them into batches.
- CrateDB Cloud: The Lambda function writes the processed temperature data into CrateDB.

Flow: Kafka → Lambda → CrateDB
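A minimal sketch of this consumer is shown below. It assumes the event format that Lambda receives from an MSK trigger (base64-encoded record values under records) and CrateDB's HTTP endpoint (/_sql) with bulk arguments; the table name is hypothetical, and the connection settings come from the environment variables documented later in this README:

```python
# Sketch of the ingest Lambda: decode MSK records, batch them, and bulk-insert into CrateDB.
# The table name "temperature_readings" is a placeholder; connection settings are read from
# the Lambda environment variables documented further below.
import base64
import json
import os

import requests

CRATEDB_URL = f"https://{os.environ['CRATEDB_HOST']}:{os.environ.get('CRATEDB_PORT', '4200')}/_sql"
CRATEDB_AUTH = (os.environ["CRATEDB_USER"], os.environ["CRATEDB_PASS"])

INSERT_STMT = (
    "INSERT INTO temperature_readings (ts, temperature, latitude, longitude) "
    "VALUES (?, ?, ?, ?)"
)


def handler(event, context):
    # MSK events group records per topic-partition; each record value is base64-encoded JSON.
    batch = []
    for records in event.get("records", {}).values():
        for record in records:
            doc = json.loads(base64.b64decode(record["value"]))
            batch.append([doc["timestamp"], doc["temperature"], doc["latitude"], doc["longitude"]])

    if batch:
        # One bulk insert per invocation keeps the write path efficient.
        response = requests.post(
            CRATEDB_URL,
            json={"stmt": INSERT_STMT, "bulk_args": batch},
            auth=CRATEDB_AUTH,
            timeout=10,
        )
        response.raise_for_status()

    return {"inserted": len(batch)}
```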
Purpose: Visualize the temperature data in real time.

- AWS EC2 (Frontend Host): Runs a Grafana instance, accessible to users via a browser.
- Grafana Dashboard: Connects directly to CrateDB using its PostgreSQL endpoint and executes queries to fetch and visualize temperature metrics such as:
  - Current and historical readings
  - Geo/time-based charts
- Users: Access the Grafana dashboard through a secure web interface.

Flow: CrateDB → Grafana → Users
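The queries Grafana runs can also be issued directly against CrateDB's PostgreSQL endpoint, which is handy for troubleshooting. Below is a sketch using psycopg2; the host, credentials, table, and column names are placeholders:

```python
# Sketch: query CrateDB over its PostgreSQL wire protocol endpoint, just as Grafana does.
# Host, credentials, and the temperature_readings table are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="your-cluster.cratedb.net",  # CrateDB Cloud host (placeholder)
    port=5432,                        # PostgreSQL endpoint
    user="grafana",
    password="secret",
    dbname="doc",
)

with conn.cursor() as cur:
    # Example of a dashboard-style query: hourly average temperature.
    cur.execute(
        """
        SELECT date_trunc('hour', ts) AS hour, avg(temperature) AS avg_temp
        FROM temperature_readings
        GROUP BY date_trunc('hour', ts)
        ORDER BY 1 DESC
        LIMIT 24
        """
    )
    for hour, avg_temp in cur.fetchall():
        print(hour, avg_temp)

conn.close()
```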
We use publicly available data from the Climate Data Store. The ERA5-Land data set includes atmospheric variables, such as air temperature and air humidity from around the globe.
data/parser.py generates a report for a given date range, parses the retrieved NetCDF file, and converts it to JSON documents. Below is an example of the final JSON document.
```json
{
  "timestamp": 1756684800000000000,
  "temperature": 295.6215515136719,
  "latitude": 40.90000000000115,
  "longitude": 31.100000000000172
}
```

We retrieve data from the Climate Data Store API, so you will need access to it. Please obtain your personal CDS API key and store it in ~/.cdsapirc:
echo "url: https://cds.climate.copernicus.eu/api
key: INSERT-YOUR-KEY" > ~/.cdsapircWe use Amazon Managed Streaming for Apache Kafka (MSK). Please set up credentials using aws configure.
To run the producer, we need to set up a virtual Python environment and install dependencies:
```shell
cd data
python3 -m venv .venv
source .venv/bin/activate
pip3 install -U -r requirements.txt
```

Next, copy the example .env file and adjust values as needed:
```shell
cp .env.example .env
```

To run the producer, simply execute it. If you want to adjust the date range that will be downloaded, please edit the main method accordingly.
```shell
python3 producer.py
```

The producer has embedded data; by default, it will retrieve data from the Climate Data Store for Germany. To change this, update the following line in producer.py:
```python
parser = Parser("DEU")
```
All ISO_A3 country codes should work here. Note that some countries are more complex and have many country bounds, some of which lie a distance away from the others.
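For reference, the conversion from the downloaded NetCDF file to the JSON documents shown earlier could look roughly like the sketch below. It assumes xarray/pandas and the ERA5-Land 2 m temperature variable t2m; the actual parser.py may use different variable names and additionally clip the data to the selected country bounds:

```python
# Rough sketch of converting an ERA5-Land NetCDF file into JSON documents.
# Assumes xarray/pandas and the 2 m temperature variable "t2m"; ERA5-Land reports
# temperature in Kelvin, which matches the example document shown earlier.
import json

import pandas as pd
import xarray as xr

ds = xr.open_dataset("era5_land.nc")  # file retrieved from the Climate Data Store

df = ds["t2m"].to_dataframe().dropna().reset_index()
# Depending on the download, the time coordinate is called "time" or "valid_time".
time_column = "valid_time" if "valid_time" in df.columns else "time"

documents = []
for row in df.itertuples(index=False):
    documents.append(
        {
            "timestamp": int(pd.Timestamp(getattr(row, time_column)).value),  # epoch nanoseconds
            "temperature": float(row.t2m),
            "latitude": float(row.latitude),
            "longitude": float(row.longitude),
        }
    )

print(json.dumps(documents[0]))
```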
The trigger is configured to fire on a single new Kafka (MSK) record (batch size 1). To restart the ingest from the beginning, delete the existing trigger and create a new one (it's very straightforward). There is only one option for the MSK cluster; make sure authentication is set, and, to ignore bad test data, set the starting point to the timestamp 2025-10-14T00:00:00.000Z. Set the topic to dev-1 (or a different topic as needed).
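If you prefer the command line over the console, an equivalent trigger can be created with boto3 along the lines of the sketch below; the ARNs and the authentication type are placeholders that depend on your MSK setup:

```python
# Sketch: recreate the MSK trigger (event source mapping) with boto3 instead of the console.
# Cluster ARN, function name, and secret ARN are placeholders; pick the authentication
# type that matches your MSK cluster configuration.
from datetime import datetime, timezone

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kafka:us-east-1:123456789012:cluster/real-time-demo/...",  # placeholder
    FunctionName="real-time-demo-app-msktocratepython",  # placeholder
    Topics=["dev-1"],
    BatchSize=1,  # fire on a single new record
    StartingPosition="AT_TIMESTAMP",
    StartingPositionTimestamp=datetime(2025, 10, 14, tzinfo=timezone.utc),  # skip bad test data
    SourceAccessConfigurations=[
        {"Type": "SASL_SCRAM_512_AUTH", "URI": "arn:aws:secretsmanager:...:secret:AmazonMSK_..."},  # placeholder
    ],
)
```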
This repository uses black for consistent formatting of Python code. Please check your code before committing. To set up black, run these steps within your virtual Python environment:
```shell
cd data
pip3 install -U -r requirements-dev.txt
black .
```

From the msk-to-crate-python directory, run:
```shell
sam build && sam deploy
```
This will first build and then deploy the Lambda code (including any pip dependencies) to AWS Lambda. It will be viewable at:
https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/real-time-demo-app-msktocratepython-zHHmuN8vwScx?subtab=envVars&tab=configure
On macOS, the sam CLI can be installed with Homebrew (aws-sam-cli).
The build creates a hidden directory, .aws-sam (or it should!).
Also, there are several environment variables that need defining against the Lambda. These are:

- CRATEDB_DB: CrateDB database name (typically `doc`)
- CRATEDB_HOST: CrateDB host name
- CRATEDB_USER: CrateDB username
- CRATEDB_PASS: CrateDB password
- CRATEDB_PORT: CrateDB port for HTTP, typically 4200
- SOURCE_TOPIC: Topic name, excluding the `records` first part of the path
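A quick way to verify these values is a small smoke test against CrateDB's HTTP endpoint; the sketch below reads the same variables and runs a trivial query using requests:

```python
# Smoke test: read the Lambda's CrateDB settings from the environment and run SELECT 1
# against CrateDB's HTTP endpoint.
import os

import requests

url = f"https://{os.environ['CRATEDB_HOST']}:{os.environ.get('CRATEDB_PORT', '4200')}/_sql"
auth = (os.environ["CRATEDB_USER"], os.environ["CRATEDB_PASS"])

response = requests.post(url, json={"stmt": "SELECT 1"}, auth=auth, timeout=10)
response.raise_for_status()
print(response.json())
```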
To generate a deployable ZIP archive containing the Lambda function's code and dependencies, run the corresponding bash script:
```shell
cd msk-to-cratedb-python
./create_zip_package.sh
```

The resulting ZIP file can then be used to deploy code to a newly created Lambda function. Make sure it is kept as a .zip file and not extracted.
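If you are not uploading the archive through the console, one option is boto3's update_function_code, sketched below; the function and file names are placeholders:

```python
# Sketch: upload the packaged ZIP to an existing Lambda function with boto3.
# Function name and archive file name are placeholders.
import boto3

lambda_client = boto3.client("lambda")

with open("msk-to-cratedb-python.zip", "rb") as archive:
    lambda_client.update_function_code(
        FunctionName="real-time-demo-app-msktocratepython",  # placeholder
        ZipFile=archive.read(),
    )
```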
