A naive data visualization and analysis tool for F1 on-board telemetry data.
In both minor motorsport categories and racing e-sports there seems to be no easily accessible tool to collect, visualize and analyze live telemetry data. Users often have to perform complex installation tasks to run existing tools on their own machines, which may not be powerful enough to handle real-time data stream analysis.
This work proposes a possible baseline architecture to implement a data visualization and analysis tool for on-board telemetry data, completely based on cloud technologies and distributed systems. The proposed system falls under the Software-as-a-Service (SaaS) paradigm and relies on Infrastructure-as-a-Service (IaaS) cloud solutions to provide hardware support to its software components.
For more info, please refer to the Project report.
This section lists all major frameworks/libraries used in this project.
Data source and front-end:
- Python
- Streamlit
Back-end Apache services:
- Apache ZooKeeper
- Apache Kafka
- Apache Spark
To get your system up and running, follow these simple steps.
First, you need an account on a cloud platform that gives you access to cluster services. We used Google Cloud Dataproc clusters, but any other cloud provider should work.
By following the next section, you will end up with the architecture outlined below.
Make sure to have two clusters on which you can deploy the following technologies:
- Apache ZooKeeper (v. 3.7.1) and Apache Kafka (v. 3.1.0) on one cluster.
- Apache Spark (v. 3.1.2) on the other cluster.
- ZooKeeper is required in order to run Kafka. The following example shows how to properly set up, on each cluster node, the `zoo.cfg` file in the `conf` directory under the ZooKeeper home, to run a ZooKeeper ensemble over a three-node cluster:

  ```
  tickTime=2000
  dataDir=/var/lib/zookeeper
  clientPort=2181
  initLimit=20
  syncLimit=5
  server.1=hostnameA:2888:3888
  server.2=hostnameB:2888:3888
  server.3=hostnameC:2888:3888
  ```
- On each cluster node, the following key properties must be specified in the `server.properties` file, located in the `config` directory under the Kafka home (a quick connectivity check is sketched right after this list):

  ```
  broker.id=UID                                      # where UID is a unique ID for this broker
  listeners=PLAINTEXT://internalIP:9092
  advertised.listeners=PLAINTEXT://externalIP:9092
  zookeeper.connect=hostnameA:2181,hostnameB:2181,hostnameC:2181/kafka_root_znode
  ```
- If you're using Google Cloud Dataproc clusters, you don't need to manually install and configure Spark, as it is already included in the cluster's VM image.
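Once both clusters are configured, it can be useful to check that the Kafka brokers are actually reachable through their advertised listeners before wiring up the whole pipeline. The sketch below uses the `kafka-python` package (an assumption; any Kafka client would do), with placeholder broker address and topic name:

```python
# kafka_smoke_test.py -- quick round-trip check against one broker's advertised listener.
# Assumes `pip install kafka-python`; broker address and topic name are placeholders.
from kafka import KafkaProducer, KafkaConsumer

BROKER = "externalIP:9092"   # one of the advertised.listeners addresses
TOPIC = "telemetry-test"     # any test topic (auto-created if the broker allows it)

# Produce a single test message.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(TOPIC, value=b"hello from the telemetry pipeline")
producer.flush()

# Read it back to confirm the round trip works.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # give up after 5 s of silence
)
for message in consumer:
    print("received:", message.value)
    break
```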
Before launching the Streamlit client, make sure that:
- Both Kafka and Spark clusters are up and running.
- The correct broker IPs and topic names are specified in `configuration.ini` (an illustrative layout appears in the client sketch further below).
- The data source is active and publishing on the correct Kafka topic. For test purposes, you could run the data stream producer process provided in this repo:

  `python ./datastream_producer.py`
- Start the Spark streaming analysis script on the Spark cluster (a sketch of the kind of job this script runs is shown after this list):

  `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 ./structured_stream_process.py --broker <IP:port> --intopic <topicName> --outtopic <topicName>`
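The real analysis logic lives in `structured_stream_process.py`. Purely for orientation, a Spark Structured Streaming job with the command-line interface shown above typically has the shape sketched below; the pass-through transformation, checkpoint path and argument parsing are assumptions, not the repository's actual processing:

```python
# structured_stream_sketch.py -- illustrative only, NOT the repo's structured_stream_process.py.
# Reads raw telemetry from one Kafka topic and writes (here: unchanged) records to another.
import argparse
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

parser = argparse.ArgumentParser()
parser.add_argument("--broker", required=True)    # e.g. externalIP:9092
parser.add_argument("--intopic", required=True)   # raw telemetry topic
parser.add_argument("--outtopic", required=True)  # processed/alert topic
args = parser.parse_args()

spark = SparkSession.builder.appName("telemetry-stream").getOrCreate()

# Subscribe to the input topic; each record's payload arrives in the binary `value` column.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", args.broker)
    .option("subscribe", args.intopic)
    .load()
)

# Placeholder processing step: forward the payload as a string.
processed = raw.select(F.col("value").cast("string").alias("value"))

# Write results back to Kafka; a checkpoint location is mandatory for the Kafka sink.
query = (
    processed.writeStream.format("kafka")
    .option("kafka.bootstrap.servers", args.broker)
    .option("topic", args.outtopic)
    .option("checkpointLocation", "/tmp/telemetry-checkpoint")
    .start()
)
query.awaitTermination()
```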
Finally, you are ready to run the client:
`streamlit run ./main.py`
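For orientation only, here is a heavily simplified sketch of what a Streamlit client such as `main.py` might do: read the broker and topic from `configuration.ini` (the section and key names are assumptions) and stream incoming records into a live chart. It is not the repository's actual client:

```python
# client_sketch.py -- minimal Streamlit client sketch; config keys and record fields are assumptions.
import configparser
import json

import pandas as pd
import streamlit as st
from kafka import KafkaConsumer

# Hypothetical configuration.ini layout:
# [kafka]
# broker = externalIP:9092
# topic = processed-telemetry
config = configparser.ConfigParser()
config.read("configuration.ini")
broker = config["kafka"]["broker"]
topic = config["kafka"]["topic"]

st.title("F1 on-board telemetry (sketch)")
chart = st.line_chart()  # empty chart, filled as messages arrive

consumer = KafkaConsumer(topic, bootstrap_servers=broker)
for message in consumer:
    record = json.loads(message.value)        # e.g. {"speed": 312.4, "rpm": 11200}
    chart.add_rows(pd.DataFrame([record]))    # append the new sample to the chart
    # The real client would bound this loop / rely on Streamlit reruns instead.
```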
These are some of the features we would like to add to this project.
- Add a real-time choice of the anomaly detection threshold
- Multi-driver support (this involves re-organizing the Kafka topics)
- Add statefulness to Streamlit (see the session-state sketch below)
  - Counter variables
  - Data dict
- Use MLlib in the Spark Structured Streaming data analysis module
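The Streamlit statefulness items would most likely build on `st.session_state`. Here is a minimal sketch, with placeholder variable names, of what "counter variables" and a "data dict" that survive script reruns could look like:

```python
# session_state_sketch.py -- illustration of Streamlit statefulness via st.session_state.
# Variable names ("lap_count", "telemetry") are placeholders, not existing project code.
import streamlit as st

# Counter variable that persists across script reruns in the same browser session.
if "lap_count" not in st.session_state:
    st.session_state.lap_count = 0

# Data dict for accumulating values between reruns.
if "telemetry" not in st.session_state:
    st.session_state.telemetry = {}

if st.button("New lap"):
    st.session_state.lap_count += 1
    st.session_state.telemetry[st.session_state.lap_count] = {"max_speed": None}

st.write("Laps recorded:", st.session_state.lap_count)
```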
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Andrea Lombardi - LinkedIn
- Vincenzo Silvio - LinkedIn
- Ciro Panariello - LinkedIn
- Vincenzo Capone - LinkedIn
Thanks to O'Reilly books about:
Infrastructure-as-a-Service used for this project:
- Google Cloud Platform (Dataproc clusters)