Custom Spark-Kafka Cluster
This project sets up a custom Spark-Kafka cluster using Docker and Docker Compose. It includes Apache Spark, Apache Kafka, PostgreSQL, Hive Metastore, LocalStack for S3, Prometheus, and Grafana.
The repository contains the following key files:
- hive/conf/.hiverc: Hive initialization script that adds the required JAR files.
- hive/conf/hive-site.xml: Configuration file for Hive Metastore.
- hive/conf/jars/: Directory containing necessary JAR files for Hive.
- prometheus/prometheus.yml: Configuration file for Prometheus.
- spark/metrics.properties: Configuration file for Spark metrics.
- spark/run.sh: Script to start a Spark master or worker, or to submit jobs.
- docker-compose.yml: Docker Compose configuration file to set up the entire cluster.
- Dockerfile: Dockerfile to build the Spark cluster image.
To run the cluster you will need:
- Docker
- Docker Compose
First, build the Docker image for the Spark cluster:
docker build -t my-spark-cluster:3.5.0 .
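If the build succeeds, the tagged image should be available locally. Listing it is a quick sanity check before starting the cluster (this assumes docker-compose.yml references the image by this name and tag):

```bash
# Confirm the image was built with the expected name and tag
docker images my-spark-cluster
```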
Start the cluster using Docker Compose:
docker-compose up
This command starts every service defined in docker-compose.yml in the foreground; once the containers are up, the services are exposed at the endpoints listed below.
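For day-to-day use it can be more convenient to run the stack in the background and follow individual service logs. A minimal sketch, assuming docker-compose.yml defines a service named spark-master (adjust to the actual service names in this repository):

```bash
# Start all services detached
docker-compose up -d

# Check that every container is up and healthy
docker-compose ps

# Follow the logs of a single service (service name is an assumption)
docker-compose logs -f spark-master
```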
- Spark Master: http://localhost:9090
- Spark Worker A: http://localhost:9091
- Spark Worker B: http://localhost:9093
- Kafka: Accessible on port 9092
- S3 (LocalStack): Accessible on port 4566
- PostgreSQL: Accessible on port 5432
- Hive Metastore: Accessible on port 9083
- Spark Thrift Server: Accessible on port 10000
- Grafana: http://localhost:3000
- Prometheus: http://localhost:19090
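A few quick ways to exercise these endpoints from the host. Service names, credentials, and tool availability are assumptions based on a typical setup, not guaranteed by this repository:

```bash
# Kafka: list topics from inside the broker container (service name "kafka" is an assumption)
docker-compose exec kafka kafka-topics.sh --bootstrap-server localhost:9092 --list

# LocalStack S3: create a bucket through the S3-compatible endpoint
aws --endpoint-url=http://localhost:4566 s3 mb s3://demo-bucket

# PostgreSQL: open a psql session (user and database are assumptions)
psql -h localhost -p 5432 -U postgres

# Spark Thrift Server: connect with beeline over JDBC
beeline -u jdbc:hive2://localhost:10000
```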
Prometheus is configured to scrape metrics from the Spark Master, Workers, and Executors.
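Spark typically exposes these metrics through the sinks configured in spark/metrics.properties. The sketch below shows what a configuration based on Spark's built-in PrometheusServlet sink commonly looks like (this repository's actual file may differ), followed by a quick check that Prometheus is scraping its targets:

```bash
# Hypothetical contents of spark/metrics.properties using Spark's PrometheusServlet sink
cat <<'EOF'
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
EOF

# Ask Prometheus which targets it is scraping and whether they are up
curl -s http://localhost:19090/api/v1/targets
```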
Grafana is set up to visualize the metrics collected by Prometheus. Access it at http://localhost:3000.
- Ensure that the specified volumes and paths exist and are accessible by Docker.
- Customize the provided configurations as needed for your specific use case.
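After editing any of the configuration files or docker-compose.yml, it can help to validate the merged Compose configuration and rebuild only the affected service. A sketch, with the service name as an assumption:

```bash
# Validate and print the effective Compose configuration
docker-compose config

# Rebuild the image and recreate a single service after changing its configuration
docker-compose up -d --build spark-master
```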