This repo is used in this publication:
M. M. Aseman-Manzar, S. Karimian-Aliabadi, R. Entezari-Maleki, B. Egger and A. Movaghar, "Cost-Aware Resource Recommendation for DAG-Based Big Data Workflows: An Apache Spark Case Study," in IEEE Transactions on Services Computing, vol. 16, no. 3, pp. 1726-1737, 1 May-June 2023, doi: 10.1109/TSC.2022.3203010. https://ieeexplore.ieee.org/abstract/document/9894699
I also have another repo which is helpful for parsing the history logs:
https://github.com/mohsenasm/Python-Spark-Log-Parser
- First run the cluster and go into the resourcemanager container:
docker-compose -f hadoop-docker-compose.yml up -d && docker-compose -f hadoop-docker-compose.yml exec resourcemanager bash
- Then copy a sample file for the wordcount application:
hdfs dfs -mkdir -p /in/ && hdfs dfs -copyFromLocal /opt/hadoop-3.1.1/README.txt /in/
- Run the wordcount application on the cluster:
yarn jar /opt/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /in /out
- See the output with:
hdfs dfs -cat /out/*
and check hadoop history server on http://localhost:8188
- Then copy a sample file for the wordcount application:
- Remove the cluster:
docker-compose -f hadoop-docker-compose.yml down -v
- First run the cluster:
docker-compose -f spark-client-docker-compose.yml up -d --build
- Then go into the spark container:
docker-compose -f spark-client-docker-compose.yml run -p 18080:18080 spark-client bash
- Start the history server:
setup-history-server.sh
- Run the SparkPi application on the yarn cluster:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples*.jar 3
and see run history on http://localhost:18080 - optional Run the SparkPi application in client mode on the yarn cluster:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client examples/jars/spark-examples*.jar 3
and see run history on http://localhost:18080
- Start the history server:
- Remove the cluster:
docker-compose -f spark-client-docker-compose.yml down -v
- First run the cluster:
docker-compose -f spark-client-with-tpcds-docker-compose.yml up -d --build
- Then go into the tpc-ds container:
docker-compose -f spark-client-with-tpcds-docker-compose.yml run tpc-ds /run.sh bash
/run.sh gen_data 1
/run.sh copy_queries
/run.sh gen_ddl 1
for using parquet/run.sh gen_ddl_csv 1
for using csv
- Then go into the spark container:
docker-compose -f spark-client-with-tpcds-docker-compose.yml run -p 18080:18080 spark-client bash
- Start the history server:
setup-history-server.sh
- Copy data to HDFS:
hdfs dfs -mkdir -p /tpc-ds-files/data/parquet_1 && hdfs dfs -copyFromLocal /tpc-ds-files/data/csv_1 /tpc-ds-files/data/csv_1
- Create parquet tables (for steps 4, 5, 6 and 10):
spark-sql --master yarn --deploy-mode client -f /tpc-ds-files/ddl/tpcds_1.sql --name create_db_scale_1
- optional Run sample query:
spark-submit --master yarn --deploy-mode client /root/scripts/query.py -s 1 -q 'SELECT * from (SELECT count(*) from store_returns)' --name 'query for test database creation'
- (Client Mode) Run a TPC-DS query from pre-generated queries with spark-submit:
spark-submit --master yarn --deploy-mode client /root/scripts/query.py -s 1 -lf /tpc-ds-files/pre_generated_queries/query5.sql --name query5_client
- (Client Mode + spark-sql) Run a TPC-DS query from pre-generated queries with spark-sql:
spark-sql --master yarn --deploy-mode client --conf spark.sql.crossJoin.enabled=true -database scale_1 -f /tpc-ds-files/pre_generated_queries/query26.sql --name query26_cluster
- Create csv tables (for step 8):
spark-sql --master yarn --deploy-mode client -f /tpc-ds-files/ddl/tpcds_1_csv.sql --name create_db_scale_1_csv
- (Client Mode + spark-sql + csv_database) Run a TPC-DS query from pre-generated queries with spark-sql:
spark-sql --master yarn --deploy-mode client --conf spark.sql.crossJoin.enabled=true -database scale_1_csv -f /tpc-ds-files/pre_generated_queries/query26.sql --name query26_cluster
- Copy TPC-DS pre-generated queries to HDFS:
hdfs dfs -mkdir -p /tpc-ds-files/pre_generated_queries && hdfs dfs -copyFromLocal /tpc-ds-files/pre_generated_queries /tpc-ds-files/
- (Cluster Mode) Run a TPC-DS query from pre-generated queries with spark-submit:
spark-submit --master yarn --deploy-mode cluster /root/scripts/query.py -s 1 -hf /tpc-ds-files/pre_generated_queries/query40.sql -hf /tpc-ds-files/pre_generated_queries/query52.sql --name query40_and_query52_cluster
- Start the history server:
- Remove the cluster:
docker-compose -f spark-client-with-tpcds-docker-compose.yml down -v
- Run
python3 run_tpcds.py 1 3 5 10
. Then history will be onhdfs:///spark-history
and on./output/spark-history
in the host. - Remove the cluster:
docker-compose -f spark-client-with-tpcds-docker-compose.yml down -v
- Change directory to the
swarm
directory in root of the project. - Preparations:
- Setup swarm manager with
docker swarm init --advertise-addr <the_manager_ip_address>
. This command will print adocker swarm join
command, copy it. - Run the
join command
into each worker. - On the swarm manager, for each node, assign label
node-id
. (docker node update --label-add node-id=1 node1_hostname
) - Update file
swarm/spark-swarm-client.yml
.
- Setup swarm manager with
- Run swarm cluster with
docker stack deploy -c spark-swarm.yml tpcds
and wait until all services indocker service ls
be running. - Run
ADDITIONAL_SPARK_CONFIG="--num-executors 25 --executor-cores 1 --executor-memory 1G" USE_CSV="False" python3 run_tpcds_on_swarm.py 1 10 20 40 35 70 100 120 135 150
. Then history will be onhdfs:///spark-history
and on./output/spark-history
in the host. - Remove the cluster:
- Remove all services:
docker stack rm tpcds && docker-compose -f spark-swarm-client.yml down -v
- On each nodes:
- wait until
docker ps
prints no services. - execute
docker ps && docker container prune && docker volume prune
and confirmy
.
- wait until
- Remove all services:
docker service create --mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock -p 80:8080 -e PORT=8080 --constraint 'node.role == manager' --name swarm-dashboard mohsenasm/swarm-dashboard
- namenode -> http://localhost:9870
- spark history -> http://localhost:18080
- hadoop history -> http://localhost:8188
- swarm-dashboard -> http://localhost:80