Repository of the code produced for the Scalable and Cloud Programming course.
- solutions folder: all the explored solutions
- resources folder: a reduced version of the dataset, used to compare the solutions locally
- data_analysis folder: Python scripts used for data analysis
The program accepts four parameters:
- the path of the input file
- the path of the output folder
- the solution code
- the number of nodes
The only solution that uses the number-of-nodes parameter is BestSolutionWithPartitions; all the other solutions ignore it, so it can be omitted.
Each explored solution has a solution_id that can be used as a parameter when running the project.
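For orientation, here is a minimal, hypothetical sketch of how a Spark driver could consume these four parameters; the object and variable names are illustrative and are not taken from the repository:

```scala
// Hypothetical sketch (not the repository's actual Main) of how the four
// CLI parameters could be read; names and the default are illustrative.
object ArgsSketch {
  def main(args: Array[String]): Unit = {
    val inputPath  = args(0)        // path of the input file
    val outputPath = args(1)        // path of the output folder
    val solutionId = args(2).toInt  // solution code (see the table below)
    // Only BestSolutionWithPartitions needs the fourth parameter,
    // so it falls back to a default when omitted.
    val numberOfNodes = if (args.length > 3) args(3).toInt else 1
    println(s"solution=$solutionId input=$inputPath output=$outputPath nodes=$numberOfNodes")
  }
}
```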
The available solutions are:
| Solution Id | Solution Name |
|---|---|
| 0 | FirstSolution |
| 1 | GroupByKey |
| 2 | NewPairsMapping |
| 3 | MergeTwoStages |
| 4 | MergeTwoStagesNoFilter |
| 5 | BestSolutionWithPartitions |
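While the exact implementations live in the solutions folder, the names above suggest the usual evolution from a groupByKey-based job towards a reduce-based, explicitly partitioned one. The following is only a hedged sketch of that final idea, assuming the input is a CSV of order_id,product_id lines and the goal is to count, for each pair of products, the orders in which they appear together; it is not the repository's actual code:

```scala
// Hedged sketch of the co-purchase counting idea. Assumptions: the input is
// one "order_id,product_id" per line and the output is "product_a,product_b,count".
// Object names and the partitioner sizing are illustrative, not the repo's code.
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object CoPurchaseSketch {
  def main(args: Array[String]): Unit = {
    val inputPath     = args(0)
    val outputPath    = args(1)
    val numberOfNodes = if (args.length > 3) args(3).toInt else 1

    val spark = SparkSession.builder.appName("CoPurchaseSketch").getOrCreate()
    val sc    = spark.sparkContext

    // order_id -> product_id
    val orders = sc.textFile(inputPath).map { line =>
      val fields = line.split(",")
      (fields(0).toInt, fields(1).toInt)
    }

    // For each order, emit every unordered pair of distinct products,
    // then count in how many orders each pair occurs.
    val partitioner = new HashPartitioner(numberOfNodes * 4) // heuristic sizing (assumption)
    val pairCounts = orders
      .groupByKey()
      .flatMap { case (_, products) =>
        val ps = products.toSeq.distinct.sorted
        for (i <- ps.indices; j <- i + 1 until ps.length) yield ((ps(i), ps(j)), 1)
      }
      .reduceByKey(partitioner, _ + _)

    pairCounts
      .map { case ((a, b), count) => s"$a,$b,$count" }
      .saveAsTextFile(outputPath)

    spark.stop()
  }
}
```

Aggregating with reduceByKey over an explicit partitioner sized from the number of nodes is presumably what distinguishes solution 5 from the earlier variants, but consult the solutions folder for the actual differences.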
To run the project on Google Cloud Dataproc:
- Create a new project on Google Cloud Platform
- Enable Dataproc and the required APIs
- Create and download the service account keys
- Export useful variables:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
export PROJECT=<project-id>
export BUCKET_NAME=<bucket-name>
export CLUSTER=<cluster-name>
export REGION=<region>
- Clone this repository:
git clone https://github.com/taglioIsCoding/Co-purchaseAnalysis/
- Initialize the Google Cloud project:
gcloud init
- Create the bucket:
gcloud storage buckets create gs://${BUCKET_NAME} --location=${REGION}
- Create the cluster:
- With a single node:
gcloud dataproc clusters create ${CLUSTER} \
  --project=${PROJECT} \
  --region=${REGION} \
  --single-node \
  --master-boot-disk-size 240
- With multiple nodes:
gcloud dataproc clusters create ${CLUSTER} \
  --region=${REGION} \
  --num-workers=<number-of-workers> \
  --master-boot-disk-size 240 \
  --worker-boot-disk-size 240 \
  --project=${PROJECT}
- Put the input file in the bucket:
gcloud storage cp </path/to/dataset.csv> gs://${BUCKET_NAME}/input/input.csv
- Build the project:
sbt clean package
- Put the project jar in the bucket:
gcloud storage cp ./target/scala-2.12/copurchaseanalysis_2.12-0.1.0-SNAPSHOT.jar gs://${BUCKET_NAME}/scala/project.jar
- Submit a job:
You can select any of the available solutions; the best-performing one is number 5 (BestSolutionWithPartitions).
gcloud dataproc jobs submit spark --cluster=${CLUSTER} \
  --class=Main \
  --jars=gs://${BUCKET_NAME}/scala/project.jar \
  --region=${REGION} \
  -- gs://${BUCKET_NAME}/input/input.csv gs://${BUCKET_NAME}/output/ <solution-id> <number-of-nodes>
- Delete the cluster:
gcloud dataproc clusters delete ${CLUSTER} --region=${REGION} --project=${PROJECT}