This project is a Spark application, written in Scala, for distributed data processing. It includes an SBT build configuration, a Jupyter notebook for development and experimentation, and a shell script for packaging the application and submitting the Spark job.
- `build.sbt`: SBT build configuration file.
- `HW4.ipynb`: Jupyter notebook containing the code and documentation for the project.
- `run.sh`: Shell script that packages the application and submits it to a Spark cluster.
Building and running the project requires:

- Scala 2.11.12
- Spark 2.3.0
- JDK 8 or higher
- SBT (Simple Build Tool) 1.3.13 or higher
- An Apache Hadoop/YARN setup for running Spark in cluster mode
To set up and run the project:

1. Clone the repository: `git clone <repository-url>`, then `cd <repository-directory>`.
2. Ensure that SBT is installed. You can download it from the [official sbt site](https://www.scala-sbt.org/).
3. Package the project using SBT: `sbt package`.
4. Ensure that Hadoop/YARN is properly configured and running.
5. Use the provided `run.sh` script to submit the Spark job to the cluster: `sh run.sh`.
The `run.sh` script includes the following commands:
```bash
# Build the application jar
sbt package

# Submit the job to YARN in cluster mode: 5 executors, 4 GB of memory and 2 cores each
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.instances=5 \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=2 \
  --class Q7 \
  ~/spark/target/scala-2.11/hw4_2.11-1.0.jar
```
This script packages the application and submits it to the YARN cluster with the specified resources: 5 executors, each with 4 GB of memory and 2 cores.
The `build.sbt` file includes the dependencies and build configuration for the project:
```scala
name := "HW4"
version := "1.0"
scalaVersion := "2.11.12"

// Spark 2.3.0 artifacts are published only for Scala 2.11, and run.sh expects the Scala 2.11 jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.3.0",
  "org.apache.spark" %% "spark-mllib" % "2.3.0"
)
```
The `HW4.ipynb` notebook contains the main code and analysis for the project. It walks through the following stages of data processing, analysis, and model building with Spark (a condensed Scala sketch of these stages follows the list):
- Data Ingestion: Loading data from HDFS and other data sources into Spark DataFrames.
- Data Cleaning and Preprocessing: Handling missing values, data type conversions, and data transformations.
- Exploratory Data Analysis (EDA): Using Spark SQL to perform descriptive statistics and visualize data distributions.
- Feature Engineering: Creating new features from raw data to improve model performance.
- Model Training: Training machine learning models using Spark MLlib, including data splitting, model selection, and hyperparameter tuning.
- Model Evaluation: Assessing model performance using metrics like accuracy, precision, recall, and F1-score.
- Model Deployment: Saving the trained model and setting up pipelines for batch or real-time predictions.
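The sketch below condenses these stages into a single Spark ML pipeline. It is illustrative rather than taken from the notebook: the HDFS paths (`hdfs:///data/input.csv`, `hdfs:///models/hw4-pipeline`), the column names (`feature1`, `feature2`, `label`), and the choice of logistic regression are placeholder assumptions, and the actual notebook may use different data and models.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("HW4-pipeline-sketch").getOrCreate()

    // Data ingestion: load a CSV from HDFS into a DataFrame
    // (the path and column names below are placeholders, not the notebook's actual data)
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/input.csv")

    // Data cleaning and preprocessing: drop rows with missing values in the columns we use
    val cleaned = raw.na.drop(Seq("feature1", "feature2", "label"))

    // Exploratory data analysis: basic descriptive statistics
    cleaned.describe("feature1", "feature2").show()

    // Feature engineering: index the label and assemble numeric columns into a feature vector
    val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex")
    val assembler = new VectorAssembler()
      .setInputCols(Array("feature1", "feature2"))
      .setOutputCol("features")

    // Model training: hold out a test split and fit a logistic regression pipeline
    val Array(train, test) = cleaned.randomSplit(Array(0.8, 0.2), seed = 42L)
    val lr = new LogisticRegression().setLabelCol("labelIndex").setFeaturesCol("features")
    val model = new Pipeline().setStages(Array(labelIndexer, assembler, lr)).fit(train)

    // Model evaluation: accuracy on the held-out split
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("labelIndex")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    println(s"Test accuracy: ${evaluator.evaluate(model.transform(test))}")

    // Model deployment: persist the fitted pipeline for later batch scoring
    model.write.overwrite().save("hdfs:///models/hw4-pipeline")

    spark.stop()
  }
}
```

Note that the project's actual entry point is the `Q7` class referenced in `run.sh`; `PipelineSketch` is only a stand-in name for this example.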
Contributions are welcome! Please create a pull request or open an issue to discuss changes.
This project is licensed under the MIT License. See the `LICENSE` file for more details.