Skip to content

unfiltered-syrup/Big_Data_Final

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This project is a demographic analysis of indoor environmental complaints in new york city. 
We have 2 datasets, a NYC environmental complaints dataset and a NYC demographics dataset.

Step1:

DATA INGESTION
To run the code, upload Clean.java, Clean.class, Clean$CleanReducer.class, and Clean$CleanMapper.class
to dataproc and run the following command:
hadoop jar Clean.jar Clean /complaints_data.csv /ingestion_output
Now, you should see the output in multiple text files in the /ingestion_output directory. 
Run the following command to merge the output files into one csv file:
hadoop fs -cat /ingestion_output/part-* | hadoop fs -put - /complaints_data_ingested.csv


Step2:

DATA CLEANING
 
The dataset used is here is "complaints_data_ingested.csv"

All output data can be found as csv files in /, which is the output of the data ingestion process.

To run the code for the environmental complaints dataset, run cleaning_code.scala (located in SourceCode/cleaning) using the command
spark-shell --deploy-mode client -i cleaning_code.scala 

Step3:

The code to clean the demographics dataset can be found in /etl_code/jethro_cheng.
The dataset used here is "demo_2021acs5yr_nta.xlsx"

To run the code for the demographics dataset, upload cleanedAgeDF.scala and cleanedRaceDF.scala.
Then in dataproc, run 
spark-shell --deploy-mode client --packages com.crealytics:spark-excel_2.12:0.13.7
Once you are in spark, run
:load cleanedAgeDF.scala 
or 
:load cleanedRaceDF.scala
-----------------------------------------------------------------------------------------

DATA PROFILING

Step4:

analysis_code.scala:

The input of the analysis is the result of the cleaning process in /cleaning_output
and the output of Jethro's data cleaning process. Name the output of the cleaned complaints data as 
"complaints_data_cleaned.csv" and the output of Jethro's demographic data as "demo_data_cleaned.csv".
Put both files in /

Then, run AnalysisCode.scala (located in SourceCode/analysis) using the command
spark-shell --deploy-mode client -i AnalysisCode.scala
All data output during the analysis process can be found in /analysis_output

Step5:

To run the linear regression analysis (located in SourceCode/analysis) 
First name the output of the previous analysis as dataset_additional_features.csv and run the following command
spark-shell --deploy-mode client -i RegressionAnalysis.scala

After running the regression analysis, you should see a list of coefficients in the console and 
saved as a txt file in your local directory.

Step6:

The code to run the data profiling for the demographics dataset can be found in /profiling_code/jethro_cheng.

To run the profiling code, upload MeanMedianAge.scala and MajorityRace.scala.
Then in dataproc, run 
spark-shell --deploy-mode client --packages com.crealytics:spark-excel_2.12:0.13.7
Once you are in spark, run
:load MeanMedianAge.scala 
or 
:load MajorityRace.scala



About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published