big-data-project

Part I Script

The Part I script can be run with the following command:

spark-submit validation_script.py PATH-TO-FILE

where PATH-TO-FILE is the path to the input CSV data file. The outputs are stored as partitioned text files, one directory per column, with each directory named according to the following convention: col_col_id_col_name
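To inspect one column's output outside of Spark, the partition files can be concatenated with a few lines of plain Python. This is only a minimal sketch: it assumes the script writes Spark's usual text output layout (files named `part-00000`, `part-00001`, and so on, next to a `_SUCCESS` marker) and that the directory has been copied to the local filesystem. The helper name `read_partitioned_text` is hypothetical and not part of this repository.

```python
from pathlib import Path


def read_partitioned_text(directory):
    """Return all lines from the part-* files in one output directory.

    Assumption: the directory follows Spark's standard text output layout,
    where each partition is written as a separate part-NNNNN file.
    """
    lines = []
    # Sort so that partitions are read in their numeric order.
    for part in sorted(Path(directory).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            lines.extend(line.rstrip("\n") for line in handle)
    return lines
```

Calling the helper with the path of a single column directory returns that column's records as a list of strings. Marker files such as `_SUCCESS` are skipped because they do not match the `part-*` pattern.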

The dataset is available for download at https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i

Part II Data Analysis

Most of our work is done in Jupyter notebooks using PySpark, Pandas, and a variety of other libraries for better visualization and easier understanding. To run these notebooks, we launched an EMR cluster with Spark on AWS and set up a notebook connection to the master node by following the tutorial at https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/. The notebooks can then be run like any ordinary notebook, provided there is an Internet connection.

