
Data-Pipeline-based-on-Messaging-using-Pyspark-and-AirFlow

This project demonstrates building a big data pipeline on AWS at scale. The Covid-19 dataset is streamed in real time from an external API using NiFi. NiFi parses the complex JSON data into CSV and stores the result in HDFS. The data is then published to Kafka and processed with PySpark; Spark writes the processed output back to HDFS. A Hive table is created on top of the HDFS output, and the cleaned, transformed data is finally made available in the data lake.
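A minimal sketch of the PySpark consumer stage, assuming a Kafka topic named `covid_data`, a local broker, an illustrative four-column CSV schema, and hypothetical HDFS paths (none of these names come from the project itself; the job also needs the `spark-sql-kafka` package on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

# Topic name, broker address, schema, and HDFS paths are illustrative assumptions.
spark = (SparkSession.builder
         .appName("covid-kafka-consumer")
         .getOrCreate())

# Read the CSV records that NiFi published to Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "covid_data")
       .option("startingOffsets", "earliest")
       .load())

# Kafka delivers bytes; cast the message value to a string,
# then split the comma-separated fields into named columns.
parsed = (raw.selectExpr("CAST(value AS STRING) AS csv_row")
          .select(split(col("csv_row"), ",").alias("f"))
          .select(
              col("f").getItem(0).alias("country"),
              col("f").getItem(1).cast("int").alias("confirmed"),
              col("f").getItem(2).cast("int").alias("deaths"),
              col("f").getItem(3).cast("int").alias("recovered"),
          ))

# Write the processed stream back to HDFS as Parquet for the Hive layer.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/covid/processed")
         .option("checkpointLocation", "hdfs:///checkpoints/covid")
         .outputMode("append")
         .start())
query.awaitTermination()
```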

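On top of the processed files, the Hive table can be declared as an external table pointing at the HDFS output. A sketch of that DDL, run here through `spark.sql` with Hive support enabled (table name, schema, and location are the same assumptions as above, not the project's actual values):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("covid-hive-table")
         .enableHiveSupport()   # lets Spark talk to the Hive metastore
         .getOrCreate())

# External table: Hive tracks the schema, HDFS keeps the Parquet files,
# so dropping the table never deletes the underlying data.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS covid_stats (
        country   STRING,
        confirmed INT,
        deaths    INT,
        recovered INT
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/covid/processed'
""")

# Quick sanity check on the freshly registered table.
spark.sql("SELECT country, confirmed FROM covid_stats LIMIT 10").show()
```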

Further tasks:

  • Use an orchestrator other than Airflow, such as Luigi, Prefect, or Dagster (the current Airflow orchestration is sketched below)
  • Create a visualization
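
For reference, the existing Airflow orchestration can be expressed as a small DAG wiring the Spark processing and Hive steps together; the sketch below is illustrative only (DAG id, schedule, and script paths are hypothetical, not taken from the project):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# DAG id, schedule, and script paths are placeholders, not the project's actual values.
with DAG(
    dag_id="covid_pipeline",
    start_date=datetime(2023, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kick off the Spark job that consumes from Kafka and writes to HDFS.
    process = BashOperator(
        task_id="spark_process",
        bash_command="spark-submit /opt/jobs/covid_stream.py",
    )

    # Refresh the Hive table definition over the new HDFS output.
    load_hive = BashOperator(
        task_id="create_hive_table",
        bash_command="spark-submit /opt/jobs/create_hive_table.py",
    )

    process >> load_hive
```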
