
Data-Pipeline-based-on-Messaging-using-Pyspark-and-AirFlow

This project demonstrates building a big data pipeline on AWS at scale. The Covid-19 dataset is streamed in real time from an external API using NiFi. NiFi parses the complex JSON data into CSV and stores the result in HDFS. The data is then published to Kafka and processed with PySpark; Spark writes the processed output back to HDFS. A Hive table is created on top of the HDFS output, and the cleaned, transformed data is finally made available in the data lake.
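A minimal sketch of the PySpark consumer stage, assuming a Kafka topic named `covid_data`, a local broker, an illustrative four-column CSV schema, and hypothetical HDFS paths (none of these names come from the project itself; the job also needs the `spark-sql-kafka` package on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

# Topic name, broker address, schema, and HDFS paths are illustrative assumptions.
spark = (SparkSession.builder
         .appName("covid-kafka-consumer")
         .getOrCreate())

# Read the CSV records that NiFi published to Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "covid_data")
       .option("startingOffsets", "earliest")
       .load())

# Kafka delivers bytes; cast the message value to a string,
# then split the comma-separated fields into named columns.
parsed = (raw.selectExpr("CAST(value AS STRING) AS csv_row")
          .select(split(col("csv_row"), ",").alias("f"))
          .select(
              col("f").getItem(0).alias("country"),
              col("f").getItem(1).cast("int").alias("confirmed"),
              col("f").getItem(2).cast("int").alias("deaths"),
              col("f").getItem(3).cast("int").alias("recovered"),
          ))

# Write the processed stream back to HDFS as Parquet for the Hive layer.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/covid/processed")
         .option("checkpointLocation", "hdfs:///checkpoints/covid")
         .outputMode("append")
         .start())
query.awaitTermination()
```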

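On top of the processed files, the Hive table can be declared as an external table pointing at the HDFS output. A sketch of that DDL, run here through `spark.sql` with Hive support enabled (table name, schema, and location are the same assumptions as above, not the project's actual values):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("covid-hive-table")
         .enableHiveSupport()   # lets Spark talk to the Hive metastore
         .getOrCreate())

# External table: Hive tracks the schema, HDFS keeps the Parquet files,
# so dropping the table never deletes the underlying data.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS covid_stats (
        country   STRING,
        confirmed INT,
        deaths    INT,
        recovered INT
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/covid/processed'
""")

# Quick sanity check on the freshly registered table.
spark.sql("SELECT country, confirmed FROM covid_stats LIMIT 10").show()
```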

Further tasks:

  • Use an orchestrator other than Airflow, such as Luigi, Prefect, or Dagster (the current Airflow orchestration is sketched below)
  • Create a visualization
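
For reference, the existing Airflow orchestration can be expressed as a small DAG wiring the Spark processing and Hive steps together; the sketch below is illustrative only (DAG id, schedule, and script paths are hypothetical, not taken from the project):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# DAG id, schedule, and script paths are placeholders, not the project's actual values.
with DAG(
    dag_id="covid_pipeline",
    start_date=datetime(2023, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kick off the Spark job that consumes from Kafka and writes to HDFS.
    process = BashOperator(
        task_id="spark_process",
        bash_command="spark-submit /opt/jobs/covid_stream.py",
    )

    # Refresh the Hive table definition over the new HDFS output.
    load_hive = BashOperator(
        task_id="create_hive_table",
        bash_command="spark-submit /opt/jobs/create_hive_table.py",
    )

    process >> load_hive
```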
