Reddit data pipeline: pull data from the Reddit API, then schedule the ETL job with Apache Airflow and Docker.

Reddit API
- Source of the data.
- Data is pulled via HTTP requests.

Python ETL Script
- Handles the extraction, transformation, and loading (ETL) process (see the sketch below).
- Extracts data from the Reddit API.
- Transforms the data (e.g., cleaning, filtering, aggregation).
- Loads the transformed data into an AWS S3 bucket.
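A minimal sketch of such an ETL script, assuming the public Reddit JSON listing endpoint, a placeholder subreddit, and a placeholder bucket name (`my-reddit-etl-bucket`); AWS credentials are expected to come from the environment. The actual script in this repo may use PRAW or different fields.

```python
import csv
import io
from datetime import datetime, timezone

import boto3
import requests

SUBREDDIT = "dataengineering"        # placeholder subreddit
S3_BUCKET = "my-reddit-etl-bucket"   # placeholder bucket name


def extract(limit: int = 100) -> list[dict]:
    """Pull the newest posts from the subreddit's public JSON listing."""
    resp = requests.get(
        f"https://www.reddit.com/r/{SUBREDDIT}/new.json",
        params={"limit": limit},
        headers={"User-Agent": "reddit-etl-sketch/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    return [child["data"] for child in resp.json()["data"]["children"]]


def transform(posts: list[dict]) -> list[dict]:
    """Keep a handful of fields and drop deleted posts."""
    return [
        {
            "id": p["id"],
            "title": p["title"].strip(),
            "author": p.get("author", ""),
            "score": p["score"],
            "num_comments": p["num_comments"],
            "created_utc": p["created_utc"],
        }
        for p in posts
        if p.get("author") not in (None, "[deleted]")
    ]


def load(rows: list[dict]) -> str:
    """Write the rows to S3 as a timestamped CSV and return the object key."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    key = f"reddit/{datetime.now(timezone.utc):%Y%m%d_%H%M%S}_posts.csv"
    boto3.client("s3").put_object(
        Bucket=S3_BUCKET, Key=key, Body=buf.getvalue().encode("utf-8")
    )
    return key


if __name__ == "__main__":
    print("uploaded", load(transform(extract())))
```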

Docker
- Encapsulates the Python ETL script.
- Ensures consistency and reproducibility across different environments.
- Facilitates easy deployment and management.

Apache Airflow
- Orchestrates the ETL process (see the DAG sketch below).
- Schedules and manages the execution of the Docker container running the ETL script.
- Monitors the ETL workflow and handles retries and failures.
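A sketch of how the Airflow DAG might launch the containerized ETL script with the DockerOperator (from the apache-airflow-providers-docker package, assuming Airflow 2.x). The DAG id, image tag, command, and daily schedule are assumptions; the real DAG in this repo may differ.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

# DAG id, image tag, command, and schedule below are placeholders.
with DAG(
    dag_id="reddit_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    run_etl = DockerOperator(
        task_id="run_reddit_etl",
        image="reddit-etl:latest",                # image built from this repo's Dockerfile
        command="python etl.py",                  # entry point of the ETL script
        docker_url="unix://var/run/docker.sock",  # local Docker daemon
        network_mode="bridge",
    )
```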

AWS S3
- Stores the processed data.
- Provides scalable and durable storage.

Snowflake Integration Steps
- Start by integrating Snowflake with AWS using an IAM role (see the Infrastructure folder; a sketch follows).
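A sketch of the kind of statement this step runs, issued here through the Snowflake Python connector; the integration name, role ARN, bucket path, and connection details are placeholders, and the actual SQL lives in the Infrastructure/Snowflake folders.

```python
import os

import snowflake.connector

# Connection details are assumed to come from environment variables.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="ACCOUNTADMIN",  # creating integrations needs a sufficiently privileged role
)
cur = conn.cursor()

# Placeholder integration name, role ARN, and bucket path.
cur.execute("""
    CREATE STORAGE INTEGRATION IF NOT EXISTS reddit_s3_int
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-reddit-role'
      STORAGE_ALLOWED_LOCATIONS = ('s3://my-reddit-etl-bucket/reddit/')
""")

# DESC INTEGRATION returns STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID,
# which belong in the IAM role's trust policy on the AWS side.
cur.execute("DESC INTEGRATION reddit_s3_int")
for prop, _type, value, *_ in cur.fetchall():
    print(prop, value)
```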
- Create an external table (see the Snowflake folder; a sketch follows).
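A sketch of an external stage and external table built on the storage integration from the previous step; the names, bucket path, and file format are placeholders, and the real definitions are in the Snowflake folder.

```python
import os

import snowflake.connector

# Same assumed env-var connection pattern as in the previous sketch.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="REDDIT_DB",  # placeholder database and schema
    schema="PUBLIC",
)
cur = conn.cursor()

# External stage that reads the bucket through the storage integration.
cur.execute("""
    CREATE STAGE IF NOT EXISTS reddit_stage
      URL = 's3://my-reddit-etl-bucket/reddit/'
      STORAGE_INTEGRATION = reddit_s3_int
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# External table over the stage; with no column list, Snowflake exposes each
# row as a single VALUE variant column.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS reddit_ext_posts
      WITH LOCATION = @reddit_stage
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
      AUTO_REFRESH = FALSE
""")
```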
- Create a Snowpipe in Snowflake.
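A sketch of the pipe. Snowpipe's COPY statement loads into a regular table, so this assumes a placeholder target table (reddit_posts) shaped like the CSV the ETL script writes; AUTO_INGEST = TRUE is what creates the SQS notification channel used in the next step.

```python
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="REDDIT_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Placeholder target table matching the CSV written by the ETL script.
cur.execute("""
    CREATE TABLE IF NOT EXISTS reddit_posts (
      id STRING, title STRING, author STRING,
      score NUMBER, num_comments NUMBER, created_utc NUMBER
    )
""")

# AUTO_INGEST = TRUE gives the pipe an SQS queue that S3 events can notify.
cur.execute("""
    CREATE PIPE IF NOT EXISTS reddit_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO reddit_posts
      FROM @reddit_stage
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
```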
- Copy the SQS ARN from Snowflake (shown by SHOW PIPES) into the S3 bucket's event notification configuration.
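The SQS ARN appears in the notification_channel column of SHOW PIPES; it goes into the bucket's event notification so new objects trigger the pipe. A sketch using the same placeholder names; note that put_bucket_notification_configuration replaces any existing notification settings on the bucket.

```python
import os

import boto3
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="REDDIT_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# SHOW PIPES returns one row per pipe; notification_channel holds the SQS ARN.
cur.execute("SHOW PIPES LIKE 'reddit_pipe'")
columns = [col[0] for col in cur.description]
pipe = dict(zip(columns, cur.fetchone()))
sqs_arn = pipe["notification_channel"]

# Point the bucket's ObjectCreated events at Snowpipe's SQS queue.
# Caution: this call overwrites the bucket's existing notification configuration.
boto3.client("s3").put_bucket_notification_configuration(
    Bucket="my-reddit-etl-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {"QueueArn": sqs_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)
```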
- Upload a test file to the S3 bucket and check that the data appears in the tables.
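A quick end-to-end check with the same placeholder names: upload a CSV under the prefix the pipe watches, give Snowpipe a moment to ingest it, then count rows in the target table.

```python
import os
import time

import boto3
import snowflake.connector

# Upload a local CSV under the prefix the stage and pipe watch (placeholder names).
boto3.client("s3").upload_file(
    "sample_posts.csv", "my-reddit-etl-bucket", "reddit/sample_posts.csv"
)

# Snowpipe ingestion is asynchronous; wait briefly before checking.
time.sleep(60)

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="REDDIT_DB",
    schema="PUBLIC",
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM reddit_posts")
print("rows in reddit_posts:", cur.fetchone()[0])
```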