Reddit data pipeline: pull data from the Reddit API, containerize the ETL with Docker, and schedule it with Airflow
Reddit API
- Serves as the source of the data.
- Data is pulled using HTTP requests.
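As a concrete illustration, the pull can be sketched with the standard library alone. This is a minimal sketch, not the project's actual script: the subreddit name, field selection, and User-Agent string are assumptions; Reddit's public JSON listing endpoint (`/r/<subreddit>/new.json`) is the assumed source.

```python
"""Minimal sketch: pull new posts from Reddit's public JSON listing."""
import json
import urllib.request

BASE = "https://www.reddit.com"

def listing_url(subreddit: str, limit: int = 25) -> str:
    # Reddit exposes a public JSON listing per subreddit.
    return f"{BASE}/r/{subreddit}/new.json?limit={limit}"

def parse_posts(payload: dict) -> list[dict]:
    # Keep only the fields the downstream steps need (assumed selection).
    return [
        {
            "id": child["data"]["id"],
            "title": child["data"]["title"],
            "created_utc": child["data"]["created_utc"],
        }
        for child in payload["data"]["children"]
    ]

def fetch_posts(subreddit: str, limit: int = 25) -> list[dict]:
    req = urllib.request.Request(
        listing_url(subreddit, limit),
        # Reddit rejects requests without a descriptive User-Agent.
        headers={"User-Agent": "reddit-etl-demo/0.1"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_posts(json.load(resp))
```

In practice the authenticated Reddit API (via OAuth or a wrapper such as PRAW) is preferable for higher rate limits; the public listing keeps the sketch dependency-free.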
Python ETL script
- Handles the Extraction, Transformation, and Loading (ETL) steps:
  - Extracts data from the Reddit API.
  - Transforms the data (e.g., cleaning, filtering, aggregation).
  - Loads the transformed data into an AWS S3 bucket.
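The transform and load steps can be sketched as follows. This is an assumed implementation: the cleaning rules (trimming titles, dropping duplicates and empty titles), the JSON Lines output format, and the bucket/key names in the commented upload are all illustrative.

```python
"""Sketch of the transform step and the shape of the load step."""
import json

def transform(posts: list[dict]) -> list[dict]:
    # Clean and filter raw posts: trim titles, drop duplicates and empties.
    seen: set[str] = set()
    rows = []
    for p in posts:
        title = p.get("title", "").strip()
        if p["id"] in seen or not title:
            continue
        seen.add(p["id"])
        rows.append({"id": p["id"], "title": title, "created_utc": p["created_utc"]})
    return rows

def to_jsonl(rows: list[dict]) -> str:
    # One JSON object per line: a convenient format for S3-based querying.
    return "\n".join(json.dumps(r) for r in rows)

# Load step (assumes boto3 is installed and AWS credentials are configured):
# import boto3
# boto3.client("s3").put_object(
#     Bucket="my-reddit-bucket",      # hypothetical bucket name
#     Key="reddit/posts.jsonl",       # hypothetical object key
#     Body=to_jsonl(rows).encode(),
# )
```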
Docker
- Encapsulates the Python ETL script in a container image.
- Ensures consistency and reproducibility across different environments.
- Simplifies deployment and management.
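A container image for the script might look like the following Dockerfile. The file names (`etl.py`, `requirements.txt`) and the base image tag are assumptions, not taken from the project.

```dockerfile
# Assumed layout: requirements.txt and etl.py at the repo root.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY etl.py .
ENTRYPOINT ["python", "etl.py"]
```

Copying `requirements.txt` before the script lets Docker cache the dependency layer, so code-only changes rebuild quickly.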
Apache Airflow
- Orchestrates the ETL process.
- Schedules and manages execution of the Docker container running the ETL script.
- Monitors the workflow and handles retries and failures.
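The scheduling, retries, and container execution above can be expressed as a DAG. This sketch assumes Airflow 2.x with the Docker provider installed; the DAG id, schedule, retry settings, image tag, and environment variable are all placeholders.

```python
# Deployment sketch, not runnable standalone: requires an Airflow installation
# with apache-airflow-providers-docker.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="reddit_etl",                      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                        # run the pull once a day
    catchup=False,
    default_args={
        "retries": 2,                         # Airflow re-runs failed tasks
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    run_etl = DockerOperator(
        task_id="run_reddit_etl",
        image="reddit-etl:latest",            # image built from the project's Dockerfile
        environment={"S3_BUCKET": "my-reddit-bucket"},  # hypothetical bucket
    )
```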
AWS S3
- Stores the processed data.
- Provides scalable and durable storage.
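One common way to organize the processed data in S3 is a date-partitioned key layout, sketched below. The prefix and file name are assumptions; the actual upload (commented) would use boto3.

```python
"""Sketch of a date-partitioned S3 key layout for daily ETL output."""
from datetime import date

def s3_key(run_date: date, prefix: str = "reddit/posts") -> str:
    # year=/month=/day= partitions keep daily runs separate and are
    # recognized by query engines such as Athena.
    return (
        f"{prefix}/year={run_date.year}"
        f"/month={run_date.month:02d}"
        f"/day={run_date.day:02d}/posts.jsonl"
    )

# Upload (assumes boto3 and AWS credentials):
# import boto3
# boto3.client("s3").put_object(
#     Bucket="my-reddit-bucket",               # hypothetical bucket
#     Key=s3_key(date.today()),
#     Body=b"...",
# )
```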