Fetch news articles about a topic and transfer them to S3. Wrapped up as a Lambda handler.

benco5/news-to-S3-lambda

News to S3 Lambda

News to S3 Lambda is an ETL (Extract, Transform, Load) pipeline that fetches news articles about a particular topic using the NewsAPI service and transfers the extracted text to an S3 bucket. It is designed to run as a scheduled job on AWS Lambda.

Functionality

The pipeline runs the three ETL steps in order:

  1. Extract: The pipeline calls the NewsAPI service to fetch URLs of news articles based on a search query and date range.
  2. Transform: The pipeline fetches the HTML at each article's URL, parses it, and extracts the visible text.
  3. Load: The extracted text is uploaded to an S3 bucket for further processing and analysis.
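
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names, the `VisibleTextParser` class, and the hashed key layout are all hypothetical, and the real pipeline may use a NewsAPI client library and a different HTML parser. The upload helper assumes `boto3` is available, as it is in the AWS Lambda Python runtime.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urlencode

NEWSAPI_ENDPOINT = "https://newsapi.org/v2/everything"


def build_search_url(query, date_from, date_to):
    """Extract: build a NewsAPI search URL for a query and date range.

    The API key is not part of the URL here; it would be sent separately,
    for example in an X-Api-Key request header.
    """
    params = {"q": query, "from": date_from, "to": date_to}
    return f"{NEWSAPI_ENDPOINT}?{urlencode(params)}"


class VisibleTextParser(HTMLParser):
    """Transform: collect text that sits outside non-visible elements."""

    SKIP = {"script", "style", "head", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_visible_text(html):
    """Return the visible text of an HTML document, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


def object_key(prefix, url):
    """Load: derive a stable S3 key from the article URL (hypothetical layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{prefix.rstrip('/')}/{digest}.txt"


def upload_text(bucket, key, text):
    """Load: write the extracted text to S3. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=text.encode("utf-8"))
```

Hashing the URL keeps object keys short and free of characters that are awkward in S3 paths, and re-running the job on the same article overwrites the same object instead of creating a duplicate.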

Setup and Configuration

To deploy and run the News to S3 Lambda pipeline, follow these steps:

  1. Clone the project repository.
  2. Create and configure your AWS IAM user, credentials, S3 bucket, Lambda function and associated policies.
  3. Customize the pipeline configuration by modifying the relevant variables in news_article_pipeline.py, such as the search query, S3 bucket name, and namespace prefix. (See Planned Work.)
  4. Store the required secrets, such as the necessary API keys, in AWS Secrets Manager.
  5. With invoke installed, build the Lambda package by running the invoke task:
    invoke package-lambda
  6. Deploy the Lambda function by uploading the built package via your preferred method (e.g., AWS CLI, AWS Console).
  7. Schedule the Lambda function to run at your desired frequency, for example with AWS CloudWatch or EventBridge.
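
For step 4, the handler needs to read the stored secret at run time. The following is a rough sketch of how that could look, assuming the secret is stored as a JSON string. The secret field name `NEWS_API_KEY`, the environment variable `NEWS_SECRET_ID`, and both function bodies are assumptions for illustration; the project's own handler may differ. `get_secret_value` is the standard boto3 Secrets Manager call.

```python
import json
import os


def load_api_key(secret_id, client=None, field="NEWS_API_KEY"):
    """Fetch a JSON secret from AWS Secrets Manager and return one field.

    `field` is a hypothetical key name; use whatever name the secret was
    created with. Passing `client` makes the function easy to test offline.
    """
    if client is None:
        import boto3  # available in the AWS Lambda Python runtime

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])[field]


def lambda_handler(event, context):
    """Entry point invoked by the schedule configured in step 7."""
    api_key = load_api_key(os.environ["NEWS_SECRET_ID"])  # hypothetical variable
    # The extract, transform, and load steps would run here using api_key.
    return {"status": "ok", "has_key": bool(api_key)}
```

Reading the secret inside the handler, and not at build time, keeps the API key out of the deployment package produced by `invoke package-lambda`.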

Planned Work

The News to S3 Lambda pipeline is an initial implementation that can be further improved and extended. Some planned work includes:

  • Testing: Adding comprehensive test cases to ensure the correctness and robustness of the pipeline.
  • Abstraction: Abstracting the pipeline to allow for more general use by making the query term and other parameters configurable.
  • Continuous Deployment: Setting up a continuous deployment task to automatically update the Lambda function package when changes are made.
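
As one possible direction for the Abstraction item, the query and other parameters could be read from Lambda environment variables in place of constants edited in news_article_pipeline.py. This is only a sketch of that idea; the `PipelineConfig` class and the variable names `NEWS_QUERY`, `NEWS_BUCKET`, and `NEWS_PREFIX` are hypothetical and not part of the current code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Settings that are currently hard-coded in the pipeline module."""

    query: str
    bucket: str
    prefix: str = "news"

    @classmethod
    def from_env(cls, env=None):
        """Build a config from a mapping, defaulting to os.environ."""
        env = os.environ if env is None else env
        return cls(
            query=env["NEWS_QUERY"],  # required: raises KeyError if missing
            bucket=env["NEWS_BUCKET"],  # required
            prefix=env.get("NEWS_PREFIX", "news"),  # optional
        )
```

With this shape, the same deployment package could serve several topics by creating one Lambda function per topic and changing only its environment variables.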

Contributing

Contributions to the News to S3 Lambda project are welcome! If you encounter any issues or have suggestions for improvements, please feel free to submit a pull request or open an issue in the project repository.

License

This project is licensed under the MIT License.
