This README provides a brief guide to setting up Dozer for real-time data ingestion from an AWS S3 bucket. For a more comprehensive tutorial, please refer to our blog post.

Prerequisites:
- AWS account with access to S3 services
- AWS CLI installed and configured
- Python installed
- Dozer installed

Steps:

- Generate and Upload Data to S3: Use a Python script to generate a dataset and upload it to an S3 bucket:
```bash
python create_dataset_and_upload_to_s3.py
```
If you already have a dataset in your S3 bucket, you can skip this step.
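
If you prefer to write your own script, here is a minimal sketch of the same idea using boto3. The column names and rows are illustrative rather than the exact dataset the script above produces; the bucket name matches the sample configuration below.

```python
# Minimal sketch: generate a small stock-price CSV and upload it to S3.
# Assumes boto3 is installed and AWS credentials are configured (e.g. via `aws configure`).
# The bucket name matches dozer-config.yaml; the columns and tickers are illustrative.
import csv
import io
import random

import boto3

BUCKET = "aws-s3-sample-stock-data-dozer"

# Build a small CSV in memory.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["ticker", "open", "close"])
for ticker in ["AAPL", "MSFT", "GOOG"]:
    open_price = round(random.uniform(100, 500), 2)
    close_price = round(open_price * random.uniform(0.95, 1.05), 2)
    writer.writerow([ticker, open_price, close_price])

# Upload the CSV to the bucket root, where the connector's `path: .` will find it.
s3 = boto3.client("s3")
s3.put_object(Bucket=BUCKET, Key="stocks.csv", Body=buf.getvalue().encode("utf-8"))
print(f"Uploaded stocks.csv to s3://{BUCKET}")
```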
- Configure Dozer: Create a YAML configuration file that defines the data sources, transformations, and APIs. Check out the sample configuration file dozer-config.yaml, which uses the AWS S3 connector:
```yaml
connections:
  - config: !S3Storage
      details:
        access_key_id: {{YOUR_ACCESS_KEY}}
        secret_access_key: {{YOUR_SECRET_KEY}}
        region: {{YOUR_REGION}}
        bucket_name: aws-s3-sample-stock-data-dozer
      tables:
        - !Table
          name: stocks
          config: !CSV
            path: . # path to files or a folder inside the bucket
            extension: .csv
    name: s3
```
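
The sample file also defines the SQL transformation and the API endpoint that the querying step below hits. Here is a hedged sketch of what those sections can look like, with the SQL and table names chosen for illustration to match the /analysis/ticker path used later; the actual contents of dozer-config.yaml may differ.

```yaml
# Illustrative sketch: aggregate the stocks table into an `analysis` table
# and expose it at the /analysis/ticker path queried in the steps below.
sql: |
  SELECT ticker, MIN(close) AS min_close, MAX(close) AS max_close
  INTO analysis
  FROM stocks
  GROUP BY ticker;

endpoints:
  - name: analysis
    path: /analysis/ticker
    table_name: analysis
```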
- Run Dozer: Start Dozer by running the following command in the terminal:

```bash
dozer -c dozer-config.yaml
```
- Query the Dozer APIs: Query the Dozer endpoints to get the results of your SQL queries. You can query the cache using gRPC or REST. Example query:

```bash
# REST
curl -X GET http://localhost:8080/analysis/ticker
```
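
For a scripted check, here is a minimal sketch of the same REST call from Python, assuming the requests library is installed and Dozer is serving REST on the default port 8080 shown above:

```python
# Query the Dozer REST endpoint from the curl example above.
# Assumes the `requests` library; the URL and port come from that example.
import requests

response = requests.get("http://localhost:8080/analysis/ticker")
response.raise_for_status()
print(response.json())  # records served from the Dozer cache
```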
- Append New Data & Query: Dozer automatically detects and ingests new data files added to the bucket, which lets you process recurring data without changing any configuration. Upload a new file to the bucket (as sketched below) and you will see Dozer ingesting the newly uploaded file in the console log.
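
As a minimal sketch of that append step, again assuming boto3 with configured AWS credentials; the local file name is hypothetical:

```python
# Upload an additional CSV to the same bucket to trigger Dozer's automatic ingestion.
# Assumes boto3 and AWS credentials; "stocks_day2.csv" is an illustrative file name.
import boto3

BUCKET = "aws-s3-sample-stock-data-dozer"

s3 = boto3.client("s3")
s3.upload_file("stocks_day2.csv", BUCKET, "stocks_day2.csv")
print(f"Uploaded stocks_day2.csv to s3://{BUCKET}; watch the Dozer console log for ingestion")
```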
If you encounter any issues or have suggestions, please file an issue in the issue tracker on our GitHub page or reach out to us on Discord.
Happy coding with Dozer!
We love contributions! Please check our Contributing Guidelines if you're interested in helping!