This project showcases the rich AWS S3 Select feature to stream a large data file in a paginated style.
Currently, S3 Select does not support `OFFSET`, so we cannot paginate the results of a query directly. Instead, we use the `ScanRange` feature to stream the contents of the S3 file.
Importing (reading) a large file leads to an Out of Memory error. It can even crash the system. Libraries such as Pandas and Dask are very good at processing large files, but they require the file to be present locally, i.e. we would have to download it from S3 to our machine first. But what if we do not want to fetch and store the whole S3 file locally at once? 🤔
Well, we can make use of AWS S3 Select to stream a large file via its `ScanRange` parameter. This approach is similar to how a paginated API works: instead of a limit and offset on records, we provide a limit and offset on the bytes to stream. S3 Select is intelligent enough to handle rows that straddle a byte boundary, so a row is either returned in full or skipped for that range, never cut in half.
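As a rough illustration of the idea (not the code used in this repo), the loop below pages through a CSV object in fixed-size byte chunks using boto3's `select_object_content` with `ScanRange`. The bucket name, key and chunk size are placeholder values.

```python
# Minimal sketch of byte-range pagination with S3 Select (illustrative only).
# BUCKET, KEY and CHUNK_SIZE are placeholders -- adjust to your own setup.
import boto3

BUCKET = "my-bucket"
KEY = "data/data.csv"
CHUNK_SIZE = 1024 * 1024  # stream roughly 1 MB of the object per request

s3 = boto3.client("s3")
file_size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

start = 0
while start < file_size:
    end = min(start + CHUNK_SIZE, file_size)
    response = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression="SELECT * FROM s3object s",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"JSON": {}},
        ScanRange={"Start": start, "End": end},
    )
    for event in response["Payload"]:
        if "Records" in event:
            # newline-delimited JSON records whose rows start within this byte range
            print(event["Records"]["Payload"].decode("utf-8"), end="")
    start = end  # move the "offset" to the next page of bytes
```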
You can find an in-depth article on this implementation here.
- Python 3.9.13 or higher
- An up-and-running Redis instance
- An AWS account with an S3 bucket and an object (or upload `/data/data.csv` from this repo)
- `aws-cli` configured locally (with read access to S3)
📜 This project is a clone of one of my projects, Flask Boilerplate, to quickly get started on the topic 😆
# clone the repo
$ git clone https://github.com/idris-rampurawala/s3-select-demo.git
# move to the project folder
$ cd s3-select-demo
If you want to run Redis via Docker:
# at the root of this project
$ docker run -d --name="flask-boilerplate-redis" -p 6379:6379 redis
- Install `pipenv` as a global Python package: `pip install pipenv`
- Create a virtual environment for this project:
# creating pipenv environment for python 3.9 (if you have multiple python versions, then check last command)
$ pipenv --three
# activating the pipenv environment
$ pipenv shell
# install all dependencies (include -d for installing dev dependencies)
$ pipenv install -d
# if you have multiple python 3 versions installed then
$ pipenv install -d --python 3.9
- There are 3 configurations, `development`, `staging` and `production`, in `config.py`. Default is `development`
- Create a `.env` file from `.env.example` and set appropriate environment variables before running the project (see the sample sketched after this list)
- Run the Flask app: `python run.py`
- Logs are generated under the `log` folder
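For reference, a `.env` might look roughly like the sketch below. The variable names are hypothetical placeholders, not copied from `.env.example`; always use the keys defined in that file.

```
# Hypothetical key names for illustration only -- copy .env.example and keep its real keys
FLASK_ENV=development
API_KEY=436236939443955C11494D448451F
AWS_BUCKET_NAME=my-bucket
AWS_OBJECT_KEY=data/data.csv
AWS_PROFILE_NAME=default
```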
- Run Redis locally before starting the Celery worker
- The Celery worker can be started with the following command
# run following command in a separate terminal
$ celery -A celery_worker.celery worker -l INFO
# (append `--pool=solo` for windows)
Test whether the app has been installed correctly and is working via the following cURL commands (or use them in Postman):
- Check if the app is running via the `status` API
$ curl --location --request GET 'http://localhost:5000/status'
- Check if the core app API and the Celery task are working via
$ curl --location --request GET 'http://localhost:5000/api/v1/core/test'
- Check if authorization is working via (change the `API Key` as per your `.env`)
$ curl --location --request GET 'http://localhost:5000/api/v1/core/restricted' --header 'x-api-key: 436236939443955C11494D448451F'
- To test the file streaming task, upload `/data/data.csv` to your AWS S3 bucket and then update the `.env` with its bucket, key and profile name
$ curl --location --request GET 'http://localhost:5000/api/v1/core/s3_select'
- To test the file parallel processing task, upload `/data/data.csv` to your AWS S3 bucket and then update the `.env` with its bucket, key and profile name (a simplified sketch of the parallel idea follows this list)
$ curl --location --request GET 'http://localhost:5000/api/v1/core/s3_select_parallel'
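The sketch below is not the repo's implementation; it only illustrates the underlying idea of querying disjoint `ScanRange` chunks concurrently, here with plain threads and placeholder bucket/key values.

```python
# Illustrative only: parallel S3 Select over disjoint ScanRange chunks using threads.
# BUCKET, KEY and NUM_WORKERS are placeholders, not values from this repo.
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "my-bucket"
KEY = "data/data.csv"
NUM_WORKERS = 4

s3 = boto3.client("s3")  # boto3 low-level clients are thread-safe
file_size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
chunk = -(-file_size // NUM_WORKERS)  # ceiling division


def select_range(byte_range):
    """Run S3 Select over one byte range and return the raw JSON lines."""
    start, end = byte_range
    response = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression="SELECT * FROM s3object s",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"JSON": {}},
        ScanRange={"Start": start, "End": end},
    )
    return "".join(
        event["Records"]["Payload"].decode("utf-8")
        for event in response["Payload"]
        if "Records" in event
    )


ranges = [(i * chunk, min((i + 1) * chunk, file_size)) for i in range(NUM_WORKERS)]
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    parts = list(pool.map(select_range, ranges))

print(sum(part.count("\n") for part in parts), "records fetched in parallel")
```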
- My post explaining this approach
- AWS S3 Select user guide
- AWS S3 Select Example
- AWS S3 Select boto3 reference
This program is free software under the MIT license. Please see the LICENSE file in our repository for the full text.