Proudly Powered by SURFRIDER Foundation Europe, this open-source initiative is part of the PLASTIC ORIGINS project - a citizen science project that uses AI to map plastic pollution in European rivers and share its data publicly. Browse the project repository to learn more about its initiatives and how you can get involved. Please consider starring ⭐ the project's repositories to show your interest and support. We rely on YOU to make this project a success, and we thank you in advance for your contributions.
Welcome to Plastic Origin ETL, an ETL (Extract, Transform, Load) data management process that produces the data leveraged within the Plastic Origins project. This data is then used to build analytics and reports on plastic pollution in rivers.
Please note that this project is under active development and that frequent changes and updates are to be expected.
Before you begin, ensure you have met the following requirements:
- You have a Windows/Linux/Mac machine that supports Python.
- You have installed the requirements with pip install -r requirements.txt.
- You have installed locally the latest version of Azure Functions Core Tools for Python.
- Language: Python
- Framework: Python 3.7
- Unit test framework: N/A
The ETL API can be deployed locally using the Azure Functions framework.
It is recommended to use a Python virtual environment before installing packages with pip. You also need to set the following environment variables: CONN_STRING, PGSERVER, PGDATABASE, PGUSERNAME, PGPWD.
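For example, a minimal local setup sketch (the values below are placeholders to replace with your own credentials):

# create and activate a virtual environment, then set the required variables
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
export CONN_STRING="<azure-storage-connection-string>"
export PGSERVER="<postgres-server>"
export PGDATABASE="<postgres-database>"
export PGUSERNAME="<postgres-user>"
export PGPWD="<postgres-password>"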
cd src/batch/etlAPI/
pip install -r requirements.txt
func start etlHttpTrigger
The GPS extraction subprocess requires binaries such as ffmpeg, which are not natively available within Python Azure Functions. To address this, the ETL Azure Function is also available as a Docker image. The Dockerfile to build the image is located here. You will need to pass the appropriate credentials at run time for the ETL to correctly use Azure Storage and the PostgreSQL database server.
cd src/batch/etlAPI/
docker build -t surfrider/etl:latest .
cd src/batch/etlAPI/
docker run -p 8082:80 --restart always --name etl -e PGUSERNAME=${PGUSERNAME} -e PGDATABASE=${PGDATABASE} -e PGSERVER=${PGSERVER} -e PGPWD=${PGPWD} -e CONN_STRING=${CONN_STRING} surfrider/etl:latest
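Alternatively, the same variables can be passed through an env file using Docker's --env-file option (a sketch; env.list is a hypothetical file holding the five variables above):

# env.list contains PGUSERNAME=..., PGDATABASE=..., PGSERVER=..., PGPWD=..., CONN_STRING=...
docker run -p 8082:80 --restart always --name etl --env-file env.list surfrider/etl:latest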
Parameters: target can be csv or postgre; prediction can be json or ai.
Ports: 7071 for the local API, 8082 for the Docker API.
manual: curl --request GET 'http://localhost:<port>/api/etlHttpTrigger?container=manual&blob=<blobname>&prediction=json&source=manual&target=csv&logid=<logid>'
mobile: curl --request GET 'http://localhost:<port>/api/etlHttpTrigger?container=mobile&blob=<blobname>&prediction=json&source=mobile&target=csv&logid=<logid>'
gopro: curl --request GET 'http://localhost:<port>/api/etlHttpTrigger?container=gopro&blob=<blobname>&prediction=json&source=gopro&target=csv&logid=<logid>'
mobile: curl --request GET 'http://localhost:<port>/api/etlHttpTrigger?container=mobile&blob=<blobname>&prediction=ai&source=mobile&target=csv&aiurl=<aiurl>&logid=<logid>'
gopro: curl --request GET 'http://localhost:<port>/api/etlHttpTrigger?container=gopro&blob=<blobname>&prediction=ai&source=gopro&target=csv&aiurl=<aiurl>&logid=<logid>'
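Loading directly into PostgreSQL should follow the same call pattern with target=postgre; for example (a sketch, assuming the remaining parameters are unchanged):

curl --request GET 'http://localhost:<port>/api/etlHttpTrigger?container=mobile&blob=<blobname>&prediction=json&source=mobile&target=postgre&logid=<logid>'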
The ETL Trigger Azure Function additionally defines three functions that automatically call the ETL API when new media to process are stored in Azure. They use the blob trigger capabilities defined within function.json. The simplest way to test is to publish directly to Azure with:
cd src/batch/etlBlobTrigger/
func azure functionapp publish <AZUREFUNCTIONApp>
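To verify the deployment (assuming Azure Functions Core Tools is installed and you are signed in to Azure), you can list the published functions:

func azure functionapp list-functions <AZUREFUNCTIONApp>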
The ETL is made of three parts:
- The BlobTrigger function, which is triggered by new media
- The ETL API function, which does the actual ETL process
- The ETL batch, which is a DAG running on Airflow
If you want to re-run the full processing from the starting data:
- We assume that the files (.json, and .mp4 if automatic) are present in the blob storage.
- We assume that the blob trigger worked properly and that the table campaign.campaign contains the campaign details (see the check sketched below).
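A minimal check sketch with psql, assuming the campaign is identified by an id column in campaign.campaign (verify the actual column name against your schema):

# connect with the same environment variables used by the ETL
export PGPASSWORD=${PGPWD}
psql -h ${PGSERVER} -U ${PGUSERNAME} -d ${PGDATABASE} -c "SELECT * FROM campaign.campaign WHERE id = '<campaign_id>';"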
There are two steps:

trigger_batch_etl_all

The trigger_batch_etl_all DAG will re-insert the trash into campaign.trash. To rerun this process for a given campaign campaign_id, you need to (see the sketch below):
- remove the rows in campaign.trash which correspond to the campaign_id
- in the logs.etl row with the corresponding campaign_id, set the column "status" to "notprocessed"
You may then run the trigger_batch_etl_all DAG.
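A minimal sketch of these two steps with psql, assuming campaign.trash and logs.etl both reference the campaign through a campaign_id column (verify the actual column names against your schema):

export PGPASSWORD=${PGPWD}
# 1. remove the trash rows of the campaign
psql -h ${PGSERVER} -U ${PGUSERNAME} -d ${PGDATABASE} -c "DELETE FROM campaign.trash WHERE campaign_id = '<campaign_id>';"
# 2. reset the ETL log status for the campaign
psql -h ${PGSERVER} -U ${PGUSERNAME} -d ${PGDATABASE} -c "UPDATE logs.etl SET status = 'notprocessed' WHERE campaign_id = '<campaign_id>';"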
bi processing and postprocessing

The bi-processing DAG will recompute the different metrics related to the trash and the campaign. To rerun this process for a given campaign campaign_id, you need to (see the sketch below):
- in the table campaign.campaign, for the row of campaign_id, set the column "has_been_computed" to NULL
- remove the line corresponding to campaign_id in bi_temp.pipelines
You may then run the bi-processing DAG, which will update the bi tables and run bi-postprocessing.
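A minimal sketch of these two steps with psql, assuming campaign.campaign uses an id column and bi_temp.pipelines a campaign_id column (verify the actual column names against your schema):

export PGPASSWORD=${PGPWD}
# 1. mark the campaign as not computed
psql -h ${PGSERVER} -U ${PGUSERNAME} -d ${PGDATABASE} -c "UPDATE campaign.campaign SET has_been_computed = NULL WHERE id = '<campaign_id>';"
# 2. remove the campaign from the BI temp pipelines
psql -h ${PGSERVER} -U ${PGUSERNAME} -d ${PGDATABASE} -c "DELETE FROM bi_temp.pipelines WHERE campaign_id = '<campaign_id>';"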
It's great to have you here! We welcome any help and thank you in advance for your contributions.
- Feel free to report a problem/bug or propose an improvement by creating a new issue. Please document as much as possible the steps to reproduce your problem (even better with screenshots). If you think you have discovered a security vulnerability, please contact our Maintainers directly.
- Take a look at the open issues labeled as help wanted; feel free to comment to share your ideas or to submit a pull request if you feel you can fix the issue yourself. Please document any relevant changes.
If you experience any problems, please don't hesitate to ping:
Special thanks to all our Contributors.
We’re using the MIT License. For more details, check the LICENSE file.