Luigi-based pipeline to sweep and select machine learning models for the plastics outcomes projection. This is used by https://global-plastics-tool.org/.
The pipeline executes pre-processing and machine learning tasks on the raw "input" data for the plastics business-as-usual projection model, producing those projections in multiple ways:
- Naive: Simple polynomial curve fitting extrapolation of past trends for trade, waste, and consumption.
- Curve: Simple polynomial model that predicts trade, waste, and consumption having fit a curve against those response variables using population and GDP as input.
- ML: A more sophisticated machine learning sweep which considers SVR, CART / trees, AdaBoost, and Random Forest.
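As a rough illustration of the Naive approach, the sketch below fits a polynomial to a past trend and evaluates it at future years. It is not the pipeline's code: the function name `fit_and_extrapolate`, the polynomial degree, and the data series are hypothetical and chosen only for the example.

```python
import numpy as np


def fit_and_extrapolate(years, values, future_years, degree=2):
    """Fit a polynomial to a past trend and evaluate it at future years."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    # Center the years so the least-squares fit stays well conditioned.
    offset = years.min()
    coefficients = np.polyfit(years - offset, values, degree)
    future = np.asarray(future_years, dtype=float) - offset
    return np.polyval(coefficients, future)


# Hypothetical consumption series that follows an exact quadratic trend.
past_years = list(range(2010, 2021))
past_values = [100 + 3 * (y - 2010) + 0.5 * (y - 2010) ** 2 for y in past_years]
projected = fit_and_extrapolate(past_years, past_values, [2025, 2030])
```

The Curve approach differs in that population and GDP, rather than the year, would serve as the inputs to the fitted curve.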
In practice, the machine learning branch is used by the tool.
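A model sweep of this kind can be sketched with scikit-learn as follows. This is a minimal illustration, not the pipeline's actual selection logic: the hyperparameters, the cross-validation setup, the error metric, and the synthetic data standing in for population and GDP are all assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def sweep_models(features, response, folds=3):
    """Cross-validate each candidate and return the lowest-error name."""
    # Hyperparameters here are illustrative, not those used by the pipeline.
    candidates = {
        "svr": SVR(kernel="rbf", C=10.0),
        "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
        "adaboost": AdaBoostRegressor(n_estimators=20, random_state=0),
        "random_forest": RandomForestRegressor(n_estimators=20, random_state=0),
    }
    scores = {}
    for name, model in candidates.items():
        fold_scores = cross_val_score(
            model, features, response, cv=folds, scoring="neg_mean_absolute_error"
        )
        # scikit-learn reports negated errors, so flip the sign back.
        scores[name] = -float(fold_scores.mean())
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic stand-ins for the population and GDP inputs.
rng = np.random.default_rng(0)
features = rng.uniform(0.0, 1.0, size=(60, 2))
response = 5.0 * features[:, 0] + 2.0 * features[:, 1] ** 2
best_name, mean_errors = sweep_models(features, response)
```

Comparing held-out error rather than training error is what lets a sweep like this flag candidates that overfit.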
Most users can simply reference the output from the latest execution. That output is written to https://global-plastics-tool.org/datapipeline.zip and is publicly available under the CC-BY-NC License. That said, users may also leverage a local environment if desired.
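For those who only need the published output, a sketch like the following could fetch the archive and list what it contains. The helper names are hypothetical, and the archive's internal file layout is not described here, so the example only enumerates member names rather than assuming any of them.

```python
import io
import urllib.request
import zipfile

OUTPUT_URL = "https://global-plastics-tool.org/datapipeline.zip"


def download_outputs(url=OUTPUT_URL):
    """Download the published pipeline output archive as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def list_members(zip_bytes):
    """Return the sorted file names contained in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(archive.namelist())
```

Calling `list_members(download_outputs())` would then print the available files before any of them are extracted.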
A containerized Docker environment is available for execution. This will conduct the model sweeps and prepare the outputs required for the front-end tool. See COOKBOOK.md for more details.
In addition to the Docker container, a manual environment can be established simply by running `pip install -r requirements.txt`. This assumes that sqlite3 is installed. Afterwards, simply run `bash build.sh`.
The configuration for the Luigi pipeline can be modified by providing a custom JSON file. See task/job.json for an example. Note that, although a full sweep is conducted, the pipeline uses random forest by default because that approach tends to be less prone to overfitting. Parallelization can be enabled by changing the value of `workers`.
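One way to derive a custom configuration is to copy an existing JSON file and change only the worker count. The sketch below assumes that `workers` is a top-level key, which should be checked against task/job.json; the function name and the stand-in template are hypothetical.

```python
import json
import pathlib
import tempfile


def write_custom_config(template_path, output_path, workers):
    """Copy a JSON config, overriding the assumed top-level 'workers' key."""
    config = json.loads(pathlib.Path(template_path).read_text())
    config["workers"] = workers
    pathlib.Path(output_path).write_text(json.dumps(config, indent=2))
    return config


# Demonstration against a stand-in template, not the real task/job.json.
with tempfile.TemporaryDirectory() as work_dir:
    template = pathlib.Path(work_dir) / "job.json"
    template.write_text(json.dumps({"workers": 1}))
    custom = pathlib.Path(work_dir) / "custom_job.json"
    updated = write_custom_config(template, custom, workers=4)
    reloaded = json.loads(custom.read_text())
```

Writing to a separate file leaves the example configuration untouched, so the defaults remain available for comparison.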
For examples of adding new regions or updating existing data, see COOKBOOK.md.
An inputs snapshot for reproducibility is located at `data/snapshot.db`. Use of this preformatted dataset is controlled through `const.USE_PREFORMATTED`, which defaults to True, meaning that the included SQLite snapshot is used. For more details, see the `data` directory.
Note that an interactive tool for this model is also available at https://github.com/SchmidtDSE/plastics-prototype.
Set up the local environment with `pip install -r requirements.txt`.
Some unit tests and other automated checks are available. The following is recommended:
$ pip install pycodestyle pyflakes nose2
$ pyflakes *.py
$ pycodestyle *.py
$ nose2
Note that unit tests and code quality checks are run in CI / CD.
This pipeline can be deployed by merging to the `deploy` branch of the repository, which triggers GitHub Actions. This will cause the pipeline output files to be written to https://global-plastics-tool.org/datapipeline.zip.
CI / CD should be passing before merges to `main`, which is used to stage pipeline deployments, and to `deploy`. Where possible, please follow the Google Python Style Guide. Please note that tests run as part of the pipeline itself, and separate test files are encouraged but not required. That said, developers should document which tasks are tests and expand these tests like typical unit tests as needed in the future. We allow lines to go to 100 characters.
Citations for data in this repository:
- DESA, World Population Prospects 2022 (2022).
- R. Geyer, J. R. Jambeck, K. L. Law, Production, use, and fate of all plastics ever made. Sci. Adv. 3, e1700782 (2017).
- C. Liu, S. Hu, R. Geyer. Manuscript in progress (2024).
- OECD, Real GDP long-term forecast (2023).
Our thanks to those authors and resources. Manuscript in progress data available upon request to authors.
See also source code for the web-based tool running at global-plastics-tool.org and source code for the GHG pipeline.
This project is released as open source (BSD and CC-BY-NC). See LICENSE.md for further details. In addition to this, please note that this project uses the following open source software:
- Luigi under the Apache v2 License.
- Pathos under the BSD License.
- scikit-learn under the BSD License.
- scipy under the BSD License.
The following may also be invoked as executables, such as from the command line, but are not statically linked to the code:
- Docker under the Apache v2 License.
- Python 3.8 under the PSF License.
- SQLite 3 which is in the public domain.
Additional license information:
- DESA. "World Population Prospects 2022." United Nations, Department of Economic and Social Affairs, Population Division, 2022. https://population.un.org/wpp/Download. CC BY 3.0 license IGO.
- OECD. "Real GDP Long-Term Forecast." OECD, 2023. https://doi.org/10.1787/d927bc18-en. Part of long term baseline projections available under a permissive license under terms Ic which carries an acknowledgement requirement.