Automated supervised learning pipeline for non-targeted GC-MS data analysis

Sirén K, Fischer U and Vestner J. Automated supervised learning pipeline for non-targeted GC-MS data analysis, Analytica Chimica Acta X, https://doi.org/10.1016/j.acax.2019.100005.

Hands-off Python-based workflow that uses supervised learning to select the features responsible for class differentiation directly from raw GC-MS data, before any downstream analysis. It is currently optimized for unit mass resolution MS data, but could be extended to high resolution MS data or 2D chromatographic systems. It currently works with a segmentation strategy for the chromatograms, but could also be adapted to other "preselection" approaches such as more common feature extraction methods or peak picking.

Workflow requirements

The workflow is currently served inside a Jupyter notebook. Install the packages specified in requirements.txt. Contact the authors for guidance.

pip3 install --trusted-host pypi.python.org -r requirements.txt

Setup for analysis

Set the filenames correctly (metadata is currently read from the sample names)
Set the folder structure accordingly
Set the paths to the folders in the Jupyter notebook, section 1.4. "Import the metadata"
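
Because the metadata is read from the sample names, a consistent naming scheme matters. The sketch below shows one way such parsing could look. The `<class>_<replicate>` scheme, the `chroms/` folder, the `.cdf` extension and the `parse_sample_name` helper are all hypothetical and only serve as an illustration; the notebooks define the actual scheme.

```python
from pathlib import Path


def parse_sample_name(path):
    """Split a hypothetical '<class>_<replicate>.<ext>' filename into metadata."""
    stem = Path(path).stem  # 'chroms/riesling_03.cdf' -> 'riesling_03'
    sample_class, _, replicate = stem.rpartition("_")
    if not sample_class or not replicate.isdigit():
        raise ValueError(f"unexpected sample name: {path!r}")
    return {"sample": stem, "class": sample_class, "replicate": int(replicate)}


# Hypothetical file names; the real scheme depends on the dataset.
files = [
    "chroms/riesling_01.cdf",
    "chroms/riesling_02.cdf",
    "chroms/pinot_noir_01.cdf",
]
metadata = [parse_sample_name(f) for f in files]
classes = sorted({m["class"] for m in metadata})
```

Failing loudly on an unexpected name makes a misnamed file visible before the analysis starts, rather than assigning it to the wrong class.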

In brief what actually happens

All chromatograms are segmented along the retention time axis. For each sample, each segment (a data matrix of scans x m/z) is transformed into a "covariance" matrix. For each segment, the transformed matrices of all samples are then stacked to form a tensor (a 3D array, i.e. a stack of matrices). Each tensor is decomposed and an XGBoost model is fitted. Model predictions are evaluated against the test set (currently with a leave-one-out strategy, which can be changed easily), and a prediction score is computed for each segment. The best-scoring segments are then retained for downstream analysis, which saves a lot of manual work.
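
The steps above can be sketched with NumPy on synthetic data. This is a minimal illustration under several assumptions, not the code used in the notebooks: all chromatograms are assumed to share the same number of scans, the "covariance" transform is taken to be the plain covariance across scans (`np.cov`), the tensor decomposition step is skipped, and a nearest-centroid classifier stands in for XGBoost so that the example needs no extra packages. All function names are hypothetical.

```python
import numpy as np


def segment_bounds(n_scans, segment_len):
    """Return (start, stop) scan indices that tile the retention time axis."""
    return [(s, min(s + segment_len, n_scans)) for s in range(0, n_scans, segment_len)]


def to_covariance(segment):
    """Map a (scans x m/z) segment to an (m/z x m/z) matrix.

    Assumption: plain covariance across scans; the workflow's own
    "covariance" transform may differ.
    """
    return np.cov(segment, rowvar=False)


def build_tensors(chromatograms, segment_len):
    """Build one (samples x m/z x m/z) tensor per retention time segment."""
    n_scans = chromatograms[0].shape[0]
    tensors = []
    for start, stop in segment_bounds(n_scans, segment_len):
        mats = [to_covariance(c[start:stop]) for c in chromatograms]
        tensors.append(np.stack(mats))
    return tensors


def loo_accuracy(features, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in for XGBoost)."""
    labels = np.asarray(labels)
    hits = 0
    for i in range(len(labels)):
        train = np.arange(len(labels)) != i  # hold out sample i
        classes = np.unique(labels[train])
        centroids = np.stack(
            [features[train & (labels == c)].mean(axis=0) for c in classes]
        )
        distances = np.linalg.norm(centroids - features[i], axis=1)
        hits += int(classes[np.argmin(distances)] == labels[i])
    return hits / len(labels)


# Synthetic data: 6 samples, 100 scans, 8 m/z channels.
rng = np.random.default_rng(0)
labels = ["A", "A", "A", "B", "B", "B"]
peak = 5.0 * np.exp(-0.5 * ((np.arange(25) - 12) / 3.0) ** 2)
chromatograms = []
for label in labels:
    chrom = rng.random((100, 8))
    if label == "B":
        chrom[25:50, 2:4] += peak[:, None]  # class-specific peak in segment 1
    chromatograms.append(chrom)

tensors = build_tensors(chromatograms, segment_len=25)
# Flatten each sample's matrix into a feature vector (decomposition skipped here).
scores = [loo_accuracy(t.reshape(len(labels), -1), labels) for t in tensors]
```

In this toy setup only the second segment contains a class-specific peak, so it is the one where the held-out samples are reliably classified. Ranking segments by such a score is the idea behind retaining only the best-scoring segments.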

During analysis two files are generated.

A savefile .npz (806 MB for the wine dataset) and a segmentation result .csv file (1.5 MB for the wine dataset) are generated.

Note: Currently the workflow can only be run once with the same savefilename unless the results are renamed or deleted first, because the segmentation results get added to the existing segmentation result .csv file.
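
One way to avoid mixing results from two runs is to check for existing output files before starting. The sketch below is an assumption about how such a guard could look. The `check_outputs_are_new` helper and the file names are hypothetical and not part of the notebooks.

```python
from pathlib import Path


def check_outputs_are_new(savefile, results_csv):
    """Raise if either output file already exists, so results are not mixed."""
    existing = [str(p) for p in (Path(savefile), Path(results_csv)) if p.exists()]
    if existing:
        raise FileExistsError(
            "rename or delete before re-running: " + ", ".join(existing)
        )


# Hypothetical output names; call this before starting the analysis.
check_outputs_are_new("wine_run1.npz", "wine_run1_segments.csv")
```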

In order to apply this to other datasets the following edits are needed:

The class selections need to be optimised for other datasets
Plotting colors and markers are based on the sample classes
There is no universal sample import scheme
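
For the plotting part, one possible way to avoid hard-coding colors and markers per dataset is to derive them from the class labels. This is a sketch only. The `class_styles` helper and the class names are hypothetical, and the color and marker names are standard matplotlib values.

```python
from itertools import cycle


def class_styles(
    classes,
    colors=("tab:blue", "tab:orange", "tab:green", "tab:red"),
    markers=("o", "s", "^", "D"),
):
    """Assign one color/marker pair to each distinct class label."""
    styles = {}
    unique = sorted(set(classes))
    # cycle() repeats the palettes if there are more classes than entries.
    for cls, color, marker in zip(unique, cycle(colors), cycle(markers)):
        styles[cls] = {"color": color, "marker": marker}
    return styles


# Hypothetical class labels, one per sample.
styles = class_styles(["white", "red", "white", "rose"])
# With matplotlib, a class could then be drawn with e.g.
# plt.scatter(x, y, **styles["red"])
```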

References

Workflow reference https://www.sciencedirect.com/science/article/pii/S2590134619300015

Wine dataset reference https://www.sciencedirect.com/science/article/pii/S0003267016300903

