The SCALPEL3 framework is designed for scalable, reproducible, and easy medical concept extraction from large observational databases (LODs). In its current form, the library focuses on [SNDS](https://www.snds.gouv.fr/SNDS/Accueil) data. However, we believe it could be easily adapted to other LODs.
This work results from a research partnership between École Polytechnique and Caisse Nationale d'Assurance Maladie, started in 2015 by Emmanuel Bacry and Stéphane Gaïffas. Since then, many research engineers and Ph.D. students have developed and used this framework to perform research on SNDS data. The full list of contributors is available in CONTRIBUTORS.md. This library is based on PySpark.
SCALPEL-Flattening, based on Apache Spark, denormalizes Système National des Données de Santé (SNDS) data to accelerate concept extraction with SCALPEL-Extraction. Denormalization consists of several join operations and data compression, resulting in a large flat table representing an SNDS database, such as DCIR or PMSI. Denormalized tables are written as Apache Parquet or Apache ORC files. The source code and documentation can be found in this repository.
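To make the idea of denormalization concrete, here is a minimal sketch in plain Python rather than Spark. It is not the SCALPEL-Flattening API: the `left_join` helper, the toy tables, and the column names (`claim_id`, `patient_id`, `drug_code`) are all hypothetical and do not reflect the actual SNDS schema. It only illustrates how joining a central table with a detail table yields one flat table.

```python
from collections import defaultdict


def left_join(left, right, key):
    """Left-join two lists of row dicts on `key`, keeping unmatched left rows."""
    # Index the right-hand table by key so each lookup is constant time.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for row in left:
        matches = index.get(row[key])
        if not matches:
            # No detail rows: keep the central row as-is.
            joined.append(dict(row))
            continue
        for match in matches:
            # One output row per (central row, detail row) pair.
            joined.append({**row, **match})
    return joined


# Hypothetical toy tables; column names are illustrative only.
claims = [
    {"claim_id": 1, "patient_id": "A"},
    {"claim_id": 2, "patient_id": "B"},
]
drugs = [
    {"claim_id": 1, "drug_code": "X01"},
    {"claim_id": 1, "drug_code": "X02"},
]

flat = left_join(claims, drugs, "claim_id")
for row in flat:
    print(row)
```

Once the joins are materialized this way, later extraction steps can scan a single table instead of repeating the joins for every query, which is the motivation given above for flattening.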
SCALPEL-Extraction provides concept extractors meant to fetch meaningful Medical Events & Patients from Système National des Données de Santé (SNDS) data.
This library is based on Apache Spark. It reads the denormalized data produced by running SCALPEL-Flattening on raw SNDS data. It then extracts sets of Patients with associated Events, stored as Parquet or ORC files along with metadata tracking the transformations applied to the input data. The source code and documentation can be found in this repository.
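The extraction step can be pictured as a filter-and-map over the flat table. The sketch below is a plain Python illustration under stated assumptions, not the SCALPEL-Extraction API: the `Event` fields, the `extract_events` function, and the row layout are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: who, what kind, which code, and when."""
    patient_id: str
    category: str
    value: str
    start: date


def extract_events(rows, column, category, codes):
    """Return one Event per row whose `column` holds one of `codes`."""
    events = []
    for row in rows:
        code = row.get(column)
        if code in codes:
            events.append(Event(row["patient_id"], category, code, row["date"]))
    return events


# Hypothetical rows from a flattened table.
flat_rows = [
    {"patient_id": "A", "drug_code": "X01", "date": date(2015, 3, 1)},
    {"patient_id": "A", "drug_code": "Y09", "date": date(2015, 4, 2)},
    {"patient_id": "B", "drug_code": None, "date": date(2015, 5, 7)},
]

drug_events = extract_events(flat_rows, "drug_code", "drug", {"X01", "X02"})
# The set of patients with at least one matching event.
patients = {event.patient_id for event in drug_events}
```

Here only the first row matches, so `drug_events` holds a single event for patient `A`.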
SCALPEL-Analysis is based on PySpark. It provides abstractions that ease cohort data analysis and manipulation. While it can be used standalone, it expects inputs formatted like the data produced by SCALPEL-Extraction. The source code and documentation can be found in this repository.
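As a rough illustration of what a cohort abstraction might look like, the sketch below models a cohort as a named set of patient identifiers with set-style combinators. This is an assumption made for the example: the `Cohort` class, its methods, and the cohort names are hypothetical and should not be read as the SCALPEL-Analysis interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cohort:
    """Hypothetical cohort: a name plus an immutable set of patient IDs."""
    name: str
    patients: frozenset

    def intersection(self, other):
        # Patients present in both cohorts.
        return Cohort(f"{self.name} AND {other.name}", self.patients & other.patients)

    def difference(self, other):
        # Patients in this cohort but not in the other.
        return Cohort(f"{self.name} NOT {other.name}", self.patients - other.patients)


exposed = Cohort("exposed", frozenset({"A", "B", "C"}))
fractured = Cohort("fractured", frozenset({"B", "C", "D"}))

cases = exposed.intersection(fractured)
controls = exposed.difference(fractured)
```

Keeping the derived name alongside the patient set is one simple way to record how a cohort was built, in the spirit of the metadata tracking described for the extraction step.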
If you use a library that is part of SCALPEL3 in a scientific publication, we would appreciate a citation. You can use the following BibTeX entry:
@article{bacry2020scalpel3,
title={SCALPEL3: a scalable open-source library for healthcare claims databases},
author={Bacry, Emmanuel and Gaiffas, St{\'e}phane and Leroy, Fanny and Morel, Maryan and Nguyen, Dinh-Phong and Sebiat, Youcef and Sun, Dian},
journal={International Journal of Medical Informatics},
pages={104203},
year={2020},
publisher={Elsevier}
}