SCALPEL-Extraction is a library of the SCALPEL3 framework, resulting from a research partnership between École Polytechnique and Caisse Nationale d'Assurance Maladie, started in 2015 by Emmanuel Bacry and Stéphane Gaïffas. Since then, many research engineers and PhD students have developed and used this framework to do research on SNDS data; the full list of contributors is available in CONTRIBUTORS.md.
It provides concept extractors meant to fetch meaningful Medical Events & Patients from Système National des Données de Santé (SNDS) data.
This library is based on Apache Spark. It reads flat data resulting from executing
SCALPEL-Flattening on raw SNDS data,
and then extracts Patients
and Events
in three steps:
- Reading the flat data from the files generated by the flattening step;
- Extracting "raw" events (such as Drug Dispensations, Diagnoses, Medical Acts, etc.) and convert them to Events;
- Transforming the "raw" events into "processed" events (such as follow-up periods, molecule exposures, outcomes, etc.) and convert them to Events;
Extracted data can easily be used to perform interactive data analysis using SCALPEL-Analysis.
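To make the normalized format concrete, here is a minimal sketch of what an event record carries. It is illustrative only: the field names are assumptions based on the event model described in the companion paper, not the library's exact Event class.

import java.sql.Timestamp

// Illustrative sketch only: a simplified version of the normalized event model,
// not the library's exact Event class (field names are assumptions).
case class SimpleEvent(
  patientID: String,     // pseudonymized patient identifier
  category: String,      // e.g. "drug_purchase", "diagnosis", "exposure"
  value: String,         // e.g. a CIP13 code, a CIM-10 code, a molecule name
  start: Timestamp,      // event start date
  end: Option[Timestamp] // event end date, when meaningful (e.g. for exposures)
)

// A drug purchase is a punctual event (no end date); the CIP13 code is made up:
val purchase = SimpleEvent("patient_1", "drug_purchase", "3400930000000",
  Timestamp.valueOf("2015-01-15 00:00:00"), None)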
Important remark: This software is currently in alpha stage. It should be fairly stable, but the API might still change and the documentation is partial. We are doing our best to improve documentation coverage as quickly as possible.
To build a JAR from this repository, you need SBT v0.13.15 (the Scala Build Tool) and the sbt-assembly plugin. Then, just run the following commands:
git clone [email protected]:X-DataInitiative/SCALPEL-Extraction.git
cd SCALPEL-Extraction
sbt assembly
### Input and Output
SCALPEL-Extraction reads the flat data resulting from executing SCALPEL-Flattening, which is saved in Parquet or ORC format. The correct format must be selected in the configuration (see the example): the read_file_format parameter sets the format used to read the flat data.
Once a job is finished, the results are saved to the file system (local file system or HDFS) in Parquet or ORC format; the write_file_format parameter sets the format used to save the results. If the ORC format is set in the configuration files, an extra setting must be added to the spark-submit command: spark.sql.orc.impl=native.
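As an illustration, the read_file_format setting maps to the Spark reader roughly as follows (function and variable names are ours, not the library's code):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of how the read_file_format setting could map to the Spark reader
// (illustrative names, not the library's code).
def readFlatTable(spark: SparkSession, path: String, readFileFormat: String): DataFrame =
  readFileFormat.toLowerCase match {
    case "orc"     => spark.read.orc(path)     // requires spark.sql.orc.impl=native
    case "parquet" => spark.read.parquet(path)
    case other     => throw new IllegalArgumentException(s"Unsupported format: $other")
  }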
Right now, configurations are tied to "studies". A study can be seen as a sub-package, possibly containing custom extractors, a main class orchestrating the extraction, and a default configuration.
For each study, a template configuration file containing the default values is defined. When running the study's main class, if a parameter needs to be changed, one just needs to copy this template file, edit it by uncommenting and modifying the appropriate lines, and pass it to spark-submit using the conf argument.
For example, the template configuration file for a study on the association between elderly falls and several drug groups is defined here. So, if one wants to override min_purchases, purchases_window, and cancer_definition, they just need to create a copy of this file on the master server and uncomment these lines, changing the values as appropriate:
# Previous line stay commented...
# exposures.purchases_window: 0 months // 0+ (Usually 0 or 6) Represents the window size in months. Ignored when min_purchases=1.
# exposures.end_threshold_gc: 90 days // If periodStrategy="limited", represents the period without purchases for an exposure to be considered "finished".
# exposures.end_threshold_ngc: 30 days // If periodStrategy="limited", represents the period without purchases for an exposure to be considered "finished".
exposures.end_delay: 30 days // Length of period to add to the exposure end to delay it (lag).
# drugs.level: "Therapeutic" // Options are Therapeutic, Pharmacological, MoleculeCombination
drugs.families: ["Antihypertenseurs", "Antidepresseurs", "Neuroleptiques", "Hypnotiques"]
# Next lines stay commented...
This file should then be stored with the results, to keep track of which configuration was used to generate a dataset. The commit number of the code used to extract events is included in the SCALPEL-Extraction results (metadata file). As a result, the configuration file and the metadata should be enough to reproduce a dataset extraction.
The entry points for executing the extraction are study-specific, and therefore live within the study package. The steps to run the extraction for a given study are the following.
To start an extraction job, run a spark-submit command containing:
- the --total-executor-cores and --executor-memory arguments;
- a --class argument pointing to the study's main class;
- the path to the JAR file created by sbt assembly;
- the name of the environment (i.e. the default parameter set of the study) to be used;
- optionally, a conf argument (passed after the JAR path) to override the environment parameters.
One can create an alias or script to make things easier. For example, for the Pioglitazone study, one could run the following shell script:
#!/bin/sh
spark-submit \
--driver-memory 40G \
--executor-memory 110G \
--total-executor-cores 150 \
--conf spark.task.maxFailures=20 \
--class fr.polytechnique.cmap.cnam.study.pioglitazone.PioglitazoneMain \
SCALPEL-Extraction-assembly-2.0.jar conf=./overrides.conf env=cmap
The Bulk Main is a special study that transforms all the SNDS data into our normalized format based on the Event class. It is intended to hide the complexity of the SNDS and ease statistical analysis. The extractors available in the Bulk are listed here.
The steps to use the Bulk Main are:
- Add a file under the directory src/main/resources/config/bulk/paths, for example, my_env.conf.
- In my_env.conf, add the links to your flattened SNDS data. See cmap.env for an example.
- In the file src/main/resources/config/bulk/default.conf, add the following:
my_env = ${root} {
include "paths/my_env.conf"
}
- Build your JAR.
- Using a shell, execute the following script:
spark-submit \
--total-executor-cores 160 \
--executor-memory 18G \
SCALPEL-Extraction-assembly-2.0.jar env=my_env
- Use the resulting metadata.json to load and analyse your data using SCALPEL-Analysis.
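For a quick first look at the output, the metadata file can be inspected with Spark. This is a sketch under the assumption that metadata.json is a multi-line JSON document; its exact schema depends on the run:

import org.apache.spark.sql.SparkSession

// Sketch: peek at the extraction metadata with Spark.
// Assumes metadata.json is a multi-line JSON document; fields depend on the run.
val spark = SparkSession.builder.getOrCreate()
val metadata = spark.read.option("multiLine", value = true).json("metadata.json")
metadata.printSchema()          // inspect the recorded outputs
metadata.show(truncate = false)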
Our package offers ready-to-use Extractors intended to extract events from raw, flat SNDS data and to output them in a simpler, normalized format. For more details, please read the companion paper.
Extractor | SNDS data source | Description |
---|---|---|
Act | MCO | Extracts CCAM acts available in PMSI-MCO (public hospitalization) |
Act | DCIR | Extracts CCAM acts available in DCIR (one-day & liberal acts) |
Act | MCO-ACE | Extracts CCAM acts available in MCO-ACE |
Main Diagnosis | MCO | Extracts the main diagnoses available in PMSI-MCO, coded in CIM-10 format |
Linked Diagnosis | MCO | Extracts the linked diagnoses available in PMSI-MCO, coded in CIM-10 format |
Associated Diagnosis | MCO | Extracts the associated diagnoses available in PMSI-MCO, coded in CIM-10 format |
Long Term Diseases | IMB | Extracts long-term disease ('ALD' in French) diagnoses available in IMB |
Hospital Stay | MCO | Extracts hospital stays |
Drug Purchases | DCIR | Extracts drug purchases available in DCIR. Can be adjusted to any desired level: CIP13, molecule, pharmacological class & therapeutic class |
Patients | Mainly IR-BEN-R | Extracts available patients with their gender, birth date, and death date when available |
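To illustrate the pattern shared by these extractors, here is a self-contained sketch using simplified types of our own (not the library's API): an extractor maps rows of a flattened source table to normalized events, keeping only the requested medical codes.

import java.sql.Timestamp

// Self-contained illustration of the extractor pattern (not the library's API):
// map rows of a flattened source table to normalized events, keeping only the
// requested medical codes.
case class McoRow(patientID: String, mainDiagnosis: String, stayStart: Timestamp)
case class RawEvent(patientID: String, category: String, value: String, start: Timestamp)

def extractMainDiagnoses(rows: Seq[McoRow], codesToKeep: Set[String]): Seq[RawEvent] =
  rows.collect {
    // keep rows whose CIM-10 code starts with one of the requested codes
    case r if codesToKeep.exists(r.mainDiagnosis.startsWith) =>
      RawEvent(r.patientID, "main_diagnosis", r.mainDiagnosis, r.stayStart)
  }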
Transformers combine multiple events produced by the extractors to produce higher-level, more complex Events. Although they are often study-specific, transformers can be configured to some extent. If they are not flexible enough for your use case, they can serve as a good starting point for your custom implementations.
Transformer | Combines | Description |
---|---|---|
FollowUp | Multiple | Combines multiple events to define a follow-up period per patient. |
Exposures | Drug Purchases & FollowUp | Combines drug purchase events with follow-up periods to create Exposures. Offers multiple strategies, such as limited & unlimited exposure duration. |
Outcome | Acts & Diagnoses | Creates complex outcomes based on Acts & Diagnoses data. Our list includes fractures, heart failure, infarction, and bladder cancer. |
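To make the idea concrete, here is a self-contained sketch of a simplified "unlimited duration" exposure strategy, using our own illustrative types rather than the library's Exposures transformer: an exposure starts at the first purchase of a drug family and lasts until the end of the follow-up period.

import java.sql.Timestamp

// Simplified sketch of an "unlimited duration" exposure strategy
// (illustrative types, not the library's Exposures transformer).
case class Purchase(patientID: String, drugFamily: String, date: Timestamp)
case class Exposure(patientID: String, drugFamily: String, start: Timestamp, end: Timestamp)

def unlimitedExposures(purchases: Seq[Purchase],
                       followUpEnd: Timestamp,
                       minPurchases: Int): Seq[Exposure] =
  purchases
    .groupBy(p => (p.patientID, p.drugFamily))
    .collect {
      // keep only (patient, drug family) pairs with enough purchases
      case ((pid, family), ps) if ps.size >= minPurchases =>
        // the exposure starts at the first purchase and ends with the follow-up
        Exposure(pid, family, ps.minBy(_.date.getTime).date, followUpEnd)
    }
    .toSeq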
If you use a library of the SCALPEL3 framework in a scientific publication, we would appreciate citations. You can use the following BibTeX entry:
@article{bacry2020scalpel3,
title={SCALPEL3: a scalable open-source library for healthcare claims databases},
author={Bacry, Emmanuel and Gaiffas, St{\'e}phane and Leroy, Fanny and Morel, Maryan and Nguyen, Dinh-Phong and Sebiat, Youcef and Sun, Dian},
journal={International Journal of Medical Informatics},
pages={104203},
year={2020},
publisher={Elsevier}
}
SCALPEL-Extraction is implemented in Scala 2.11.12 with Spark 2.3 and HDFS (Hadoop 2.7.3).
The code should follow the Databricks Scala Style Guide (which builds on the Scala Style Guide). You can use linters in your IDE (for instance Scalastyle, shipped by default with IntelliJ) to help you comply with these style guides.
We also try to follow, as much as possible, clean code best practices (see, for instance, the Clean Code book). Among them:
- Meaningful variable names;
- Methods should be small and do only one thing;
- Avoid needless complexity.
Our imports are based on the style suggested at the following link, with a few modifications. Therefore, every contributor should update their IDE accordingly. On IntelliJ (Settings > Editor > Code Style > Scala > Imports), we use the following import order:
java
javax
scala
all other imports
org.apache.spark
fr.polytechnique
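For example, an import block following this order might look like the following (the fr.polytechnique path is taken from the main class above; the other imports are illustrative):

// Example import block following the order above (illustrative imports).
import java.sql.Timestamp
import javax.naming.ConfigurationException
import scala.collection.mutable
import com.typesafe.config.Config
import org.apache.spark.sql.{Dataset, SparkSession}
import fr.polytechnique.cmap.cnam.study.pioglitazone.PioglitazoneMain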