Identifying homogeneous subgroups of patients and important features: a topological machine learning approach
Ewan Carr¹, Mathieu Carrière², Bertrand Michel³, Fred Chazal⁴, Raquel Iniesta¹

¹ Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King's College London, United Kingdom.
² Inria Sophia-Antipolis, DataShape team, Biot, France.
³ Ecole Centrale de Nantes, LMJL — UMR CNRS 6629, Nantes, France.
⁴ Inria Saclay, Ile-de-France, France.
For more information, please see the 🔓 paper in BMC Bioinformatics:
Carr, E., Carrière, M., Michel, B. et al. Identifying homogeneous subgroups of patients and important features: a topological machine learning approach. BMC Bioinformatics 22, 449 (2021). https://doi.org/10.1186/s12859-021-04360-9
Please contact [email protected] for queries.
This repository provides a pipeline for clustering based on topological data analysis:
**Background** This paper exploits recent developments in topological data analysis to present a pipeline for clustering based on Mapper, an algorithm that reduces complex data into a one-dimensional graph.

**Results** We present a pipeline to identify and summarise clusters based on statistically significant topological features from a point cloud using Mapper.

**Conclusions** Key strengths of this pipeline include the integration of prior knowledge to inform the clustering process and the selection of optimal clusters; the use of the bootstrap to restrict the search to robust topological features; the use of machine learning to inspect clusters; and the ability to incorporate mixed data types. Our pipeline can be downloaded under the GNU GPLv3 license at https://github.com/kcl-bhi/mapper-pipeline.
Our pipeline is written in Python 3 and builds on several open source packages, including `sklearn-tda`, `GUDHI`, `xgboost`, and `pygraphviz`.
To get started, clone this repository and create a new virtual environment:
```bash
git clone https://github.com/kcl-bhi/mapper-pipeline.git
cd mapper-pipeline
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
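As a quick sanity check after installing the requirements, the core dependencies can be imported directly (a minimal sketch; `sklearn_tda` is the module name of the sklearn-tda package):

```python
# Confirm the pipeline's core dependencies are importable.
import gudhi        # topological data analysis (persistence, Mapper)
import pygraphviz   # graph layout and visualisation
import sklearn_tda  # topological machine learning tools (Mapper)
import xgboost      # gradient boosting, used to inspect clusters

print("GUDHI", gudhi.__version__, "| XGBoost", xgboost.__version__)
```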
The pipeline expects an input dataset in CSV format. In our application, the dataset contained information on ≈140 variables for ≈430 participants. If `input.csv` is not found in the working directory, sample data will be simulated.
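For illustration only, a small mixed-type dataset of the expected shape could be created along these lines (column names are hypothetical, the single-column format assumed for `categorical_items.csv` is an assumption, and the pipeline's built-in simulation may differ):

```python
# Create a hypothetical input.csv plus categorical_items.csv listing
# the categorical variables (used later for the Gower distance).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 430  # roughly the size of the application dataset

data = pd.DataFrame({
    "age": rng.normal(45, 12, n).round(1),
    "symptom_score": rng.normal(0, 1, n),
    "sex": rng.choice(["male", "female"], n),
    "treatment": rng.choice(["A", "B"], n),
})
data.to_csv("input.csv", index=False)

pd.Series(["sex", "treatment"], name="item").to_csv(
    "categorical_items.csv", index=False)
```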
The scripts should be used as follows:
- **Generate input files**

  ```bash
  python3 prepare_inputs.py
  ```

  This script will:

  - Load the input dataset.
  - Construct the Gower distance matrix. Note that this requires categorical variables to be specified using `categorical_items.csv` (a rough sketch of this step is shown below).
  - Define sets of parameters to explore via grid search.
  - Create a dictionary containing all combinations of input parameters:

    ```python
    params = [{'fil': f, 'res': r, 'gain': gain}
              for f in fil.items()
              for r in resolutions
              for gain in [0.1, 0.2, 0.3, 0.4]]
    ```

  - Store each set of inputs, and other required data, in the `inputs` directory.
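  As a rough illustration of the Gower distance step (the actual implementation in `prepare_inputs.py` may differ, and missing-value handling is omitted here):

  ```python
  # Minimal Gower distance sketch: categorical variables contribute a 0/1
  # mismatch, numeric variables a range-normalised absolute difference,
  # and the distance is the average contribution across variables.
  import numpy as np
  import pandas as pd

  data = pd.read_csv("input.csv")
  categorical = pd.read_csv("categorical_items.csv").iloc[:, 0].tolist()

  n = len(data)
  dist = np.zeros((n, n))
  for col in data.columns:
      x = data[col].to_numpy()
      if col in categorical:
          # Categorical variable: 0 if values match, 1 otherwise
          d = (x[:, None] != x[None, :]).astype(float)
      else:
          # Numeric variable: range-normalised absolute difference
          rng = np.ptp(x)
          d = np.abs(x[:, None] - x[None, :]) / (rng if rng > 0 else 1.0)
      dist += d

  gower = dist / data.shape[1]  # average over variables
  np.save("gower_distance.npy", gower)
  ```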
- **Run Mapper for each set of input parameters**

  The script `test_single_graph.py` runs Mapper for a single set of parameters. It requires three arguments:

  ```bash
  python3 test_single_graph.py '0333' 'inputs' 'outputs'
  ```

  Here `0333` refers to the set of parameters to test; `inputs` and `outputs` specify the folders from which inputs are loaded and to which outputs are saved. This script:

  - Runs Mapper for the specified parameters (using `MapperComplex`).
  - Identifies statistically significant, representative topological features.
  - Extracts the required summaries and stores them in the `outputs` subfolder.
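  A rough, hypothetical sketch of the pattern this script follows (the file names and pickle serialisation below are assumptions for illustration, not the pipeline's actual layout):

  ```python
  # Take a zero-padded job ID plus input/output folders from the command line,
  # load that job's pre-computed inputs, run Mapper, and save the results.
  import pickle
  import sys
  from pathlib import Path

  job_id, input_dir, output_dir = sys.argv[1], Path(sys.argv[2]), Path(sys.argv[3])

  # Load the parameter set (and any data) prepared by prepare_inputs.py
  with open(input_dir / f"{job_id}.pickle", "rb") as f:
      job = pickle.load(f)

  # ... run Mapper (MapperComplex) with job['fil'], job['res'], job['gain'],
  # then identify significant, representative topological features ...
  results = {"params": job, "features": []}  # placeholder for extracted summaries

  output_dir.mkdir(exist_ok=True)
  with open(output_dir / f"{job_id}.pickle", "wb") as f:
      pickle.dump(results, f)
  ```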
- **Process all outputs and produce summaries**

  ```bash
  python3 process_outputs.py
  ```

  This script:

  - Loads all outputs (from `outputs`).
  - Excludes graphs with no significant features, as well as duplicate graphs.
  - Splits each graph into separate topological features and removes features containing <5% or >95% of the sample.
  - Derives the required summaries for each feature, including homogeneity among feature members with respect to a pre-specified outcome.
  - Ranks all features by homogeneity and selects the top N features (a sketch of this ranking step follows the list).
  - Visualises each top-ranked feature and writes the summaries to a spreadsheet.
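As a rough illustration of the ranking step only (not the pipeline's actual homogeneity definition; the function and variable names are hypothetical, and the simple majority-share measure stands in for the definition used in the paper):

```python
# Rank candidate topological features by how homogeneous their members are
# with respect to a pre-specified outcome, then keep the top N features.
import pandas as pd

def homogeneity(outcomes):
    """Proportion of feature members sharing the majority outcome value."""
    return outcomes.value_counts(normalize=True).max()

def rank_features(features, outcome, top_n=10):
    """`features` maps a feature ID to the row indices of its members;
    `outcome` is the pre-specified outcome as a pandas Series."""
    rows = [{"feature": fid,
             "size": len(members),
             "homogeneity": homogeneity(outcome.iloc[members])}
            for fid, members in features.items()]
    ranked = pd.DataFrame(rows).sort_values("homogeneity", ascending=False)
    return ranked.head(top_n)
```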
The grid search can be time-consuming, especially as the number of parameter settings increases. Fortunately, this process can be straightforwardly parallelised, either using multiple cores on a local machine or using cluster computing.

On a single machine, using GNU `parallel`:

```bash
python3 prepare_inputs.py
parallel --progress -j4 < jobs
```
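Here `jobs` is a plain-text file read by GNU `parallel`, presumably with one `test_single_graph.py` command per parameter set. If it is not already produced by the pipeline, it could be generated along these lines (a hypothetical helper; the `inputs/` file naming is an assumption):

```python
# Build the `jobs` file for GNU parallel, one command per prepared parameter set.
# Assumes inputs are stored under inputs/ with zero-padded IDs (e.g. 0001.pickle).
from pathlib import Path

with open("jobs", "w") as jobs_file:
    for path in sorted(Path("inputs").glob("*.pickle")):
        job_id = path.stem  # e.g. "0001"
        jobs_file.write(f"python3 test_single_graph.py '{job_id}' 'inputs' 'outputs'\n")
```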
On a cluster:
```bash
#!/bin/bash
#SBATCH --tasks=8
#SBATCH --mem=4000
#SBATCH --job-name=array
#SBATCH --array=1-2000
#SBATCH --output=logs/%a.out
#SBATCH --time=0-72:00

count=$(printf "%03d" $SLURM_ARRAY_TASK_ID)
python3 test_single_graph.py $count "inputs" "outputs"
```
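A couple of practical notes: SLURM does not create the `logs/` directory used for `--output`, so it should exist before the array job is submitted, and the zero-padded `$count` produced by `printf` needs to match the naming convention used when the input files were generated.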