Skip to content

This repository proposes a python implementation of a nested cross-validation pipeline compatible with scikit-learn API.

License

Notifications You must be signed in to change notification settings

ncaptier/nested_cross_val

Repository files navigation

nested_cross_val

This repository proposes a python implementation of nested cross-validation compatible with scikit-learn API.

Our implementation stands out from already existing ones for three main reasons :

  • It integrates a dask implementation for managing large data sets and complex pipelines and save precious computational time (more details here).
  • It gives access to the fitted estimators and their attributes. Therefore the user can add scores without having to refit the whole model or run different analyses with the attributes of each estimator (ex : feature importance analysis through a stability study).
  • It provides some plotting tools to visualize and analyze easily the results of the nested cross-validation (see here).

ncv_image

Installation

$ pip install git+https://github.com/ncaptier/nested_cross_val#egg=nested_cross_val

Experiments

We provide a jupyter notebook for an illustration of our nested cross-validation pipeline with real data :
*Classification of lung cancer subtype from bulk transcriptomics data

Data

The data set which goes with the jupyter notebook lung_cancer_classification.ipynb can be found in the .zip file data.zip. Please extract locally the data set before running the notebook.

Example

from sklearn.linear_model import LogisticRegression
from nested_cross_val.base import NestedCV

estimator = LogisticRegression(solver='saga' ,penalty='l1' , max_iter = 2000)

param_grid = {'C': np.logspace(-2, 2, 20)}

ncv = NestedCV(estimator = estimator , params = param_grid , cv_inner = 5  , cv_outer = 5  , 
               scoring_inner = 'roc_auc' , scoring_outer = {'roc_auc' : 'roc_auc' , 'average_precision' : 'average_precision'})

ncv.fit(X , y)

Acknowledgements

This package was created as a part of my PhD in the Computational Systems Biology of Cancer group of Institut Curie and the LITO laboratory.

References

"Bias in error estimation when using cross-validation for model selection" - S. Varma & R. Simon 2006

About

This repository proposes a python implementation of a nested cross-validation pipeline compatible with scikit-learn API.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages