COVID-CXR-Classification

This project builds a pipeline with multiple trained models to classify chest X-ray (CXR) images as Normal, Pneumonia, or COVID-19.

Features

DVC pipeline (with a simple CLI for setup and run) to version and reproduce the whole process
Segmentation and adaptive histogram equalization with OpenCV in preprocessing
Over 20,000 CXRs and labels
Visualization of image transformation for clarification
Trained models including ResNet and VGGNet in PyTorch, and COVID-Net in TensorFlow
Augmentation in training
Grad-CAM visualization of model features for clarification
Hyperparameter tuning for ETL, training, evaluation, models, and visualization

Environment

Main software packages include:

  - conda=4.9.1
  - python=3.6
  - pyspark=3.0.1
  - pytorch=1.3.1
  - tensorflow=2.3.1
  - dvc==1.10.2
  - torchvision=0.4.2
  - scikit-learn
  - numpy
  - scipy
  - pandas
  - matplotlib=3.1.0
  - pydicom
  - opencv
  - ipython
  - notebook
  - jupyter
  - ipykernel
  - pip

Setup

Set up Python, Anaconda, Git, and a cloned copy of the project
Create the environment with

conda env create -f env.yml

Download the data. See data source
[Optional] Download the COVID-Net models. See model source
On Windows, build the pipeline with

setup

Usage

On Windows, run the pipeline with

run

Pipeline

The pipeline is built on DVC (Data Version Control). It caches and versions the data flow and constructs a DAG (directed acyclic graph) that is used to reproduce the whole procedure. The DAG consists of a series of ordered stages, each with dependencies and outputs, including hyperparameter settings. Each stage executes an OS-dependent command (only Windows is supported for now). The pipeline executes a series of numbered main files (.ipynb, .py) located in ./src/main/ and also computes hashes, stored locally in ./.hash/, for the pipeline graph. The output of each .ipynb main file, as part of its stage command, is converted to a local HTML file for readability. The output of the .py main file ./src/main/200 Train.py, as part of its stage command, is redirected to a local train.log.txt for readability. Examples are displayed in ./main files demo/.
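
To illustrate how a DAG of stages fixes the execution order, here is a minimal sketch using Python's standard-library graphlib module (Python 3.9+, which is newer than the Python 3.6 this project pins). The stage names are hypothetical placeholders, not the stages defined in this project's dvc.yaml, and DVC's own scheduling does more than this (for example, it skips stages whose dependencies have not changed).

```python
from graphlib import TopologicalSorter

# Hypothetical stages mapped to the stages they depend on.
# These names are illustrative only; the real stages live in dvc.yaml.
STAGES = {
    "etl": set(),
    "transform": {"etl"},
    "train": {"transform"},
    "evaluate": {"train"},
    "visualize": {"transform", "train"},
}


def execution_order(stages):
    """Return one valid run order in which every stage follows its dependencies."""
    return list(TopologicalSorter(stages).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(STAGES)))
```

Several orders can be valid for the same graph; the only guarantee is that a stage never runs before the stages it depends on.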

DAG

Hyperparameters

params.yaml is the hyperparameter file. It is part of the DVC pipeline graph and lets the user tune the whole procedure, from ETL and model setup to model training and visualization. It controls:

ETL: image size, crop area, Spark control, segmentation control, adaptive or global histogram equalization, etc.
Model: model tool (TensorFlow/PyTorch), model name, model architecture (VGG, ResNet, COVID-Net), transfer-learning control, etc.
Train: epoch, learning rate, batch size, COVID-19 label weights in batch, COVID-19 label weights in optimization, etc.
Visualization: example number, figure size for both transformation and model feature.
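
As a rough sketch of how such grouped settings could be read in code, the snippet below looks up nested values by dotted key. It assumes params.yaml has already been parsed into a plain dict (for example with PyYAML's yaml.safe_load). All keys and values shown are hypothetical and are not copied from the project's params.yaml.

```python
# Hypothetical parsed content of params.yaml; the real keys and values may differ.
PARAMS = {
    "etl": {"image_size": 480, "segmentation": True, "hist_eq": "adaptive"},
    "model": {"tool": "pytorch", "name": "resnet18", "transfer_learning": True},
    "train": {"epochs": 10, "learning_rate": 1e-4, "batch_size": 32},
}


def get_param(params, dotted_key, default=None):
    """Look up a nested value with a dotted key such as 'train.batch_size'."""
    node = params
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


if __name__ == "__main__":
    print(get_param(PARAMS, "train.batch_size"))
    print(get_param(PARAMS, "train.momentum", default=0.9))
```

Returning a default for missing keys keeps each main file usable when a section of the file is omitted.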

Data

Data Source

The current COVIDx dataset is constructed from the following open-source chest radiography datasets:

Preprocess

Use PySpark to run the ETL process on image metadata. Transform the image data, including histogram equalization and segmentation with OpenCV, and prepare it for PyTorch. See ./src/etl.py and ./src/etl_spark.py for the Spark ETL. See ./src/transform.py for the pre-training image transform.
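
To show what histogram equalization does to pixel intensities, here is a minimal pure-Python sketch of the global variant on a tiny grayscale image. It is not the project's implementation, which relies on OpenCV (OpenCV offers cv2.equalizeHist for the global case and cv2.createCLAHE for a contrast-limited adaptive one). The adaptive variant applies the same kind of remapping per image tile rather than once for the whole image. The function name and the sample image are made up for illustration.

```python
def equalize_histogram(image, levels=256):
    """Globally equalize a 2-D list of integer intensities in [0, levels)."""
    pixels = [p for row in image for p in row]

    # Histogram: how many pixels take each intensity.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution function over intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)  # CDF at the darkest intensity present
    total = len(pixels)
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return [list(row) for row in image]

    # Lookup table that spreads the CDF linearly over the full intensity range.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[p] for p in row] for row in image]


if __name__ == "__main__":
    low_contrast = [[100, 101], [102, 103]]
    print(equalize_histogram(low_contrast))  # [[0, 85], [170, 255]]
```

The four nearly identical input intensities are spread across the full 0 to 255 range, which is the contrast gain the preprocessing step is after.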

Data description:

# total
Counter({'pneumonia': 11092, 'normal': 10340, 'covid': 617})
# test data from covid datasets: 
Counter({2: 2219, 1: 2068, 0: 124})
# train data from covid datasets: 
Counter({2: 8873, 1: 8272, 0: 493})
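
The numeric labels in the train and test counts are not spelled out. Matching them against the per-class totals suggests 0 = covid, 1 = normal, 2 = pneumonia, which is an inference rather than something stated here. The short check below, a sketch using only the counts listed above, confirms that the splits add up to the totals and that roughly 20% of each class is held out for testing.

```python
from collections import Counter

# Counts copied from the data description above.
total_counts = Counter({"pneumonia": 11092, "normal": 10340, "covid": 617})
test_counts = Counter({2: 2219, 1: 2068, 0: 124})
train_counts = Counter({2: 8873, 1: 8272, 0: 493})

# Assumed mapping, inferred from matching the counts; not stated in the source.
LABEL_NAMES = {0: "covid", 1: "normal", 2: "pneumonia"}


def held_out_fractions(train, test):
    """Return the share of each class that falls in the test split."""
    return {
        LABEL_NAMES[label]: test[label] / (train[label] + test[label])
        for label in sorted(test)
    }


if __name__ == "__main__":
    for label, name in LABEL_NAMES.items():
        # Train and test counts add up to the per-class totals.
        assert train_counts[label] + test_counts[label] == total_counts[name]
    for name, fraction in held_out_fractions(train_counts, test_counts).items():
        print(f"{name}: {fraction:.3f} of samples held out for testing")
```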

Demo of Pretraining Transform

Model

Supported models include VGG11, VGG19, ResNet18, ResNet50, COVID-Net-CXR-A, COVID-Net-CXR-Large, and COVID-Net-CXR-Small. ResNet and VGGNet are in PyTorch and have a complete computational model structure with weights. COVID-Net is from https://github.com/lindawangg/COVID-Net; it does not come with a full computational model, only a meta graph with saved weight checkpoints.

Trained model and weights

For the COVID-Net TensorFlow models, get the meta graph and checkpoints from https://github.com/lindawangg/COVID-Net. For the VGGNet and ResNet PyTorch models, load the saved models from ./model/

Demo of Grad-CAM visualization of model features

Demo of model output metrics

Learning curve - PPVs
Learning curve - TPRs
Learning curve - losses
Confusion matrix (horizontally normalized for PPV/Sensitivity/Specificity)
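
For readers less familiar with these metrics, the sketch below derives per-class PPV (precision) and TPR (sensitivity) from a confusion matrix. It assumes rows are true classes and columns are predicted classes, which may not match the orientation used in the project's plots. The matrix values are invented for illustration and are not results from this project.

```python
def per_class_metrics(confusion):
    """Compute PPV and TPR per class from a square confusion matrix.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        true_positive = confusion[k][k]
        actual = sum(confusion[k])                          # row sum: samples of class k
        predicted = sum(confusion[i][k] for i in range(n))  # column sum: predicted as k
        tpr = true_positive / actual if actual else 0.0
        ppv = true_positive / predicted if predicted else 0.0
        metrics.append({"ppv": ppv, "tpr": tpr})
    return metrics


if __name__ == "__main__":
    # Made-up counts for three classes, purely for demonstration.
    example = [
        [8, 2, 0],
        [1, 18, 1],
        [1, 4, 25],
    ]
    for index, scores in enumerate(per_class_metrics(example)):
        print(f"class {index}: PPV={scores['ppv']:.3f}, TPR={scores['tpr']:.3f}")
```

Under this orientation, dividing each row by its sum puts the TPR on the diagonal, while dividing each column by its sum puts the PPV there.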

Train and Evaluate

In-training augmentation.
Due to sample imbalance, batch weights and optimization weights for COVID-19 are set according to the weights configured in params.yaml.
For VGGNet and ResNet, you can choose to train from scratch, from a pretrained model downloaded with torchvision, or from a pretrained model saved in ./model/. For COVID-Net, you can choose to train from scratch or from a previous checkpoint.
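
To make the idea of optimization weights concrete, here is a minimal pure-Python sketch of a class-weighted cross-entropy loss. The project's actual loss code is not shown in this README, so the function name, the weight values, and the example logits are all assumptions. The weighted-mean normalization used here matches the common convention of dividing by the sum of the applied weights.

```python
import math


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-sample cross-entropy losses.

    logits: list of per-sample score lists; labels: list of class indices;
    class_weights: one weight per class, larger for classes that should count more.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for scores, label in zip(logits, labels):
        # Numerically stable log of the softmax normalizer.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        loss = log_norm - scores[label]  # negative log-probability of the true class
        weighted_sum += class_weights[label] * loss
        weight_total += class_weights[label]
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Hypothetical batch: a misclassified COVID-19 sample (label 0)
    # and a correctly classified normal sample (label 1).
    logits = [[0.2, 2.0, 0.1], [0.1, 3.0, 0.2]]
    labels = [0, 1]
    print(weighted_cross_entropy(logits, labels, [1.0, 1.0, 1.0]))
    print(weighted_cross_entropy(logits, labels, [5.0, 1.0, 1.0]))
```

Raising the weight of the rare class makes its mistakes dominate the loss, so the optimizer is pushed harder to correct them. Batch weights address the same imbalance from the sampling side, by changing how often rare-class samples appear in a batch.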

Iterations

Two iterations of modeling and training results are currently available for ResNet.
Changes from iter1 to iter2:

add histogram equalization
add segmentation
image shape from 224x224x3 to 480x480x1

Issues

ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
Solution: see https://github.com/dotnet/spark/issues/61 and https://stackoverflow.com/questions/51922477/running-spark-pyspark-first-time
Default TensorFlow 1.x builds do NOT include CPU instructions that speed up matrix computation, such as AVX and AVX2.
Solution: see the explanation at https://stackoverflow.com/questions/47068709/your-cpu-supports-instructions-that-this-tensorflow-binary-was-not-compiled-to-u and the wheel downloads at https://github.com/fo40225/tensorflow-windows-wheel/tree/master/2.1.0/py37/CPU%2BGPU/cuda102cudnn76avx2, then install with 'pip install --ignore-installed --upgrade /path/target.whl'
COVID-Net is only a meta graph with saved checkpoints, so its features cannot be visualized.

Todo

collect new data for test
CLI with more functionality

