Skip to content

GuanLab/ImageDoubler

Repository files navigation

ImageDoubler: Image-based Doublet Identification in Single-Cell Sequencing

This is the package of ImageDoubler, an image-based doublet detection model implemented on Faster-RCNN. It is trained on the images from the Fluidigm C1 platform. Please contact ([email protected] or [email protected]) if you have any questions or suggestions.

Figure1


Installations

Git clone this repository:

git clone https://github.com/GuanLab/ImageDoubler.git

Setup the running environment

  • For training/evaluating the model:
# it may take about 10 - 15 minutes to finish setup
conda env create -f environment.yml
  • For benchmark:
# it may take about 5 minutes to finish setup
conda env create -f scrna.yml

# for running the SoCube, which may need python 3.9 for its installation
conda create -n socube python=3.9
pip install socube

Creating environments from the .yml files sometimes may fail. Users can also try the conda commands in setup_environment.sh for setting the environments


Repeat the training and evaluation results in the paper

Download image data, processed expression data

Please follow the READMEs in data/ and expression/

Download the pre-trained weights

Please follow the READMEs in imagedoubler_paper/model_data/ for the ResNet-50 weights and imagedoubler_paper/logs/ for ImageDoubler weights

Train and evaluate from scratch

The training and evaluation codes have been tested on both Linux and Windows system. Make sure that you have downloaded the ResNet-50 weights has been downloaded.

  1. Prepare the data splits for cross-validation
cd imagedoubler_paper/
python data_prepare.py 
  1. Train models for LOOCV
# python train.py loocv/[Image_set_for_test] [model_num]
# Here showed an example of using the images from image set 1 as the test set 
# The other images are used as training and validation
for model in 1 2 3 4 5; do
    CUDA_VISIBLE_DEVICES=0 python train.py loocv/Image1 $model
done
  1. Train models for evaluations with expression data
# Images from image sets 5 and 11 are used as test data
for model in 1 2 3 4 5; do
    CUDA_VISIBLE_DEVICES=0 python train.py for_expression/ $model
done
  1. Generate inferences for LOOCV
for model in 1 2 3 4 5; do
    # retrieve the cells' confidence score and position
    # e.g.: cell 0.9895 48 73 55 79
    CUDA_VISIBLE_DEVICES=0 python get_map.py loocv/Image1 $model
    # generate the images with bounding boxes  
    CUDA_VISIBLE_DEVICES=0 python predict.py loocv/Image1 $model
done
  1. Generate inferences for evaluation with expression data
for model in 1 2 3 4 5; do
    python get_map.py for_expression/ $model  
    python predict.py for_expression/ $model
done
  1. Evaluations
python get_accuracy.py
python get_confusion.py

The full pipeline is integrated in run.sh. Linux users can directly run it by bash run.sh.

You can also use our pre-trained model to skip the training (steps 1-3), and directly run the inference or evaluation codes.

Process the sequencing data

The processed expression matrix is provided. It can be downloaded from expression.zip. Move it and unzip under the expression/ folder.

To start from the raw data, they can be downloaded at:

Codes for processing are available at scripts/expression/, you may need to modify the data paths in the codes accordingly:

  1. Demultiplex
./mRNASeqHT_demultiplex.pl -i input/dir/of/fastq/data/ -o output/dir/
# outputs may include the files like:
#   - 114709_TAAGGCGA_S1_ROW01_R1.fastq
#   - 114709_TAAGGCGA_S1_ROW01_R2.fastq
#   - ...
#   - 114709_TAAGGCGA_S1_ROW40_R2.fastq
#   - 114709_TAAGGCGA_S1-Undetermined_R1.fastq
#   - 114709_TAAGGCGA_S1-Undetermined_R2.fastq
  1. Extract expression data with kallisto
# C1-SUM149-H1975_columns.csv for image set 5
# C1-SUM149-SUM190_columns.csv fo image set 11
python run_kallisto.py the_column_file.csv
  1. Arrange the expression data into a matrix with tximport
Rscript generate_count_matrix.R

Benchmark

Scripts for running the other doublet detection methods are in scripts/benchmark/
To compare and visualize the results, use scripts/benchmark/benchmark.ipynb The data/file paths in these scripts need to be adjusted accordingly.


Train and inference with custom image data

Download the pre-trained weights

  • (If you train from scratch) Download the ResNet-50 weights in imagedoubler/model_data/
  • (If you finetune on ImageDoubler or use it just for inference) Download the ImageDoubler's weights in imagedoubler/logs/

Prepare the inputs for training

  1. Images of the cells
  2. The annotation file include the information of training images, which should contain the information like the following example:
# columns are:
# path_of_image xmin1,ymin1,xmax1,ymax1,0 xmin2,ymin2,xmax2,ymax2,0 ...
JPEGImages/Image11_40_12.jpg 46,70,52,77,0
JPEGImages/Image7_16_13.jpg
JPEGImages/Image8_12_20.jpg 118,181,134,192,0 211,143,219,156,0
JPEGImages/Image2_15_7.jpg 120,180,129,189,0 188,41,194,50,0
JPEGImages/Image7_9_12.jpg 118,179,130,190,0
...
  1. The annotation file include the information of validation images. It contains similar information as the train annotation organized in the same format

Training

cd imagedoubler
python train.py --train-anno path/to/annotation_images_train.txt \
                --val-anno path/to/annotation_images_validation.txt \
                --model-id 0 \  # a specific ID to name the model
                --pretrain-weight path/to/resnet_weight_or_ImageDoubler_weight.h5 \
                --out-dir path/to/directory_of_saving_weights_and_logs/

Making inference

Note: User can skip the training phase and use the ImageDoubler's weights to infer their custom images.

python predict.py --model-path path/to/custom_trained_or_ImageDoubler_model_weight.h5 \
                  --model-id 0 \  # should be consistent with the ID in the model name
                  --conf 0.7 \  # The confidence threshold for the detection
                  --image-dir path/to/directory_of_test_images/ \
                  --out-dir path/to/directory_for_save_inference_results/

The test images with inferred bounding boxes can be checked at: <out_dir>/detection-results-img-model<model_id>

Ensemble and generate final decision

# List the IDs for the models that you want to ensemble after --model-ids, 
# IDs should be separated by space
# IDs should be those given to train.py and predict.py
python ensemble.py --model-ids 1 2 3 4 5

This program will generate a output.csv file with the contents like below, indicating the cell condition for each image.

image_id,1,2,3,4,5,Ensemble
Image1_10_1,Singlet,Singlet,Singlet,Singlet,Singlet,Singlet
Image1_10_10,Singlet,Doublet,Doublet,Doublet,Doublet,Doublet
Image1_10_11,Singlet,Doublet,Singlet,Singlet,Doublet,Singlet
Image1_10_12,Singlet,Singlet,Singlet,Singlet,Singlet,Singlet
Image1_10_13,Singlet,Singlet,Singlet,Singlet,Singlet,Singlet
...

Reference

The Faster-RCNN implementation is based on the codes from: https://github.com/bubbliiiing/faster-rcnn-keras

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published