AVIDa-hIL6: A Large-Scale VHH Dataset Produced from an Immunized Alpaca for Predicting Antigen-Antibody Interactions
This repository contains the supplementary material accompanying the paper, AVIDa-hIL6: A Large-Scale VHH Dataset Produced from an Immunized Alpaca for Predicting Antigen-Antibody Interactions. The dataset is available at https://avida-hil6.cognanous.com under a CC BY-NC 4.0 license.
To get started, clone this repository and run the following commands to create a virtual environment and install the dependencies.
python3 -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt
Dataset | Download |
---|---|
Raw data (840 FASTQ files) | Link to Google Drive |
VHH sequences data (420 FASTA files) | Link to Google Drive |
Labeled dataset (1 CSV file) | Link to Zenodo |
For more information, see the AVIDa-hIL6 project page.
The labeled CSV file has four columns, as shown below.
Column | Description |
---|---|
VHH_sequence | Amino acid sequence of VHH |
Ag_sequence | Amino acid sequence of IL-6 protein |
Ag_label | Type of IL-6 protein |
label | Binary label represented by 1 for the binding pair and 0 for the non-binding pair |
The labeled dataset contains 573,891 data samples, comprising 20,980 binding pairs and 552,911 non-binding pairs. The following figure shows the number of data samples for each antigen type.
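The per-antigen counts can also be recomputed directly from the CSV file. The following is a minimal sketch, assuming only the four columns described above; `summarize_labels` is a hypothetical helper name, and the rows in `toy_df` are made-up placeholders rather than real AVIDa-hIL6 records.

```python
import pandas as pd


def summarize_labels(dataset_df: pd.DataFrame) -> pd.DataFrame:
    """Count binding (1) and non-binding (0) pairs for each antigen type."""
    counts = (
        dataset_df.groupby(["Ag_label", "label"])
        .size()
        .unstack(fill_value=0)
        # Guarantee both label columns exist even if one class is absent.
        .reindex(columns=[0, 1], fill_value=0)
        .rename(columns={0: "non_binding", 1: "binding"})
    )
    counts["total"] = counts["non_binding"] + counts["binding"]
    return counts


# Toy rows for illustration only; these are not real AVIDa-hIL6 records.
toy_df = pd.DataFrame({
    "VHH_sequence": ["QVQLVE", "QVQLQE", "EVQLVE"],
    "Ag_sequence": ["MNSFST", "MNSFST", "MNSFSA"],
    "Ag_label": ["IL-6_WTs", "IL-6_WTs", "IL-6_G63A"],
    "label": [1, 0, 0],
})
summary = summarize_labels(toy_df)
print(summary)
```

For the real dataset, the same function could be called as `summarize_labels(pd.read_csv("il6_aai_dataset.csv"))`.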
The labeled dataset is generated through the following workflow. The scripts highlighted in blue can be found in the dataset folder.
Here is how to generate the labeled CSV file from the raw FASTQ files. First, build the Docker image.
docker build -t avida-hil6:latest .
After placing the raw data (FASTQ files) under data/, execute the following command to output a labeled CSV file: out/il6_aai_dataset.csv.
./dataset/preprocess.sh
We recommend splitting the dataset based on the type of IL-6 protein to predict the impact of antigen mutations on antibody binding.
For example, to create a training set containing "IL-6_WTs" and "IL-6_G63A", filter the dataset using the Ag_label column as follows.
import pandas as pd

# Keep only the rows whose antigen type belongs to the training set.
dataset_df = pd.read_csv("il6_aai_dataset.csv")
training_df = dataset_df[dataset_df["Ag_label"].isin(["IL-6_WTs", "IL-6_G63A"])]
training_df.to_csv("training_set.csv", index=False)
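A matching test set is needed for evaluation. One possible approach, sketched below, is to place every antigen type that is not used for training into the test set. Treating the test set as the exact complement of the training set is an assumption here, `split_by_antigen` is a hypothetical helper, and the toy rows (including the "IL-6_other" label) are placeholders rather than real records.

```python
import pandas as pd


def split_by_antigen(dataset_df, train_labels):
    """Rows whose Ag_label is in train_labels form the training set; all others form the test set."""
    is_train = dataset_df["Ag_label"].isin(train_labels)
    return dataset_df[is_train], dataset_df[~is_train]


# Toy rows for illustration only; "IL-6_other" is a hypothetical antigen label.
toy_df = pd.DataFrame({
    "VHH_sequence": ["QVQLVE", "QVQLQE", "EVQLVE", "QVKLEE"],
    "Ag_sequence": ["MNSFST", "MNSFSA", "MNSFSV", "MNSFSV"],
    "Ag_label": ["IL-6_WTs", "IL-6_G63A", "IL-6_other", "IL-6_other"],
    "label": [1, 0, 1, 0],
})
training_df, test_df = split_by_antigen(toy_df, ["IL-6_WTs", "IL-6_G63A"])

# No antigen type should appear on both sides of the split.
assert set(training_df["Ag_label"]).isdisjoint(test_df["Ag_label"])
```

With the real dataset, `test_df.to_csv("test_set.csv", index=False)` would write the held-out pairs to a CSV file; the file name "test_set.csv" is only an example.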
We implemented the following three encoding methods. The CSV file will be converted to npz files by executing the following commands.
- CKSAAP encoding for AbAgIntPre
python benchmarks/encodings/CKSAAP.py --data-path "training_set.csv" --file-name "train_CKSAAP"
- Pre-trained skip-gram model based encoding for PIPR
python benchmarks/encodings/skipgram.py --data-path "training_set.csv" --file-name "train_skipgram"
- One-hot encoding for MLP and LR
python benchmarks/encodings/onehot.py --data-path "training_set.csv" --file-name "train_onehot"
Arguments:
Argument | Required | Default | Description |
---|---|---|---|
--data-path | Yes | | Path of the target CSV file to be preprocessed |
--save-dir | No | "." | Directory to save the preprocessed npz file |
--file-name | No | "il6_aai_dataset" | Name of output npz file |
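To give a feel for what the CKSAAP (composition of k-spaced amino acid pairs) encoding computes, here is a minimal, self-contained sketch. It is not the code in benchmarks/encodings/CKSAAP.py: the function name `cksaap`, the default `k_max=3`, and the choice to normalize by the number of valid windows are all assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence, k_max=3):
    """Return k-spaced pair frequencies for gaps k = 0..k_max (400 values per gap)."""
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_windows = 0
        # Pair each residue with the residue k positions further along (gap of k).
        for i in range(len(sequence) - k - 1):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # skip pairs containing non-standard residues
                counts[pair] += 1
                n_windows += 1
        # Normalize counts to frequencies; an empty window set yields zeros.
        features.extend(counts[p] / n_windows if n_windows else 0.0 for p in PAIRS)
    return features


vec = cksaap("ACAC", k_max=1)
print(len(vec))  # 800
```

For "ACAC" with `k_max=1`, the gap-0 pairs are AC, CA, AC and the gap-1 pairs are AA, CC, so the vector holds 2/3 for AC and 1/3 for CA in the first 400 entries, and 0.5 each for AA and CC in the second 400.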
The model is trained using train-data and evaluated using test-data by executing the following command. Make sure to use the encoding npz file that matches model-name.
python benchmarks/train.py --train-data "train_CKSAAP.npz" \
--test-data "test_CKSAAP.npz" \
--model-name "AbAgIntPre" \
--epochs 100 \
--batch-size 256 \
--amp "True"
After training is completed, evaluation metrics for the test data are output.
Test: loss 0.0705, accuracy 0.9891, AUROC 0.9468, AUPRC 0.8286, precision 0.9452, recall 0.7488, f1 0.8356, MCC 0.8361
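The sample output reports a high accuracy alongside a noticeably lower recall, which is typical for imbalanced data: in the full dataset only 20,980 of 573,891 pairs are binders, so a classifier that always predicts "non-binding" would already reach about 96.3% accuracy on it. The sketch below shows how the threshold-based metrics follow from confusion-matrix counts. It is an illustration with made-up toy predictions, not the evaluation code in benchmarks/train.py, and `binary_metrics` is a hypothetical helper.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1, and MCC from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "MCC": mcc,
    }


# Toy labels and predictions for illustration only.
metrics = binary_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
print(metrics)
```

AUROC and AUPRC are not covered here because they depend on the predicted scores rather than on thresholded labels.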
Arguments:
Argument | Required | Default | Description |
---|---|---|---|
--train-data | Yes | | Path of training data |
--test-data | Yes | | Path of test data |
--model-name | Yes | | Model name ("AbAgIntPre", "PIPR", or "MLP") |
--valid-ratio | No | 0.1 | Ratio used for validation data from training data |
--save-dir | No | "./saved" | Save directory path |
--epochs | No | 20 | Number of epochs |
--batch-size | No | 256 | Size of mini-batch |
--amp | No | False | Use Automatic Mixed Precision to reduce memory usage |
--run-id | No | | Run ID used for the directory name for saving the results |
--model-path | No | | Model path used for retraining |
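Before passing an npz file to --train-data or --test-data, it can help to check which arrays it contains and what shapes they have, for example to confirm that the file was produced by the encoding that the chosen model expects. The sketch below lists the contents of any npz file with NumPy. The array names used by this repository's encoders are not documented here, so the toy file uses the hypothetical names "features" and "labels"; `describe_npz` is also a hypothetical helper.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Return {array_name: shape} for every array stored in an npz file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}


# Toy file for illustration; "features" and "labels" are hypothetical array names.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.npz")
    np.savez(toy_path, features=np.zeros((4, 8)), labels=np.array([1, 0, 0, 1]))
    shapes = describe_npz(toy_path)

print(shapes)
```

With a real file, `describe_npz("train_CKSAAP.npz")` would report whatever arrays the encoding script stored.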
The LR model is trained using train-data and evaluated using test-data by executing the following command. Make sure to use the one-hot encoding npz file for LR.
python benchmarks/train_lr.py --train-data "train_onehot.npz" --test-data "test_onehot.npz"
Arguments:
Argument | Required | Default | Description |
---|---|---|---|
--train-data | Yes | | Path of training data |
--test-data | Yes | | Path of test data |
--save-dir | No | "./saved" | Save directory path |
--run-id | No | | Run ID used for the directory name for saving the results |
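For readers unfamiliar with the one-hot encoding that LR and MLP consume, the following minimal sketch shows the general idea: each residue becomes a 20-dimensional indicator vector, and the sequence is padded to a fixed length and flattened. This is not the code in benchmarks/encodings/onehot.py; the function name `one_hot`, the zero-padding scheme, and the handling of unknown residues are assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence, max_len):
    """Encode a sequence as a flattened (max_len * 20) one-hot vector, zero-padded."""
    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    # Sequences longer than max_len are truncated; shorter ones keep zero rows.
    for pos, aa in enumerate(sequence[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:  # unknown residues stay all-zero
            encoded[pos, idx] = 1.0
    return encoded.reshape(-1)


vec = one_hot("ACD", max_len=5)
print(vec.shape)  # (100,)
```

A VHH-antigen pair could then be represented by concatenating the two encoded vectors, although how the repository combines them is not specified here.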
If you use AVIDa-hIL6 in your research, please use the following citation:
@inproceedings{tsuruta2023avida,
title={{AVID}a-h{IL}6: A Large-Scale {VHH} Dataset Produced from an Immunized Alpaca for Predicting Antigen-Antibody Interactions},
author={Hirofumi Tsuruta and Hiroyuki Yamazaki and Ryota Maeda and Ryotaro Tamura and Jennifer N. Wei and Zelda Mariet and Poomarin Phloyphisut and Hidetoshi Shimokawa and Joseph R. Ledsam and Lucy Colwell and Akihiro Imura},
booktitle={Advances in Neural Information Processing Systems 36},
year={2023},
}