AVIDa-hIL6: A Large-Scale VHH Dataset Produced from an Immunized Alpaca for Predicting Antigen-Antibody Interactions

This repository contains the supplementary material accompanying the paper, AVIDa-hIL6: A Large-Scale VHH Dataset Produced from an Immunized Alpaca for Predicting Antigen-Antibody Interactions. The dataset is available at https://avida-hil6.cognanous.com under a CC BY-NC 4.0 license.

Overview of the data generation process.

Environment

To get started, clone this repository and run the following commands to create a virtual environment and install the dependencies.

python3 -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt

Dataset

Downloads

Dataset	Download
Raw data (840 FASTQ files)	Link to Google Drive
VHH sequences data (420 FASTA files)	Link to Google Drive
Labeled dataset (1 CSV file)	Link to Zenodo

For more information, see the AVIDa-hIL6 project page.

Labeled Dataset

The labeled CSV file has four columns, as shown below.

Column	Description
VHH_sequence	Amino acid sequence of VHH
Ag_sequence	Amino acid sequence of IL-6 protein
Ag_label	Type of IL-6 protein
label	Binary label represented by 1 for the binding pair and 0 for the non-binding pair

The labeled dataset contains 573,891 data samples, comprising 20,980 binding pairs and 552,911 non-binding pairs. The following figure shows the number of data samples for each antigen type.
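
As a quick sanity check on a downloaded file, the pairs per antigen type and per label can be tallied with the standard library. The sketch below is illustrative only: it reads a tiny in-memory CSV whose sequences are made-up placeholders (not real dataset rows) and which uses the four column names from the table above. To run it on the real CSV, pass an open file handle instead of the `io.StringIO` object.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the labeled CSV; the sequences are placeholders, not real data.
TOY_CSV = """VHH_sequence,Ag_sequence,Ag_label,label
QVQLVE,MNSFST,IL-6_WTs,1
QVQLQE,MNSFST,IL-6_WTs,0
EVQLVE,MNSFAT,IL-6_G63A,0
"""

def count_pairs(handle):
    """Return (pairs per antigen type, pairs per binary label)."""
    per_antigen, per_label = Counter(), Counter()
    for row in csv.DictReader(handle):
        per_antigen[row["Ag_label"]] += 1
        per_label[int(row["label"])] += 1
    return per_antigen, per_label

per_antigen, per_label = count_pairs(io.StringIO(TOY_CSV))
print(dict(per_antigen))
print(dict(per_label))

# Class balance implied by the totals quoted above.
binding, non_binding = 20_980, 552_911
total = binding + non_binding
print(total, f"{binding / total:.2%}")
```

The last line shows that binding pairs make up under 4% of the samples, so the dataset is strongly imbalanced toward non-binding pairs.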

Data Processing

The labeled dataset is generated through the following workflow. The scripts highlighted in blue can be found in the dataset folder.

Here is how to generate the labeled CSV file from the raw FASTQ files. First, build the Docker image.

docker build -t avida-hil6:latest .

After placing the raw data (FASTQ files) under data/, execute the following command to output the labeled CSV file, out/il6_aai_dataset.csv.

./dataset/preprocess.sh

Benchmarks

Data Splitting

We recommend splitting the dataset based on the type of IL-6 protein to predict the impact of antigen mutations on antibody binding. For example, to create a training set containing "IL-6_WTs" and "IL-6_G63A", filter the dataset using the Ag_label column as follows.

import pandas as pd

# Keep only the pairs whose antigen type belongs to the training set.
dataset_df = pd.read_csv("il6_aai_dataset.csv")
training_df = dataset_df[dataset_df["Ag_label"].isin(["IL-6_WTs", "IL-6_G63A"])]
training_df.to_csv("training_set.csv", index=False)
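
The training commands later in this README also expect test files. One way to obtain a held-out set, sketched below as an assumption rather than the authors' exact procedure, is to put every antigen type that is not in the training list into the test set. The helper name `split_by_antigen`, the toy rows, and the antigen label "IL-6_X" are made up for illustration.

```python
import pandas as pd

def split_by_antigen(df, train_labels):
    """Split rows into (train, test) by whether Ag_label is in train_labels."""
    in_train = df["Ag_label"].isin(train_labels)
    return df[in_train], df[~in_train]

# Toy frame with placeholder sequences; "IL-6_X" is a made-up antigen label.
toy_df = pd.DataFrame({
    "VHH_sequence": ["QVQ", "EVQ", "QLQ", "EVK"],
    "Ag_sequence": ["MNS", "MNA", "MNT", "MNT"],
    "Ag_label": ["IL-6_WTs", "IL-6_G63A", "IL-6_X", "IL-6_X"],
    "label": [1, 0, 0, 1],
})

train_df, test_df = split_by_antigen(toy_df, ["IL-6_WTs", "IL-6_G63A"])
print(len(train_df), len(test_df))
```

Because the split is defined by antigen type, no antigen appears on both sides, which matches the stated goal of predicting the effect of unseen mutations.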

Sequence Encoding

We implemented the following three encoding methods. Each command below converts the CSV file to an npz file.

CKSAAP encoding for AbAgIntPre

python benchmarks/encodings/CKSAAP.py --data-path "training_set.csv" --file-name "train_CKSAAP"
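
For readers unfamiliar with CKSAAP (composition of k-spaced amino acid pairs), the following is a minimal standard-library sketch of the general technique, not the code in benchmarks/encodings/CKSAAP.py. The maximum gap, the handling of non-standard residues, and the function name `cksaap` are assumptions chosen for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def cksaap(sequence, max_gap=2):
    """Frequencies of residue pairs separated by k positions, for k = 0..max_gap.

    Returns a flat list of 400 * (max_gap + 1) values.
    """
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_windows = len(sequence) - gap - 1
        for i in range(max(n_windows, 0)):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in counts:  # assumption: skip non-standard residues such as "X"
                counts[pair] += 1
        denom = max(n_windows, 1)  # avoid division by zero for very short inputs
        features.extend(counts[p] / denom for p in PAIRS)
    return features

vec = cksaap("ACAC", max_gap=1)
print(len(vec))
```

Each gap contributes a block of 400 frequencies, so the feature length depends only on the maximum gap and not on the sequence length.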

Pre-trained skip-gram model based encoding for PIPR

python benchmarks/encodings/skipgram.py --data-path "training_set.csv" --file-name "train_skipgram"

One-hot encoding for MLP and LR

python benchmarks/encodings/onehot.py --data-path "training_set.csv" --file-name "train_onehot"
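
One-hot encoding can be sketched in a few lines. The version below is an illustration under stated assumptions, not the contents of benchmarks/encodings/onehot.py: it assumes a 20-letter amino acid alphabet, zero-padding up to a fixed `max_len`, truncation beyond it, and all-zero rows for unknown residues.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence, max_len):
    """Encode a sequence as a flat list of max_len * 20 zeros and ones."""
    width = len(AMINO_ACIDS)
    vec = [0] * (max_len * width)
    for pos, aa in enumerate(sequence[:max_len]):  # truncate long sequences
        idx = AA_INDEX.get(aa)
        if idx is not None:  # unknown residues stay all-zero
            vec[pos * width + idx] = 1
    return vec

encoded = one_hot("ACD", max_len=4)
print(len(encoded), sum(encoded))
```

A fixed output length is what lets models with fixed-size inputs, such as an MLP or logistic regression, consume sequences of different lengths.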

Arguments:

Argument	Required	Default	Description
--data-path	Yes		Path of the target CSV file to be preprocessed
--save-dir	No	"."	Directory to save the preprocessed npz file
--file-name	No	"il6_aai_dataset"	Name of output npz file
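
The README does not list the array names stored in the output npz files, so it can help to inspect a file before training. The sketch below writes a toy archive with hypothetical keys ("features" and "labels" are assumptions, not the real key names) and then lists every stored array with its shape using NumPy.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_encoding.npz")
    # Hypothetical key names; the real files may use different ones.
    np.savez(path, features=np.zeros((3, 8)), labels=np.array([1, 0, 0]))

    # archive.files lists the stored array names without assuming them.
    with np.load(path) as archive:
        summary = {key: archive[key].shape for key in archive.files}

print(summary)
```

Pointing `np.load` at a real encoded file instead of the toy archive reports the actual keys and shapes.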

Model Training and Evaluation

AbAgIntPre, PIPR and MLP

The following command trains the model on --train-data and evaluates it on --test-data. Make sure the npz files use the encoding that matches --model-name.

python benchmarks/train.py --train-data "train_CKSAAP.npz" \
  --test-data "test_CKSAAP.npz" \
  --model-name "AbAgIntPre" \
  --epochs 100 \
  --batch-size 256 \
  --amp "True"

After training is completed, evaluation metrics for the test data are output.

Test: loss 0.0705, accuracy 0.9891, AUROC 0.9468, AUPRC 0.8286, precision 0.9452, recall 0.7488, f1 0.8356, MCC 0.8361
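
To make the relationship between these metrics concrete, the sketch below computes them from confusion-matrix counts using the standard definitions. It is not taken from benchmarks/train.py, the counts are made up, and the helper `binary_metrics` is hypothetical. The last lines check that the F1 value in the sample output above follows from the reported precision and recall.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary metrics; assumes no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}

# Made-up counts for an imbalanced test set.
m = binary_metrics(tp=6, fp=2, tn=90, fn=2)
print(m)

# F1 is the harmonic mean of precision and recall from the sample output.
reported_f1 = 2 * 0.9452 * 0.7488 / (0.9452 + 0.7488)
print(round(reported_f1, 4))
```

With so few binding pairs, accuracy is dominated by the non-binding class, which is why AUPRC, F1, and MCC are the more informative numbers in that line.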

Arguments:

Argument	Required	Default	Description
--train-data	Yes		Path of training data
--test-data	Yes		Path of test data
--model-name	Yes		Model name ("AbAgIntPre", "PIPR", or "MLP")
--valid-ratio	No	0.1	Ratio used for validation data from training data
--save-dir	No	"./saved"	Save directory path
--epochs	No	20	Number of epochs
--batch-size	No	256	Size of mini-batch
--amp	No	False	Use Automatic Mixed Precision to reduce memory usage
--run-id	No		Run ID used for the directory name for saving the results
--model-path	No		Model path used for retraining
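
As an illustration of what --valid-ratio controls, the sketch below holds out a fraction of the training indices for validation. How train.py actually selects the validation samples (random, stratified, or otherwise) is not stated here, so the seeded random shuffle and the helper name `train_valid_indices` are assumptions.

```python
import random

def train_valid_indices(n_samples, valid_ratio=0.1, seed=0):
    """Return (train indices, validation indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = int(n_samples * valid_ratio)
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = train_valid_indices(1000, valid_ratio=0.1)
print(len(train_idx), len(valid_idx))
```

With the default ratio of 0.1, one tenth of the training samples is set aside for validation and the rest is used for fitting.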

LR

The following command trains the model on --train-data and evaluates it on --test-data. Make sure to use the one-hot encoded npz files for LR.

python benchmarks/train_lr.py --train-data "train_onehot.npz" --test-data "test_onehot.npz"

Arguments:

Argument	Required	Default	Description
--train-data	Yes		Path of training data
--test-data	Yes		Path of test data
--save-dir	No	"./saved"	Save directory path
--run-id	No		Run ID used for the directory name for saving the results

Citation

If you use AVIDa-hIL6 in your research, please use the following citation:

@inproceedings{tsuruta2023avida,
  title={{AVID}a-h{IL}6: A Large-Scale {VHH} Dataset Produced from an Immunized Alpaca for Predicting Antigen-Antibody Interactions},
  author={Hirofumi Tsuruta and Hiroyuki Yamazaki and Ryota Maeda and Ryotaro Tamura and Jennifer N. Wei and Zelda Mariet and Poomarin Phloyphisut and Hidetoshi Shimokawa and Joseph R. Ledsam and Lucy Colwell and Akihiro Imura},
  booktitle={Advances in Neural Information Processing Systems 36},
  year={2023},
}
