HLARIMNT

Code for training and evaluation used in "Efficient HLA imputation from sequential SNPs data by Transformer".

(Part of our code is adapted from "A deep learning method for HLA imputation and trans-ethnic MHC fine-mapping of type 1 diabetes".)

Proposal Part

The structure of the Transformer-based model proposed in this study is implemented in the following directory (see also the Some Notes section for remarks on the Embedding Layer).

src/exp/codes/architectures/

In addition, detailed settings such as hyperparameters are described in the following configuration file.

src/exp/resources/settings.yaml

Data preparation

The Pan-Asian reference panel is available here, and the T1DGC reference panel is available here after completing the registration process.

Put the directory that contains the reference panels in /PATH/TO/ORIGINAL_DATA (edit src/config.yaml and invoke_container.sh accordingly).
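For example, for the Pan-Asian panel used in the example below, the expected layout is roughly as follows (file names are taken from that example; /PATH/TO/ORIGINAL_DATA is whatever path you configure):

/PATH/TO/ORIGINAL_DATA/
└── Pan-Asian/
    ├── Pan-Asian_REF.bim
    └── Pan-Asian_REF.bgl.phased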

Experiment

  • Run the following command to move to the experiment directory.
cd src/exp
  • Run the following command to create an information file for the reference panel.
python make_hlainfo.py --ref REFERENCE_PANEL_NAME --data_dir DATA_DIR_NAME

ex) If the reference panel files are "Pan-Asian_REF.bim/bgl.phased" and they are stored in the "Pan-Asian" directory, REFERENCE_PANEL_NAME will be "Pan-Asian_REF" and DATA_DIR_NAME will be "Pan-Asian".
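With these names, the concrete invocation is:

python make_hlainfo.py --ref Pan-Asian_REF --data_dir Pan-Asian

The same --ref and --data_dir values are passed to the other commands below.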

  • Run the following command to create REFERENCE_PANEL_NAME_sample.bim, which contains only the SNP information from the .bim file of the reference panel.
python make_samplebim.py --ref REFERENCE_PANEL_NAME --data_dir DATA_DIR_NAME
  • Run the following command to train the model on the partitioned data.
python run_train.py --ref REFERENCE_PANEL_NAME --data_dir DATA_DIR_NAME

This command will create a directory src/exp/REFERENCE_PANEL_NAME that contains an "outputs" directory and a "processed_data" directory. The former holds the models and logs generated during training in each fold, and the latter holds the encoded data and data loaders for each fold.
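Roughly, for the Pan-Asian example above, the generated layout looks like the following sketch (the exact fold directory names may differ; outputs/FOLD_NUM/results/test_evals.csv is produced later by the evaluation step):

src/exp/Pan-Asian_REF/
├── processed_data/              # encoded data and data loaders for each fold
└── outputs/
    └── FOLD_NUM/                # one directory per fold
        ├── ...                  # models and logs generated during training
        └── results/
            └── test_evals.csv   # created by run_eval.py (see below)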

If you do not want to perform cross-validation (i.e., you just want to train the model), set the variable "fold_num" in src/exp/resources/settings.yaml to 1 (the default is 5).
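For example, the corresponding line in src/exp/resources/settings.yaml would read:

fold_num: 1    # set back to 5 (the default) to run 5-fold cross-validation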

  • Run the following command to calculate the accuracy for each allele on the test data of each fold.
python run_eval.py --ref REFERENCE_PANEL_NAME --data_dir DATA_DIR_NAME

This command will create a CSV file outputs/FOLD_NUM/results/test_evals.csv for each fold (FOLD_NUM).
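If you want to aggregate the per-fold results, a minimal sketch like the one below can be used. It assumes the Pan-Asian example layout above and a hypothetical "accuracy" column name, which may differ from the actual CSV header; adjust both to your setup.

import glob
import pandas as pd

# Collect the per-fold evaluation files (path pattern from the section above).
paths = glob.glob("src/exp/Pan-Asian_REF/outputs/*/results/test_evals.csv")
all_evals = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

# "accuracy" is a hypothetical column name; replace it with the actual header.
print(all_evals["accuracy"].mean())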

Some Notes

  • Data Preprocessing

This repository provides the following command to generate REFERENCE_PANEL_NAME_sample.bim:

python make_samplebim.py --ref REFERENCE_PANEL_NAME --data_dir DATA_DIR_NAME

REFERENCE_PANEL_NAME_sample.bim is an intermediate file obtained by extracting, from the SNP information in the reference panel, only the SNPs used for training. However, if you use data other than Pan-Asian and T1DGC, make_samplebim.py may not work as intended; in that case, make sure that REFERENCE_PANEL_NAME_sample.bim contains information only for the SNPs to be used for training. Please take appropriate action according to your reference panel and experiments.
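As a rough illustration of what the intermediate file should end up containing, the sketch below filters a .bim file down to SNP rows only. It assumes, purely as an example convention, that non-SNP markers can be recognized by ID prefixes such as "HLA_" or "AA_"; the actual naming depends on your reference panel, so adjust the filter accordingly.

# Minimal sketch: keep only SNP rows from the reference panel's .bim file.
# The prefix list is an assumption for illustration, not the rule used by
# make_samplebim.py; adapt it to your own reference panel.
NON_SNP_PREFIXES = ("HLA_", "AA_", "INS_")

with open("Pan-Asian_REF.bim") as src, open("Pan-Asian_REF_sample.bim", "w") as dst:
    for line in src:
        variant_id = line.split()[1]  # the second column of a .bim file is the variant ID
        if not variant_id.startswith(NON_SNP_PREFIXES):
            dst.write(line)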

  • About Embedding Layer

In Fig. 1(b) of the paper, the Embedding Layer is described as the process of splitting the SNP data, adding a cls token, and then converting it to feature vectors. However, for implementation reasons, EmbeddingLayer/hlarimnt.py in this code only implements the operation of converting already-split data into feature vectors. The process of splitting the data and adding the cls token is implemented in the make_train_data function of the DataProcessor class in src/exp/codes/data/dataset.py. The make_train_data function is called in src/exp/codes/train_model.py (around line 100).
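For reference, the sketch below illustrates the overall operation described in Fig. 1(b): splitting the SNP sequence into chunks, prepending a cls token, and mapping each chunk to a feature vector. It is a simplified illustration written for this README, not the actual implementation in EmbeddingLayer/hlarimnt.py or dataset.py; the chunk size, dimensions, and layer choices are placeholders.

import torch
import torch.nn as nn

class ToyEmbedding(nn.Module):
    """Simplified illustration: split SNPs into chunks, add a cls token, embed."""

    def __init__(self, chunk_size: int, dim: int):
        super().__init__()
        self.proj = nn.Linear(chunk_size, dim)           # one feature vector per chunk
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, snps: torch.Tensor) -> torch.Tensor:
        # snps: (batch, num_snps), padded so that num_snps is divisible by chunk_size
        batch = snps.size(0)
        chunks = snps.view(batch, -1, self.proj.in_features)  # (batch, num_chunks, chunk_size)
        feats = self.proj(chunks)                              # (batch, num_chunks, dim)
        cls = self.cls_token.expand(batch, -1, -1)             # (batch, 1, dim)
        return torch.cat([cls, feats], dim=1)                  # (batch, 1 + num_chunks, dim)

# Example: 8 samples, 1024 SNP values, chunks of 16, 64-dimensional features.
x = torch.randn(8, 1024)
print(ToyEmbedding(chunk_size=16, dim=64)(x).shape)  # torch.Size([8, 65, 64])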
