ucrbioinfo/AMPS

AMPS

AMPS is a Python tool for context-specific DNA methylation prediction using a deep neural network. The prediction can be based on the sequence, on the sequence plus annotation, or on the methylation levels of neighboring cytosines.

Dependencies

  1. TensorFlow 2.3 or later
  2. biopython
  3. pandas
  4. numpy

Preprocess

Mapping WGBS data

If you are starting from raw WGBS reads, map them first. The file mapping_scripts.txt shows sample commands for the following steps:

  • Check the quality of the reads
  • Trim the reads with low quality
  • Map the reads using Bismark:
    • Prepare genome
    • Map the reads
    • genome-wide methylation extractor

Inputs

AMPS uses three inputs:

  • DNA Sequence, which should be in Fasta format
  • Methylations, which contains the methylation and context information of cytosines. This file is based on the output of the methylation extractor of Bismark

    chromosome position strand count-methylated count-unmethylated C-context trinucleotide-context

  • Annotations, a table containing the coordinates of annotated functional elements or repeats; it must be in GFF3 format.
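The methylation file layout listed above can be read with a few lines of Python. This is only a minimal sketch, not code from AMPS: it assumes the columns are tab-separated (as in a Bismark cytosine report), and the function name `read_methylation_rows` and the derived `coverage` and `level` fields are hypothetical names introduced for illustration.

```python
import csv
import io

# Column order follows the layout described above.
# The tab delimiter is an assumption about the file.
COLUMNS = [
    "chromosome", "position", "strand", "count_methylated",
    "count_unmethylated", "c_context", "trinucleotide_context",
]


def read_methylation_rows(handle):
    """Yield one dict per cytosine, adding coverage and methylation level."""
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, fields))
        row["position"] = int(row["position"])
        meth = int(row["count_methylated"])
        unmeth = int(row["count_unmethylated"])
        row["count_methylated"] = meth
        row["count_unmethylated"] = unmeth
        row["coverage"] = meth + unmeth
        # Level is undefined when no read covers the cytosine.
        row["level"] = meth / row["coverage"] if row["coverage"] else None
        yield row


# Two made-up rows, only to show the shape of the data.
sample = "chr1\t101\t+\t8\t2\tCG\tCGA\nchr1\t105\t-\t0\t0\tCHH\tCAT\n"
rows = list(read_methylation_rows(io.StringIO(sample)))
```

In practice the handle would come from `open(path)` on the methylation file rather than from an in-memory string.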

Sequence based

train.py is the module for training AMPS to predict methylation from the sequence and annotation. If no annotation file is passed to the module, the model is trained on the sequence only. Module options:

  1. -m, --methylation_file: methylation file address, required
  2. -g, --genome_assembly_file: genome sequence file address, must be in fasta format, required
  3. -c, --context: context, required
  4. -ga, --gene_file: gene annotation file address
  5. -ra, --repeat_file: repeat annotation file address
  6. -iga, --include_gene: does the predictor include the gene annotation in the input? True/False
  7. -ira, --include_repeat: does the predictor include the repeat annotation in the input? True/False
  8. -tr, --train_size: training dataset size, number of inputs for training
  9. -ct, --coverage_threshold: minimum number of reads for including a cytosine in the training/testing dataset
  10. -on, --organism_name: sample name, for saving the files
  11. -mcs, --memory_chunk_size: number of inputs in each memory load
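To illustrate how the context and coverage threshold options narrow down the cytosines used for training, here is a hedged sketch. It is not the selection code in train.py: the function name `filter_cytosines`, the tuple layout, and the choice of "coverage greater than or equal to the threshold" are assumptions based on the option descriptions above.

```python
def filter_cytosines(records, context, coverage_threshold):
    """Keep cytosines in the requested context with enough read coverage.

    records: iterable of (count_methylated, count_unmethylated, c_context).
    Coverage is taken to be methylated + unmethylated read counts.
    """
    kept = []
    for meth, unmeth, c_context in records:
        coverage = meth + unmeth
        if c_context == context and coverage >= coverage_threshold:
            kept.append((meth, unmeth, c_context))
    return kept


# Made-up counts: only the first and last records are CG with >= 10 reads.
records = [(8, 2, "CG"), (1, 1, "CG"), (9, 1, "CHG"), (0, 12, "CG")]
selected = filter_cytosines(records, context="CG", coverage_threshold=10)
```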

As a sample you can run:

python train.py -m ./sample/sample_methylations_train.txt -g ./sample/sample_seq.fasta -ga ./sample/sample_gene_annotation.txt -ra ./sample/sample_repeat_annotation.txt -c CG

This module trains a model and saves it in the ./models/ directory. The saved model can then be loaded and applied to any desired set of cytosines with test.py.

test.py loads the trained model to predict the binary methylation status of all the cytosines listed in the methylation file. The output of the prediction is a binary vector. Each vector element corresponds to a cytosine in the provided methylation file. This vector will be saved in the ./output/ folder. Module options:

  1. -m, --methylation_file: methylation file address, required
  2. -mdl, --model_address: trained model address, required
  3. -g, --genome_assembly_file: genome sequence file address, must be in fasta format, required
  4. -ga, --gene_file: gene annotation file address
  5. -ra, --repeat_file: repeat annotation file address
  6. -iga, --include_gene: does the predictor include the gene annotation in the input? True/False
  7. -ira, --include_repeat: does the predictor include the repeat annotation in the input? True/False
  8. -on, --organism_name: sample name, for saving the files

As a sample you can run:

python test.py -mdl ./models/sample_organismCG.mdl/ -m ./sample/sample_methylations_test.txt -g ./sample/sample_seq.fasta -ga ./sample/sample_gene_annotation.txt -ra ./sample/sample_repeat_annotation.txt

The output is a text file containing a binary vector saved in ./output/ folder.

To calculate the accuracy, use the accuracy_calc.py module. It takes the predicted binary vector and either a methylation file or another binary vector. If a methylation file is provided, the module derives the methylation status of each cytosine in that file and compares it with the predicted binary vector; otherwise, the two binary vectors are compared directly. The accuracy is the number of correct predictions divided by the size of the test set.

Module options:

  1. -pr, --y_predicted: address to the predicted binary vector file, required
  2. -te, --y_true: address to true methylation status binary vector file
  3. -m, --methylation_file: address to ground truth methylation file. The methylation file must contain all the columns of bismark output

Sample command:

python accuracy_calc.py -pr ./output/sample_organism.txt -m ./sample/sample_meth_profile_test_ground_truth.txt
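The accuracy described above (correct predictions over the test size) can be sketched in plain Python. This is an illustration, not the accuracy module's code: the function names and the 0.5 cutoff used to turn read counts into a binary status are assumptions.

```python
def binarize(count_methylated, count_unmethylated, cutoff=0.5):
    """Turn read counts into a binary status; the 0.5 cutoff is an assumption."""
    coverage = count_methylated + count_unmethylated
    if coverage == 0:
        raise ValueError("cytosine has no read coverage")
    return 1 if count_methylated / coverage >= cutoff else 0


def accuracy(y_predicted, y_true):
    """Fraction of positions where the two binary vectors agree."""
    if len(y_predicted) != len(y_true):
        raise ValueError("vectors must have the same length")
    if not y_true:
        raise ValueError("vectors must not be empty")
    correct = sum(1 for p, t in zip(y_predicted, y_true) if p == t)
    return correct / len(y_true)


# Ground truth derived from made-up (methylated, unmethylated) counts.
counts = [(9, 1), (0, 10), (2, 8), (7, 3)]
y_true = [binarize(m, u) for m, u in counts]
y_predicted = [1, 0, 1, 1]
score = accuracy(y_predicted, y_true)
```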

Methylation-profile based

train_methprofile.py trains a model that predicts the methylation of a cytosine from the methylation levels of its neighboring cytosines. Module options:

  1. -m, --methylation_file: methylation file address, required
  2. -c, --context: context, required
  3. -tr, --train_size: training dataset size, number of inputs for training
  4. -ct, --coverage_threshold: minimum number of reads for including a cytosine in the training/testing dataset
  5. -on, --organism_name: sample name, for saving the files

A sample code to run this module on the sample data:

python train_methprofile.py -m ./sample/sample_methylations_train.txt -c CG

The trained model is saved in the ./models/ folder. test_methprofile.py can then predict the binary methylation status for a sample of cytosines. This module's input is a set of cytosine profiles provided in a tab-separated file. Each row of the file should contain the methylation levels of the cytosines neighboring one target cytosine. For example, below is a cytosine profile with a window size of 20 centered on the unknown cytosine (the methylation levels of 10 cytosines downstream and 10 cytosines upstream):

0.76190476, 0.67857143, 0.6875, 0.94366197, 1., 0.88235294, 0.6875, 0.91304348, 0.94444444, 1., 0.92, 0.8125, 0.91666667, 0.81481481, 0.82758621, 0.60606061, 0.95833333, 1., 1., 0.92307692

This can be a row in the cytosine profiles file. The inputs of the test_methprofile.py module are:

  1. -p, --prfiles_address: address of the file containing the cytosine profiles, a tab-separated file in which each row holds the methylation levels of neighboring cytosines, required
  2. -mdl, --model_address: trained model address, required
  3. -on, --organism_name: sample name, for saving the files

The output is a text file containing a binary vector saved in ./output/ folder.

A sample code to run test_methprofile.py on sample data:

python test_methprofile.py -p ./sample/sample_meth_profile_test.txt -mdl ./models/sample_organismCG_methprofile.mdl
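The profile rows described in this section can be built from a position-ordered list of methylation levels. The sketch below is an assumption-laden illustration rather than AMPS code: the function names are hypothetical, cytosines without a full set of neighbors are simply skipped, and rows are written tab-separated as the option description states.

```python
def neighbor_profile(levels, index, window_size=20):
    """Return the levels of window_size/2 neighbors on each side of index.

    levels: methylation levels ordered by genomic position.
    Returns None when the cytosine is too close to either end
    (skipping such cytosines is an assumption of this sketch).
    """
    half = window_size // 2
    if index < half or index + half >= len(levels):
        return None
    # Exclude the target cytosine itself; its status is what gets predicted.
    return levels[index - half:index] + levels[index + 1:index + half + 1]


def format_profile_row(profile):
    """Format one profile as a tab-separated row."""
    return "\t".join(f"{value:.8f}" for value in profile)


# Made-up levels for 30 consecutive cytosines.
levels = [i / 100 for i in range(30)]
profile = neighbor_profile(levels, index=15)
row = format_profile_row(profile)
```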

Methylation-profile + Sequence based prediction

Another option for cytosine methylation prediction is to combine the methylation levels of neighboring cytosines with the sequence and annotations. We developed two separate modules to train and test a model on this input set. To train a model using both methylation profiles and sequence-based features, use train_combo.py. Module options:

  1. -m, --methylation_file: methylation file address, required
  2. -g, --genome_assembly_file: genome sequence file address, must be in fasta format, required
  3. -c, --context: context, required
  4. -ga, --gene_file: gene annotation file address
  5. -ra, --repeat_file: repeat annotation file address
  6. -iga, --include_gene: does the predictor include the gene annotation in the input? True/False
  7. -ira, --include_repeat: does the predictor include the repeat annotation in the input? True/False
  8. -tr, --train_size: training dataset size, number of inputs for training
  9. -ct, --coverage_threshold: minimum number of reads for including a cytosine in the training/testing dataset
  10. -on, --organism_name: sample name, for saving the files
  11. -mcs, --memory_chunk_size: number of inputs in each memory load

It trains a model and saves it in the ./models/ folder.

A sample command to run this module on the sample data:

python train_combo.py -m ./sample/sample_methylations_train.txt -g ./sample/sample_seq.fasta -ga ./sample/sample_gene_annotation.txt -ra ./sample/sample_repeat_annotation.txt -c CG

To test the model and predict the methylation status of unknown cytosines from methylation-profile and sequence-based features, use the test_combo.py module. Module options:

  1. -m, --methylation_file: methylation file address, required
  2. -mdl, --model_address: trained model address, required
  3. -g, --genome_assembly_file: genome sequence file address, must be in fasta format, required
  4. -p, --prfiles_address: address of the file containing the cytosine profiles, a tab-separated file in which each row holds the methylation levels of neighboring cytosines, required
  5. -ga, --gene_file: gene annotation file address
  6. -ra, --repeat_file: repeat annotation file address
  7. -iga, --include_gene: does the predictor include the gene annotation in the input? True/False
  8. -ira, --include_repeat: does the predictor include the repeat annotation in the input? True/False
  9. -on, --organism_name: sample name, for saving the files

A sample code to use this test module on the sample data:

python test_combo.py -p ./sample/sample_meth_profile_test.txt -mdl ./models/sample_organismCG_combo.mdl -m ./sample/sample_methylations_test.txt -g ./sample/sample_seq.fasta -ga ./sample/sample_gene_annotation.txt -ra ./sample/sample_repeat_annotation.txt

The predicted results are saved in the ./output/ folder.

Motif Finding

Model interpretability is implemented in the motif_finding.py module. It takes a pre-trained model and a set of sequences in a fasta file, and writes out a file containing the most important part of each sequence. The output file is in fasta format and is saved in the ./motifs/ directory. The module uses Grad-CAM to compute the activation map vector. It then selects the most important sub-sequence by sliding a window of length fifty along the input and reporting the window with the highest average activation. You can pass the output of this module to MEME and TOMTOM to find important motifs and match them against motif databases. Module options:

  1. -mdl, --model_address: trained model address, required
  2. -seqs, --sequence_file: fasta file containing the sequences in which you want to find motifs, required
  3. -ms, --motif_size: size of motifs to search in the input set
  4. -o, --output: output_file_name

As a sample you can run this over a sample of sequences provided in this repository:

python motif_finding.py -mdl ./models/sample_organismCG.mdl -seqs ./sample/motif_input_sample.fa
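The window-selection step described in this section (slide a fixed-length window and keep the one with the highest average activation) can be sketched as follows. This is not the code in motif_finding.py: the function name `best_window` is hypothetical, the activation values are made up, and computing the Grad-CAM activation map itself is outside the scope of the sketch.

```python
def best_window(activation, window=50):
    """Return (start, mean) of the window with the highest mean activation.

    Uses a running sum so each step costs O(1). Ties keep the earliest window.
    """
    if len(activation) < window:
        raise ValueError("activation map is shorter than the window")
    current = sum(activation[:window])
    best_sum, best_start = current, 0
    for start in range(1, len(activation) - window + 1):
        # Slide by one: add the entering value, drop the leaving one.
        current += activation[start + window - 1] - activation[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_sum / window


# Made-up activation map with a high-activation stretch at positions 120..169.
activation = [0.0] * 200
for i in range(120, 170):
    activation[i] = 1.0
sequence = "ACGT" * 50  # placeholder sequence of the same length

start, mean_activation = best_window(activation, window=50)
important_subsequence = sequence[start:start + 50]
```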

GeneBody methylation

Methylation analysis of the gene body and flanking regions is implemented in gene_body_analysis.py. This module divides the flanking regions and the gene body into several bins and then calculates the genome-wide average methylation in each bin. Module inputs:

  1. -m, --methylation_file: methylation file address, required
  2. -g, --genome_assembly_file: genome sequence file address, must be in fasta format, required
  3. -c, --context: context, required
  4. -a, --annotation_file: annotation file address
  5. -ct, --coverage_threshold: minimum number of reads for including a cytosine in the training/testing dataset
  6. -bn, --bin_number: number of bins for genebody and flanking regions

The output is two NumPy vectors, each containing the average methylation levels for the bins in the template or nontemplate strands. The numbers come in the order of downstream flanking region, gene body, and upstream flanking region.
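The binning idea can be illustrated for a single region. This is a simplified sketch under stated assumptions, not the logic of gene_body_analysis.py: the function name `average_by_bin` is hypothetical, the region is treated as a half-open interval of equal-width bins, and strand handling and genome-wide aggregation across genes are left out.

```python
def average_by_bin(cytosines, start, end, bin_number):
    """Average methylation level per bin over the region [start, end).

    cytosines: iterable of (position, level) pairs.
    Bins without any cytosine are reported as None.
    """
    sums = [0.0] * bin_number
    counts = [0] * bin_number
    length = end - start
    for position, level in cytosines:
        if not start <= position < end:
            continue  # outside the region of interest
        # Integer arithmetic maps the position to a bin index in [0, bin_number).
        bin_index = (position - start) * bin_number // length
        sums[bin_index] += level
        counts[bin_index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Made-up cytosines in a region spanning positions 100..199, split into 4 bins.
cytosines = [(100, 1.0), (110, 0.0), (150, 0.5), (199, 0.5), (250, 1.0)]
bin_means = average_by_bin(cytosines, start=100, end=200, bin_number=4)
```

Applying the same function to the upstream flank, the gene body, and the downstream flank of every gene, and then averaging bin by bin, would give one genome-wide vector per strand.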
