young55775/EMSForest-A-Machine-Learning-Enhanced-EMS-Mutagenesis-Probability-Map

EMS Mutagenesis Probability Map

Introduction

Our pipeline combines a Random Forest model with whole-genome sequencing (WGS) data from genetic suppressor screens to identify causative mutations without the need to generate recombinant inbred lines.

Publication

Zhengyang Guo, Shimin Wang, Yang Wang, Zi Wang, and Guangshuo Ou. (2024). A Machine Learning Enhanced EMS Mutagenesis Probability Map for Efficient Identification of Causal Mutations in Caenorhabditis elegans.

Getting Started

Dependencies

The pipeline requires the following packages, which you can install from PyPI using pip:

pip install scipy pandas tqdm matplotlib

Installation

The model and the gene range file can be downloaded from Zenodo.
The Python script can be downloaded from this repository.

The gene range file can be customized to match the latest version of the genome annotation.

Usage

Parameters

| Option | Description |
| --- | --- |
| --model | The path to the model file |
| --data | The directory where the VCF file is located |
| --ref | The path to the gene range file |
| --threshold | The threshold for excluding background mutations. When the same variation is observed n times in your data, it will be considered a background mutation and discarded. |
| --out | The path where the output file will be saved |
| --background | (Optional) The path to the background file which contains the WGS data of a pre-mutated worm |
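The --threshold rule can be illustrated with a short sketch. It is not the pipeline's own code: the function name `remove_background`, the tuple representation of a variant, and the sample names are hypothetical, and it assumes that "observed n times" means "present in at least n samples".

```python
from collections import Counter


def remove_background(samples, threshold):
    """Drop variants that appear in at least `threshold` samples.

    `samples` maps a sample name to a set of variants. The ">=" comparison
    is an assumption about how the threshold is applied.
    """
    # Count each variant once per sample in which it occurs.
    counts = Counter(v for variants in samples.values() for v in set(variants))
    background = {v for v, n in counts.items() if n >= threshold}
    return {name: set(variants) - background for name, variants in samples.items()}


# Illustrative variants as (chromosome, position, ref, alt); not real data.
samples = {
    "strain_a": {("I", 1000, "G", "A"), ("II", 2500, "C", "T")},
    "strain_b": {("I", 1000, "G", "A"), ("X", 42, "G", "A")},
    "strain_c": {("I", 1000, "G", "A"), ("II", 2500, "C", "T")},
}
filtered = remove_background(samples, threshold=3)
```

Here the variant shared by all three samples is treated as background and removed, while the variant shared by only two samples is kept.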

The gene range file should be a CSV file with four columns. The first column is the gene name, the second is the chromosome ('I', 'II', ..., 'X') where the gene is located, the third is the start position, and the fourth is the end position.
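When customizing the gene range file, it can help to check the four-column layout before running the pipeline. The sketch below is one possible check and is not part of the repository: `load_gene_ranges`, the `has_header` switch, and the example gene names and coordinates are hypothetical, and the chromosome set assumes the six nuclear chromosomes of C. elegans.

```python
import csv
import io

# Assumed set of valid chromosome labels ('I' through 'V', plus 'X').
VALID_CHROMOSOMES = {"I", "II", "III", "IV", "V", "X"}


def load_gene_ranges(handle, has_header=False):
    """Read (gene, chromosome, start, end) rows and validate each one."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip a header row if the file has one
    ranges = []
    for line_no, row in enumerate(reader, start=2 if has_header else 1):
        if len(row) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(row)}")
        gene, chrom, start, end = (field.strip() for field in row)
        if chrom not in VALID_CHROMOSOMES:
            raise ValueError(f"line {line_no}: unknown chromosome {chrom!r}")
        start, end = int(start), int(end)
        if start > end:
            raise ValueError(f"line {line_no}: start {start} is after end {end}")
        ranges.append((gene, chrom, start, end))
    return ranges


# Illustrative rows only; real files would come from the genome annotation.
example = io.StringIO("gene-a,I,1000,2500\ngene-b,X,300,900\n")
ranges = load_gene_ranges(example)
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the in-memory example.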

Examine the output

After a run, the output folder should contain a volcano_table.csv file and a volcano.jpg file.

The CSV file contains the fold change and p-value for each gene that carries mutations in the background-removed mutation pool. To find candidate genes, sort the table by p-value.
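Sorting by p-value can be done in a spreadsheet or with a few lines of Python. The sketch below uses only the standard library; the column names `gene`, `fold_change`, and `p_value` are assumptions, since the actual header of volcano_table.csv is not documented here, and the rows are illustrative rather than real results.

```python
import csv
import io


def rank_candidates(handle, p_column="p_value"):
    """Return the table rows sorted by ascending p-value."""
    rows = list(csv.DictReader(handle))
    return sorted(rows, key=lambda row: float(row[p_column]))


# Stand-in for volcano_table.csv with hypothetical column names and values.
example = io.StringIO(
    "gene,fold_change,p_value\n"
    "gene-a,1.2,0.40\n"
    "gene-b,8.5,0.0003\n"
    "gene-c,3.1,0.02\n"
)
ranked = rank_candidates(example)
top = [row["gene"] for row in ranked]
```

To use it on a real output file, pass `open(path, newline="")` and set `p_column` to whatever the p-value column is actually called.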

The JPG file is a volcano plot drawn based on the CSV file.
