Deletion Overlap Fingerprint Utility

Goals:

The goal of this workflow is to use deletions relative to a reference genome to make inferences about genome similarity and lineage. This work is part of my undergraduate research under Professor Volker Brendel at Indiana University.

The Python code finds all deletions that overlap by at least a specified percentage. We are aiming to find a "sweet spot" set of parameters that is the most useful.
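As a rough illustration of what "overlap of a specified percentage" can mean, here is a minimal sketch. It assumes deletions are half-open (start, end) coordinate pairs and that the percentage is measured against the length of the query deletion; the function names and the default threshold are hypothetical and not taken from the script.

```python
def overlap_percent(query, other):
    """Percentage of `query` covered by `other`.

    Both arguments are half-open (start, end) pairs (an assumption).
    """
    q_start, q_end = query
    o_start, o_end = other
    shared = min(q_end, o_end) - max(q_start, o_start)
    if shared <= 0:
        return 0.0  # the two deletions do not overlap
    return 100.0 * shared / (q_end - q_start)


def meets_threshold(query, other, threshold=80.0):
    """True when the overlap reaches the (hypothetical) threshold."""
    return overlap_percent(query, other) >= threshold
```

Measuring against the query deletion makes the result asymmetric; measuring against the shorter or longer of the two deletions is an equally plausible convention.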

The expected input is a set of .perGap files produced by a workflow that Dr. Chun-Yuan Huang developed and implemented. Huang's workflow is available at: https://github.com/huangc/WGvarINDEL

The code can easily be modified to support other file formats such as .psl and .bed, if you are interested in finding overlaps of features.

Requirements:

This code is fairly computationally cheap because an interval tree query takes roughly O(log N) time, plus the number of matches it reports, instead of a scan over every interval. You should therefore be able to run it on a local machine with a few GB of RAM.
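To illustrate why an indexed lookup is cheaper than comparing every deletion against every other, here is a dependency-free sketch. It is not the intervaltree package and not a true interval tree: it is a simpler stand-in that sorts intervals by start and uses the standard-library bisect module to discard intervals that begin after the query ends. The function names and the half-open coordinates are assumptions.

```python
from bisect import bisect_left


def build_index(intervals):
    """Sort half-open (start, end) intervals by start for binary search."""
    ordered = sorted(intervals)
    starts = [start for start, _ in ordered]
    return ordered, starts


def find_overlaps(index, q_start, q_end):
    """Return the indexed intervals that overlap [q_start, q_end)."""
    ordered, starts = index
    # Only intervals that begin before the query ends can overlap it.
    upper = bisect_left(starts, q_end)
    # The remaining candidates are still checked one by one on their end.
    return [(s, e) for s, e in ordered[:upper] if e > q_start]
```

A real interval tree also bounds the search on the end coordinate, which is what keeps queries fast when many intervals start early.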

This code uses several Python libraries that you must install. A quick how-to guide for installing Python packages is available at: http://python-packaging-user-guide.readthedocs.io/en/latest/installing/

The non-standard libraries required to run this code are:

intervaltree
xlwt

Input Parameters:

Several input parameters can be changed at the top of the Python code. Python Input:

Number of chromosomes on the genome
Number of inputted .perGap files
Minimum deletion length
Top "N" number of longest deletions to be looked at
Threshold for overlap percentage
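The parameters above could look something like the following block at the top of a script. This is only a sketch: the variable names, the example values, and the select_deletions helper are hypothetical, and deletions are assumed to be half-open (start, end) pairs.

```python
# Hypothetical parameter names and values, for illustration only.
NUM_CHROMOSOMES = 12          # number of chromosomes in the genome
NUM_FILES = 2                 # number of input .perGap files
MIN_DELETION_LENGTH = 100     # ignore deletions shorter than this
TOP_N = 3                     # examine only the N longest deletions
OVERLAP_THRESHOLD = 80.0      # minimum overlap percentage to report


def select_deletions(deletions, min_length=MIN_DELETION_LENGTH, top_n=TOP_N):
    """Drop deletions shorter than min_length, then keep the top_n longest."""
    long_enough = [d for d in deletions if d[1] - d[0] >= min_length]
    by_length = sorted(long_enough, key=lambda d: d[1] - d[0], reverse=True)
    return by_length[:top_n]
```

Applying the length filter before taking the top N keeps the two parameters independent of each other.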

The command line input simply requires the filepaths to each of the .perGap files you wish to compare. Command Line Input:

file names (at least 2 files are required)

Expected command line input: python program_name.py file1 file2 ... file8 > results.txt
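A minimal sketch of how the file arguments might be collected and checked against the two-file minimum. The function name and the usage message are hypothetical; the actual script may handle its arguments differently.

```python
import sys


def parse_file_args(argv):
    """Return the input file paths, requiring at least two of them."""
    paths = argv[1:]  # argv[0] is the program name
    if len(paths) < 2:
        raise SystemExit("usage: python program_name.py file1 file2 [file3 ...]")
    return paths


# Typical use inside the script would be: paths = parse_file_args(sys.argv)
```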

Outputs:

There are two significant outputs of this program:

The printed output to stdout, which you should probably redirect into a .txt file as shown above.
- This file contains every overlap found during the search
A .xls file made using the xlwt library.
- This .xls file is created in the same directory that the Python code is executed in.
- This .xls file contains a few sheets: Sheet 1 contains all overlaps found for each of the top N deletions. Sheet 2 contains the total percentage of nucleotide bases that overlap between each cultivar input. Sheet 3 contains the counts of overlaps that meet or exceed the overlap threshold set in the Python parameters between each cultivar input.
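The per-pair counts described for Sheet 3 could be tallied along these lines. This is a sketch under assumptions: the record layout (label_a, label_b, overlap_percent) and the function name are hypothetical, and writing the result to a spreadsheet is left out.

```python
from collections import Counter


def count_overlaps(records, threshold):
    """Count overlaps at or above threshold for each unordered pair of inputs.

    records is an iterable of (label_a, label_b, overlap_percent) tuples.
    """
    counts = Counter()
    for label_a, label_b, percent in records:
        if percent >= threshold:
            # Sort the labels so (YB1, YB2) and (YB2, YB1) share one key.
            counts[tuple(sorted((label_a, label_b)))] += 1
    return counts
```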

Notes:

Currently, there is no functionality to name outputs after the input files. For now, each input is labeled YBX, where X is its position in the order of the input files.

e.g. "python program_name.py file1 file2" produces YB1 and YB2 as the names for file1 and file2 respectively.
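The positional labeling described above can be sketched as follows. The function name is hypothetical, and the prefix is exposed as a parameter only for illustration.

```python
def label_inputs(paths, prefix="YB"):
    """Map positional labels (YB1, YB2, ...) to input paths in order."""
    return {f"{prefix}{i}": path for i, path in enumerate(paths, start=1)}
```

Keeping the mapping in one place would also make it straightforward to later derive labels from the file names instead.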

Future Plans:

Implement a data analytics Python program using the Python library pandas.
Implement useful graphs using the matplotlib and scipy Python libraries.
Add different options for output other than .xls, such as .csv and tab-delimited .txt.
