GenomeMatrixProcessing

Python functions and scripts for parsing regions out of a genome matrix file (see http://1001genomes.org/data/MPI/MPICao2010/releases/current/genome_matrix/) and converting to other another format (i.e. FASTA) or doing analysis directly on the extracted region

Useage

The central script in this pipeline is "0_GetGenomeMatrixLineIndex.py" which for each sequence in the genome matrix, created a single column index [target_file].index.[seq] where there value of the Nth line is the offset for reading the line in the genome matrix file which represent the Nth base of a chromsome. A reference file, [target_file].index.ref while recording the original target file and the relationship between sequence and each index

Once the index for the genome matrix file has been created, the following script can be used to process the genome matrix:

1a_GetRegions.py: extracts regions from the the genome matrix file into a single file where each region is seperated by a header line beginning with ">" *The regions file is a three column file with the following format: [seq]\t[start]\t[stop]\n *The name of each region in the header line will be [seq][start][stop]
1b_GetRegionFASTA.py: extracts the base sequeces of region from the genome matrix and write the sequences of each region into a seperate multi-FASTA file *Uses the same regions file format as above *Outfile files have the following format [region_file].[seq][start][stop].fasta *Sequences within each fasta are named sequencetial, beginning with "0" as the referrence sequence
1c_GetRegionStats.py: calculates stats (Nt Diversity and Tajima's D) for the sequences in a defined region *Uses the same regions file format as above *Outfiles have the following format: [Region]\t[NtDiversity]\t[TajimasD]\n *If there are no differential sites in a region, "NoDiffSites" will in that cell instead of a float value
1d_GetRegionStats_fromBED.py: preforms the same function as 1c_GetRegionStats.py but uses a standard BED format file instead of a regions file *BED format file should be in the following format: [Seq]\t[Start]\t[Stop]\t[SeqName]\n. All four values will be used to name regions *Output format is the same as 1c_GetRegionStats.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GenomeMatrixProcessing

Useage

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
0_GetGenomeMatrixLineIndex.py		0_GetGenomeMatrixLineIndex.py
1a_GetRegions.py		1a_GetRegions.py
1b_GetRegionFASTA.py		1b_GetRegionFASTA.py
1c_GetRegionStats.py		1c_GetRegionStats.py
1d_GetRegionStats_fromBED.py		1d_GetRegionStats_fromBED.py
README.md		README.md

ShiuLab/GenomeMatrixProcessing

Folders and files

Latest commit

History

Repository files navigation

GenomeMatrixProcessing

Useage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages