GitHub - aasoni/genomics-project: JHU Spring 2013: genomics final project

Repository contents (35 commits):
Bacillus_anthracis
Clodis_difficile
Ecoli
hmm
markov_chain
monocytogenes
species_test
staph
tuberculosis
0_threshold.pdf
BloomFilter.java
README
aasoni1_final_project_writeup.pdf
bloom_driver.java
enh_fb.fa
extract_features.py
nullseqsi_200_1.fa
result_table
threshold.pdf
train_test_creator.py
useful_links


Computational Genomics 2013 JHU Final Project

Put main notes here (date format: year.month.day)
13.4.12
enhancer data:
sequences around 800bp long
kmer size to use: 20
number of enhancer sequences: 2500
number of negative (not enhancer) sequences: 4000

use java for bloomfilter

-working on java bloomfilter - Kyle 13.4.12
-working on kmer neighborhoods (Hamming distance) in python - Alessandro 13.4.12
-currently working on hmm tutorial in matlab (profile analysis) - Guannan 13.4.14
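
A minimal sketch of what the kmer neighborhood step could look like. This is not the code in extract_features.py; the function names `kmers` and `hamming_neighborhood` and the fixed ACGT alphabet are assumptions made for illustration.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def kmers(seq, k):
    """Yield every overlapping kmer of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hamming_neighborhood(kmer, d):
    """Return the set of strings within Hamming distance d of kmer."""
    neighbors = {kmer}
    for n_mismatch in range(1, d + 1):
        # Choose which positions to change, then which bases to put there.
        for positions in combinations(range(len(kmer)), n_mismatch):
            for subs in product(ALPHABET, repeat=n_mismatch):
                chars = list(kmer)
                for pos, base in zip(positions, subs):
                    chars[pos] = base
                # Substituting the original base gives a closer neighbor;
                # the set removes those duplicates.
                neighbors.add("".join(chars))
    return neighbors


if __name__ == "__main__":
    first = next(kmers("ACGTACGTACGTACGTACGTACGT", 20))
    print(len(hamming_neighborhood(first, 1)))
    print(len(hamming_neighborhood(first, 2)))
```

For a kmer of size 20, the neighborhood holds 61 strings at distance 1 and 1771 at distance 2, so enumerating neighborhoods for every kmer of every sequence grows quickly. That growth may be one reason a first draft of this step is slow.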

-first draft of kmer neighborhood feature extraction (terrible performance) - Alessandro 13.4.14
-first draft of bloomfilter (runs fast but ~50% accuracy with defaults) - Kyle 13.4.17
kmer_size | % positive called correctly | % negative called correctly | average difference of +/- kmer calls per read
100       | 46.00%                      | 52%                         | 5.9293
50        | 49.75%                      | 50.4%                       | 7.1878
30        | 54.81%                      | 54.5%                       | 7.8611
20        | 50.89%                      | 54.5%                       | 8.5579
10        | 0.16%                       | 99.9%                       | 67.8815
5         | 0%                          | 100%                        | 0

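I think that as the kmer size gets small, the larger negative input file
produces so many false-positive kmer matches that every read is identified
as negative (not classified as an enhancer). At kmer size 5 there are only
4^5 = 1024 possible kmers, so both input files likely contain nearly all of
them, which would fit the average difference of 0 in the table above.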
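
A rough back-of-the-envelope check of this explanation, assuming every sequence is exactly 800bp (the notes only say "around 800bp"). The helper names are hypothetical. The position counts are upper bounds on the number of distinct kmers in each file, not measured values.

```python
def kmer_space(k):
    """Number of possible kmers of length k over the alphabet ACGT."""
    return 4 ** k


def approx_kmer_positions(num_seqs, seq_len, k):
    """Upper bound on distinct kmers: one kmer per start position."""
    return num_seqs * max(seq_len - k + 1, 0)


if __name__ == "__main__":
    # Counts from the notes: 2500 enhancer and 4000 negative sequences.
    for k in (100, 50, 30, 20, 10, 5):
        space = kmer_space(k)
        pos = approx_kmer_positions(2500, 800, k)
        neg = approx_kmer_positions(4000, 800, k)
        print(k, space, pos, neg, "saturated" if neg >= space else "sparse")
```

Under these assumptions the negative file has about 3.2 million kmer positions at kmer size 10, while only 4^10 = 1,048,576 distinct kmers exist. For kmer sizes 20 and above the kmer space is far larger than either file, so chance matches should be much rarer.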

Either way, with the default hash functions only about 50% of reads are classified correctly.
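
For comparison, a small sketch of a two-filter classifier in python. This is not BloomFilter.java or bloom_driver.java: the SHA-256 double hashing, the filter size, the number of hashes, and the rule that ties go to negative are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, item):
        # Double hashing: derive all indexes from two 64-bit halves of
        # a single SHA-256 digest (an assumed hash choice).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))


def build_filter(seqs, k, num_bits=1 << 20, num_hashes=4):
    """Insert every kmer of every sequence into a new filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            bf.add(seq[i:i + k])
    return bf


def classify(read, k, pos_filter, neg_filter):
    """Return (is_enhancer, pos_hits - neg_hits) for one read."""
    pos_hits = neg_hits = 0
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        pos_hits += kmer in pos_filter
        neg_hits += kmer in neg_filter
    # Assumption: a tie is called negative.
    return pos_hits > neg_hits, pos_hits - neg_hits
```

A Bloom filter never misses an item that was added, but it can report items that were not. A filter that is too small for the input, or hash functions that spread kmers poorly, would raise the hit counts on both sides and shrink the difference that the classifier depends on.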