This repository has been archived by the owner on Aug 25, 2021. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathWrite_An_Assembler.py
917 lines (749 loc) · 39.9 KB
/
Write_An_Assembler.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>
# <markdowncell>
# Write A Genome Assembler with ``blasr`` and ``(I)Python``
# ==========================================================
# <markdowncell>
# Introduction
# ============
#
# I am not sure it is right thing to do some coding in a vacation. (See this tweet
# https://twitter.com/pm/status/284102222416056320, not sure if I would agree.)
# Anyway, before the vacation, I decided to organize whole bunch of random papers
# I collected in the last few years in my laptop. I eventually felt that I should
# read some of some theoretical papers about genome assembly again. I grabbed Gene
# Meyer's paper (
# http://bioinformatics.oxfordjournals.org/content/21/suppl_2/ii79.abstract ),
# and started to think about the problem about constructing unitigs
# (high-confident contigs). Before I tried to read the paper in detail, I just
# felt maybe that it was useful to write some quick code to check out what real
# data looked like. I started writing some code visualizing the local overlapping
# and generating the global overlapping graph. It was pretty straight forward and quite
# inspiring from the visualization. The visualization motived me to write
# more ad-hoc code in IPython to go beyond generating simple visualization during the
# vacation. I started to implement a very simple greedy algorithm to connect the
# input DNA fragments. Eventually, I found I can atucally get the whole genome
# assembly right!! This Notebook shows the work step by step toward a simple genome
# assembler for long read data using ``IPython`` and ``Python``.
#
# The input data was the E. coli K12 data from
# https://github.com/PacificBiosciences/DevNet/wiki/Hierarchical-Genome-Assembly-Process-%28HGAP%29.
# I generated a new set of pre-assembled reads as the input DNA fragment using
# HBAR-DTK ( https://github.com/PacificBiosciences/HBAR-DTK ). The statistics of
# the input reads is shown below:
#
# Pre-assembled Read Stats
# #Seqs 17497
# Min 448
# 1st Qu. 5365
# Median 6149
# Mean 6193
# 3rd Qu. 7189
# Max 14762
# Total 108363627
# n50 6521
# n90 4694
# n95 3864
#
# This is about 23x of average length of 6.2kb reads of E. coli K12 genome. The
# mean sequence identity when aligning the pre-assembled reads to the canonical
# E. coli K12 reference is 99.787% (median identity = 99.921%). The pre-assembled
# read data is available for download from https://www.dropbox.com/s/ybsxsfe5ppb04hx/pre_assembled_reads.fa. According the
# Lander-Waterman statistics, this set of reads is enough to get single contig
# for the single chromosome. The only thing one needs to worry about is that
# whether the repeats in genome can be uniquely. Since it is shown one can use
# Celera Assembler to assemble this data set into one single contig, this implies
# that all of the repeats of the rRNA operon can be resolved from this data set.
# Typically, one might need multiple libraries, especially those long jumping
# libraries, to resolve long repeats from short read (~100bp) data. For example,
# ALLPATHS-LG can use multiple short-read libraries and long read data to get
# perfect assemblies for bacteria
# ( http://genome.cshlp.org/content/early/2012/07/24/gr.141515.112.abstract and
# http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=14 ). However,
# when one can get long range information from one single library, the assembly
# process can be probably greatly simplified. As demonstrated here, one can
# actually get pretty good assembly using Python code with blasr for overlapping and
# Quiver for polishing the residual errors.
#
# For long read data, the de Bruijn graph with some sophisticated graph
# operations to get around the limit of short read data to assemble genomes may
# not be the right approach. The Overlap-Layout-Consensus can be much more
# efficient for assembling long reads. Also, when the reads are getting longer,
# the overlapping between the reads becomes more and more specific. This will
# significantly reduce the complexity of the overlapping graph between the reads.
# (Here we define an overlapping graph as a graph where the nodes are reads and
# the edges are constructed from the pairs of proper overlapped reads.) When the
# overlapping graph is simple, it becomes possible to use some naive algorithm to
# get a reasonably good assembly. To make this clear, one can image if the read
# lengths are as long as the genome, the number of overlapping one needs to worry
# about will be in single digit and there will be no ambiguity at all. The
# overlapping and layout will be trivial in such extreme case.
#
# When I use Celera Assembler (CA) to assemble genomes, I always try to peek the
# intermediate data in order to understand the process better and maybe fix few
# things manually. Michael Schatz (http://schatzlab.cshl.edu/ , twitter:
# @mike_schatz ) and Adam Phillippy ( twitter: @aphillippy ) both pointed to me
# some very useful information about CA. (Thanks, Michael and Adam.)
# ``tigStore`` is a very useful command to see the relationship between the
# generated unitigs and the initial DNA fragments. The file
# ``4-unitigger/best.edges`` contains the information best overlapped reads which
# one can use to construct the overlapping graph and one can learn a lot subtle
# points about genome assembly by visualizing those. This notebook is definitely
# inspired by the great work done by the CA's developers.
#
# This IPython notebook is just a "Notebook". I write these code for fun and for
# (self-)educational purposes. It shows that given the good long read data and
# the proper tools ``blasr`` ( https://github.com/PacificBiosciences/blasr ),
# ``pbdagcon/HBAR-DTK`` ( https://github.com/PacificBiosciences/pbdagcon ,
# https://github.com/PacificBiosciences/HBAR-DTK ), and ``quiver``
# ( https://github.com/PacificBiosciences/GenomicConsensus) one can learn how to
# do genome assembly with a "not-so-high-performance" language ``Python``. The
# great ``IPython Notebook`` ( http://ipython.org/ ) allows me to try out different
# code snippet and different algorithm ideas without a lot of overhead. One
# should notice this code in this ``IPython Notebook`` is mainly for fun, not for
# "production". I chose the least resistant path for coding. Coding style and
# code performance were considered but not taking seriously as I was doing a lot
# of experiments for different algorithm approaches when I coded this notebook.
#
# I consider this Notebook is as a personal project for fun. The data and the
# used software can be all downloaded publicly. Although I am an employee of
# PacBio(R), this notebook is developed in my own time. PacBio will not be
# responsible for any information associated with this Notebook.
#
# --Jason Chin, Apr 2, 2013
#
# <markdowncell>
# Get Data
# --------
# I put the pre-assembled read file in DropBox. The URL is https://www.dropbox.com/s/ybsxsfe5ppb04hx/pre_assembled_reads.fa.
# For most of part of the Notebook, one does not need the fasta file. You can also visit https://www.dropbox.com/sh/m2bjpwdua6t3iuk/3xgt1Zi9Go to see other data files in the same DropBox folder.
# <markdowncell>
# Overlapping
# -----------
# The first step to assemble the genome is to use ``blasr`` to find the overlapping between the reads. This can be done by executing this commend:
#
# blasr pre_assembled_reads.fa pre_assembled_reads.fa -m 4 -nproc 32 -bestn 32 -maxLCPLength 15 -nCandidates 36 -out pr_pr_bn32.m4 -noSplitSubreads
#
# We ask ``blasr`` to output the coordinates of the alignment between the reads by specified the ``-m 4`` option.
# The ``-nproc 32`` will let ``blasr`` to use 32 threads. For each reads, we allow ``blasr`` to find the other 32 best overlaps by using ``-bestn 32`` and scaning 36 potential hit candidateswith ``-nCandidates 36``. The ``-noSplitSubreads`` has only cosmatic effect here. It makes the ouput query identifier cleaner.
#
# ``blasr`` is a general purpose aligner. It is not really tuned to do overlapping work. One will see sometimes it does not handle the ends of the alignment as one would like it to do. The ``containment_tolerance`` variables used below was introduced for being less picky at the ends of the alignment. However, one would need to worry about potential mis-assemblies if the tolerance is too big.
#
# The aligned results can be found in the DropBox folder mentioned above. Or, one can download it from this link https://www.dropbox.com/s/64rq78l9t74afkb/pr_pr_bn32.m4.
# <markdowncell>
# "``blasr -m 4``" output format
# ------------------------------
# The cells below show the content of the ``m4`` file. It containes the alignment information between the reads. The fields are ``query identifier``, ``target identifier``, ``alignment score (more negative -> better alignment)``, ``alignment identity``, ``query strand``, ``query start``, ``query end``, ``query length``, ``target strand``, ``target start``, ``target end``, ``target length``. The rest of the fields are not used. One trap here is that the "``target start``" and "``target end``" are not relative to the original sequence if "``target strand``" is ``1`` . When "``target strand= 1``", it means that the reverse-complimentary strand is aligned and the "``target start``" and "``target end``" are the coordinates in the reverse complimentary strand. In the stand ``blasr`` output, "``query strand``" is always ``1``.
# <codecell>
#!head -10 pr_pr_bn32.m4
# <markdowncell>
# Convert ``blasr``'s output to python internal data
# --------------------------------------------------
#
# We deinfe ``read_node`` class for the objects that holds the local overlapping information of each read. It simply holds a ``python`` dictionary (hash_map) where the key is the ``target identifier`` and the elements from ``m4`` line are converted to the right types and stored. Only "non-contaminated"" alignments are stored. If one read is full aligned within the other read, such alignment is not stored in a ``read_node``. The ``get_overlap_data`` function processes the ``m4`` file and determine whether the alignments are ``contaminated`` with some tolerence of the ends of the alignments.
# <codecell>
import sys
import os
class read_node_0: #experimenting on keeping the data in a list rather than a dictionary, not used
def __init__(self, name, length):
self.name = name
self.length = length
self.hits = []
self.hit_map = None
def add_hit(self, aln_info):
self.hits.append ( aln_info )
def __getitem__(self, target_name):
if self.hit_map == None:
self.hit_map = dict(zip( [h[0] for h in self.hits], self.hits ) )
return self.hit_map[target_name]
class read_node:
def __init__(self, name, length):
self.name = name
self.length = length
self.hit_map = {}
def add_hit(self, aln_info):
self.hit_map[aln_info[0]] = aln_info
def __getitem__(self, target_name):
return self.hit_map[target_name]
@property
def hits(self): #another convenient representation of the same data
return self.hit_map.values()
def get_overlap_data(m4_filename):
containment_tolerance = 50
permitted_error_pct = 2
overlap_data = {}
contained_reads = set()
with open(m4_filename) as m4_f:
for l in m4_f:
l = l.strip().split()
q_name, t_name =l[0:2]
if q_name == t_name:
continue
aln_score = int(l[2])
aln_idt = float(l[3])
if aln_idt < 100-permitted_error_pct:
continue
q_strand, q_start, q_end, q_len = ( int(x) for x in l[4:8])
t_strand, t_start, t_end, t_len = ( int(x) for x in l[8:12])
if q_len - (q_end - q_start) < containment_tolerance:
contained_reads.add(q_name)
if t_len - (t_end - t_start) < containment_tolerance:
contained_reads.add(t_name)
if q_name not in overlap_data:
overlap_data[ q_name ] = read_node(q_name, q_len)
assert q_strand == 0
if t_name not in [x[0] for x in overlap_data[ q_name ].hits]:
overlap_data[ q_name ].add_hit( (t_name,
aln_score,
aln_idt,
(q_strand, q_start, q_end, q_len),
(t_strand, t_start, t_end, t_len) ) )
#symmetrized the alignment record
if t_name not in overlap_data:
overlap_data[ t_name ] = read_node(t_name, t_len)
if q_name not in [x[0] for x in overlap_data[ t_name ].hits]:
if t_strand == 1:
t_start, t_end = t_len - t_end, t_len - t_start
q_start, q_end = q_len - q_end, q_len - t_start
overlap_data[ t_name ].add_hit( (q_name,
aln_score,
aln_idt,
(0, t_start, t_end, t_len),
(1, q_start, q_end, q_len) ) )
else:
overlap_data[ t_name ].add_hit( (q_name,
aln_score,
aln_idt,
(0, t_start, t_end, t_len),
(0, q_start, q_end, q_len) ) )
return overlap_data, contained_reads
# <codecell>
overlap_data, contained_reads = get_overlap_data("pr_pr_bn32.m4")
# <markdowncell>
# A simple function to convert the overlapping data to a GML file
# ---------------------------------------------------------------
#
# The ``generate_overlap_gml`` takes the ``overlap_data`` and the ``contained_reads`` to generate a overlap graph in ``GML`` format. (You will need to install ``networkx`` in your ``Python`` environment to use this code.)
# <codecell>
def generate_overlap_gml(overlap_data, contained_reads, gml_filename):
containment_tolerance = 50
permitted_error_pct = 2
import networkx as nx
G=nx.DiGraph()
node_in_graph = set()
for q_name in overlap_data:
if q_name in contained_reads:
continue
if q_name not in node_in_graph:
G.add_node(q_name)
targets = overlap_data[ q_name ].hits
targets_3prime = [ h for h in targets if h[4][1] < containment_tolerance and h[0] not in contained_reads]
targets_5prime = [ h for h in targets if h[3][1] < containment_tolerance and h[0] not in contained_reads]
targets_3prime.sort(key = lambda k:k[1])
targets_5prime.sort(key = lambda k:k[1])
if len(targets_3prime) > 0:
t = targets_3prime[0]
t_name = t[0]
if t_name not in node_in_graph:
G.add_node(t_name)
G.add_edge(q_name, t_name)
if len(targets_5prime) > 0:
t = targets_5prime[0]
t_name = t[0]
if t_name not in node_in_graph:
G.add_node(t_name)
G.add_edge(q_name, t_name)
nx.write_gml(G, gml_filename)
# <codecell>
generate_overlap_gml(overlap_data, contained_reads, "overlap_graph.gml")
# <markdowncell>
# Overlapping graph visualized by Gephi
# -------------------------------------
#
# Ther generated overlap_graph.gml can be visualized by Gephi ( https://gephi.org/
# ). One can use "YifanHu' Multilevel" graph layout algorithm to do the rough
# layout followed by "ForceAtlas 2" algorithm to smooth the graph. The cell below shows the
# global overlapping structure. One can see that it is topologically a circle plus
# a number of singletons. Namely, the read data does capture the whole genome in
# one single circular layout.
# <codecell>
from IPython.display import Image
Image(filename = "overlap_graph01.png")
# <markdowncell>
# We can also see some local "bubbles" in the graph.
# <codecell>
Image(filename = "overlap_graph02.png")
# <codecell>
Image(filename = "overlap_graph03.png")
# <markdowncell>
# Some utility constants, functions and a class for viewing the local overlaping with ``SVG`` files
# --------------------------------------------------------------------------------------------------
#
# The cell below defines some contants and utility funtions so we can generate ``SVG`` to show the local overlapping alignments.
# <codecell>
arrow_defs = """<defs>
<marker id="Triangle_green" stroke="green" fill="green"
viewBox="0 0 10 10" refX="0" refY="5"
markerUnits="strokeWidth"
markerWidth="4" markerHeight="3"
orient="auto">
<path d="M 0 0 L 10 5 L 0 10 z" />
</marker>
<marker id="Triangle_blue" stroke="blue" fill="blue"
viewBox="0 0 10 10" refX="0" refY="5"
markerUnits="strokeWidth"
markerWidth="4" markerHeight="3"
orient="auto">
<path d="M 0 0 L 10 5 L 0 10 z" />
</marker>
<marker id="Triangle_black" stroke="black" fill="black"
viewBox="0 0 10 10" refX="0" refY="5"
markerUnits="strokeWidth"
markerWidth="4" markerHeight="3"
orient="auto">
<path d="M 0 0 L 10 5 L 0 10 z" />
</marker>
<marker id="Triangle_red" stroke="red" fill="red"
viewBox="0 0 10 10" refX="0" refY="5"
markerUnits="strokeWidth"
markerWidth="4" markerHeight="3"
orient="auto">
<path d="M 0 0 L 10 5 L 0 10 z" />
</marker>
</defs>"""
def svg_arrow( x1, y1, x2, y2, col, w):
return """<g stroke = "%s" stroke-width = "%f" fill = "none">\
<path d = "M %f %f L %f %f" marker-end="url(#Triangle_%s)"/> </g>""" % (col, w, x1, y1, x2, y2, col)
def svg_line( x1, y1, x2, y2, col, w):
return """<g stroke = "%s" stroke-width = "%f" fill = "none">\
<path d = "M %f %f L %f %f"/> </g>""" % (col, w, x1, y1, x2, y2)
class Overlap_SVG_view:
def __init__(self):
self.header = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns:svg="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" version="1.1" width="900px" height="300px">
"""
self.footer = """</svg>"""
self.elements = []
self._svg_data = None
def add_element(self, elm):
self.elements.append(elm)
def _construct_SVG(self):
SVG_str = []
SVG_str.append( self.header )
for elm in self.elements:
SVG_str.append( elm )
SVG_str.append( self.footer )
self._svg_data = "\n".join(SVG_str)
self._svg_data = self._svg_data.decode('utf-8')
def output(self, filename):
if self._svg_data is None:
self._construct_SVG()
with open(filename, "w") as out_f:
print >> out_f, self._svg_data
def _repr_svg_(self):
if self._svg_data is None:
self._construct_SVG()
return self._svg_data
@property
def svg(self):
return SVG(self._repr_svg_())
# <markdowncell>
# ``generate_overlap_view``: a function taking the overlapping data to generate a ``SVG`` file
# ---------------------------------------------------------------------------------------------
#
# The ``generate_overlap_view`` simple does several coordinate transformations to create a ``SVG`` file so we can see the local overlap around a given DNA fragment. The blank arrow indicates the fragment in interesting. Green and blue arrows indicate the forward and reversed proper overlapped alignments. The red arrows indicate containment alignments or improper alignments (locally inconsistent alignments due to repeats, mis-mapping or sequencing artifacts).
# <codecell>
from IPython.display import SVG
import os
def generate_overlap_view(overlap_data, q_name):
containment_tolerance = 10
hits = overlap_data[q_name].hits
SV = Overlap_SVG_view()
SV.add_element(arrow_defs)
x_offset = 10000
x_scaling = 0.02
x1 = 0
x2 = overlap_data[q_name].length
x1 += x_offset
x2 += x_offset
x1 *= x_scaling
x2 *= x_scaling
y = 200
SV.add_element(svg_arrow(x1, y, x2, y, "black", 4))
hits.sort(key = lambda k: (k[3][1],k[3][3],k[3][2]))
for h in hits:
t_name, t_score, t_idt = h[0], h[1], h[2]
if t_name in contained_reads:
continue
q_strand, q_start, q_end, q_len = h[3]
t_strand, t_start, t_end, t_len = h[4]
y -=10
x1 = q_start
x2 = q_end
col = None
if q_start < containment_tolerance and t_len - t_end < containment_tolerance:
x1 = -t_start
elif q_len - q_end < containment_tolerance and t_start < containment_tolerance:
x2 = q_end + (t_len - t_end)
else:
col = "red"
if t_strand == 1:
x1, x2 = x2, x1
print h
if t_strand == 0 and col == None:
col = "green"
elif col == None:
col = "blue"
x1, x2 = x2, x1
#print h
#print min(x1, x2), max(x1, x2), abs(x1-x2)
x1 += x_offset
x2 += x_offset
x1 *= x_scaling
x2 *= x_scaling
SV.add_element(svg_arrow(x1, y, x2, y, col, 3))
return SV
q_name = overlap_data.keys()[0]
OSV = generate_overlap_view(overlap_data, q_name)
OSV.output("test.svg")
#display(OSV.svg)
# <markdowncell>
# Generate the "best overlap graph"
# ---------------------------------
#
# Given that we see a beautiful circle overlapping graph, we like to see if can choose a seeding frgament and extend the overlapping to form a contigs.
# The code below scans through the ``overlap_data`` and assign the best score alignments for both 3' and 5' of a read as the best partner. The return of the ``get_best_overlap_graph`` is a dictionary where the key is a fragment identifier and the value is the alignment information of the 3' and 5' best overlapping alignments.
# <codecell>
def filter_overlap_hit(overlap_hit):
containment_tolerance = 50
q_strand, q_start, q_end, q_len = overlap_hit[3]
t_strand, t_start, t_end, t_len = overlap_hit[4]
if q_len - (q_end - q_start) < containment_tolerance and t_len - (t_end - t_start) / t_len < containment_tolerance:
return False
elif overlap_hit[0] in contained_reads:
return False
else:
return True
def get_best_overlap_graph(overlap_data):
best_overlap_graph = {}
containment_tolerance = 50
for q_name in overlap_data:
if q_name in contained_reads:
continue
if q_name not in best_overlap_graph:
best_overlap_graph[q_name] = {"5p":None, "3p":None}
targets = overlap_data[ q_name ].hits
targets_5prime = [ h for h in targets if h[3][1] < containment_tolerance and filter_overlap_hit(h)]
targets_3prime = [ h for h in targets if h[4][1] < containment_tolerance and filter_overlap_hit(h)]
targets_5prime.sort(key = lambda k:k[1])
targets_3prime.sort(key = lambda k:k[1])
if len(targets_5prime) > 0:
best_overlap_graph[q_name]["5p"] = targets_5prime[0]
t_name = targets_5prime[0][0]
if len(targets_3prime) > 0:
best_overlap_graph[q_name]["3p"] = targets_3prime[0]
t_name = targets_3prime[0][0]
return best_overlap_graph
# <markdowncell>
# Check the output of ``get_best_overlap_graph``
# ----------------------------------------------
# The cell below find the best alignment from the read of the first key in the ``best_overlap_graph``. It shows the best 3' and 5' partners.
# <codecell>
best_overlap_graph = get_best_overlap_graph(overlap_data)
q_name = best_overlap_graph.keys()[0]
print q_name, "has best partners:", best_overlap_graph[q_name]
# <markdowncell>
# Find a best overlapping path from a seeding fragment
# ----------------------------------------------------
#
# The two functions defined below allows us to extend the overlapping through the chain of overlapped fragments. The extension is terminated when (1) some inconsistent overlapping is detected (2) the overlapped fragment is already used by other contigs (3) no overlapped fragment is found and (4) a circle is detected in the path. Such rules may not be the best way to generate unitigs but they are very simple to implement. At least, these rules do work well for this E. coli data set. One can image some more complicated scenarios that it will be necessary to refine these rules or add more rules to get the best results and avoid mis-assemblies.
# <codecell>
def proper_overlap_hit(overlap_hit):
containment_tolerance = 50
q_strand, q_start, q_end, q_len = overlap_hit[3]
t_strand, t_start, t_end, t_len = overlap_hit[4]
if q_start < containment_tolerance:
if t_len - t_end > containment_tolerance:
return False
else:
return True
if t_start < containment_tolerance:
if q_len - q_end > containment_tolerance:
return False
else:
return True
def find_path( q_name_end, frag_used = set() ):
reverse_end = {"3p":"5p", "5p":"3p"}
path = []
path_q_name = set()
q_name, end = q_name_end
if end == "5p":
reversing_path = True
else:
reversing_path = False
while 1:
if q_name in frag_used:
if reversing_path:
path.reverse()
return path, "frag_used"
path.append( (q_name, end) )
path_q_name.add(q_name)
if q_name not in best_overlap_graph:
if reversing_path:
path.reverse()
return path, "terminate_1"
next_hit = best_overlap_graph[q_name][end]
#print next_hit
if next_hit == None:
if reversing_path:
path.reverse()
return path, "terminate_2"
if next_hit[0] in best_overlap_graph: #if not mutual good hits, break the path
# Using mutual best hit might be to strigent,
# bh = []
#if best_overlap_graph[next_hit[0]]["5p"]:
# bh.append( best_overlap_graph[next_hit[0]]["5p"][0] )
#if best_overlap_graph[next_hit[0]]["3p"]:
# bh.append( best_overlap_graph[next_hit[0]]["3p"][0] )
bh = [h[0] for h in overlap_data[next_hit[0]].hits if proper_overlap_hit(h)]
if q_name not in bh:
if reversing_path:
path.reverse()
return path, "branch"
q_name = next_hit[0]
if q_name in path_q_name:
if reversing_path:
path.reverse()
return path, "circle"
if next_hit[4][0] == 1: #reversed strand
end = reverse_end[end]
# <markdowncell>
# Examples of the best overlapping paths
# --------------------------------------
#
# Here we show how to call the ``find_path`` function to get the overlapping path for both 5' and 3' from a seed fragment. We pick the longest fragment in the data set as the seed. One will notice that the total fragments in from both 5'-parh and 3'-path are greater than the total number of fragments in the ``best_overlap_graph``. It mean that there are significant overlapping between the 5'-path and 3'-path. This is consistent with the genome is circular.
# <codecell>
len_qname = [ (overlap_data[x].length, x) for x in best_overlap_graph.keys() ]
len_qname.sort()
q_name = len_qname[-1][1]
#q_name = "0xa4919d66_0061381"
print "The longest fragment is", q_name
path_5p, s_5p = find_path( (q_name, "5p") )
path_3p, s_3p = find_path( (q_name, "3p") )
print "The number of the fragment of the 5' path is", len(path_5p)
print "The number of the fragment of the 3' path is", len(path_3p)
print "The total number of framgent of both 3' and 5' path is %d." % (len(path_3p)+len(path_5p)-1)
print "%d is greater than the total number of fragments %d in the best_overlap_graph." % (len(path_3p)+len(path_5p)-1, len(best_overlap_graph))
print "The begin of the 5'-path is", path_5p[0]
print "The end of the 3'-path is", path_3p[-1]
# <markdowncell>
# Build a squence database
# ------------------------
# Since we get the paths that seems covering the whole genome, we can use the paths to construct draft contigs. In order to do that, we need to get the fragment sequences.
# <codecell>
seq_db = {}
with open("pre_assembled_reads.fa") as seq_file:
for l in seq_file:
l = l.strip()
if l[0] == ">":
name = l[1:]
continue
else:
seq_db[name] = l
# <markdowncell>
# Build a "draft" contig using the best overlapping paths
# -------------------------------------------------------
#
# The ``layout_path`` function goes through a layout path and fill in the base from the fragment sequences. Nothing is really special here. However, it is alwaye tricky to track the right orientation of the fragements. The function also tracks the fragments used in the layout. Each fragement can only be assigned to one layout/contig.
# <codecell>
def rc_seq(seq):
rev_map = dict(zip("acgtACGTNn-","tgcaTGCANn-"))
return "".join([rev_map[c] for c in seq[::-1]])
def layout_path(full_path, frag_used, out_fn, tig_name):
if len(full_path) == 0 or full_path[0][0] in frag_used:
return None
if len(full_path) == 1:
with open("singleton_"+out_fn, "w") as out_seq_file:
seq = seq_db[full_path[0][0]]
print >>out_seq_file, ">%s" % ( "singleton_" + tig_name )
print >>out_seq_file, seq
frag_used.add(full_path[0][0])
return None
first = full_path[0]
offset = 0
current_orientation = first[1]
revserse_orientation = {"5p":"3p", "3p":"5p"}
tig_seq = []
frag_in_layout = set()
frag_in_layout.add(first[0])
for second in full_path[1:]:
overlap_hit = overlap_data[first[0]][second[0]]
q_strand, q_start, q_end, q_len = overlap_hit[3]
t_strand, t_start, t_end, t_len = overlap_hit[4]
#print overlap_hit, offset, current_orientation
seq = seq_db[first[0]]
if current_orientation == "3p":
seq = rc_seq(seq)
del tig_seq[offset:-1]
tig_seq.extend(seq)
if current_orientation == "5p":
offset += q_start
elif current_orientation == "3p":
offset += q_len - q_end
if t_strand == 1:
current_orientation = revserse_orientation[current_orientation]
frag_in_layout.add(first[0])
first = second
if second[0] in frag_in_layout:
break
if second[0] in frag_used:
break
seq = seq_db[first[0]]
if current_orientation == "3p":
seq = rc_seq(seq)
del tig_seq[offset:-1]
tig_seq.extend(seq)
frag_in_layout.add(first[0])
frag_used.update( frag_in_layout )
tig_seq = tig_seq[0:offset+len(seq)]
tig_seq = "".join(tig_seq)
with open(out_fn, "w") as out_seq_file:
print >>out_seq_file, ">%s" % tig_name
print >>out_seq_file, tig_seq
# <markdowncell>
# Loop through all fragments to generate all contigs and singletons
# -----------------------------------------------------------------
#
# Ok, ready for the primer time!!
#
# The code below will pick the longest fragment as the first seed and generate a contig from its overlapping path. Then, we will pick another longest one from those reads that have not been used in a contig. This process repeats until no more fragement is avaiable.
#
# In the data set, we generate one big contig (4650011 bp) and 49 singleton reads.
# <codecell>
frag_used = set()
all_best_overlap_frags = set(best_overlap_graph.keys())
unused_frag = all_best_overlap_frags - frag_used
i = 0
while len(unused_frag) != 0:
len_qname = [ (overlap_data[x].length, x) for x in unused_frag ]
len_qname.sort()
q_name = len_qname[-1][1]
print "iteration: ",i
print "frag is used?", q_name in frag_used
if q_name in frag_used:
continue
path_5p, s_5p = find_path( (q_name, "5p"), frag_used )
path_3p, s_3p = find_path( (q_name, "3p"), frag_used )
print "seeding frag:", q_name, "Length:", overlap_data[q_name].length
print "number of unused frag:", len(unused_frag), "total overlapped frag:", len(path_3p)+len(path_5p)
print len(path_5p), s_5p, len(path_3p), s_3p
print "--------------------------"
#print path_5p[0], path_3p[-1]
if len(path_5p) + len(path_3p) == 0:
out_fn = "tig_%05d.fa" % i
tig_name ="tig%05d" % i
with open("singleton_"+out_fn, "w") as out_seq_file:
seq = seq_db[q_name]
print >>out_seq_file, ">%s" % ( "singleton_" + tig_name )
print >>out_seq_file, seq
frag_used.add(q_name)
unused_frag = all_best_overlap_frags - frag_used
i += 1
continue
if len(path_5p) > 0:
#assert path_5p[-1] == path_3p[0]
full_path = path_5p + path_3p[1:]
else:
full_path = path_3p
layout_path(full_path, frag_used, "tig_%05d.fa" % i, "tig%05d" % i)
unused_frag = all_best_overlap_frags - frag_used
i += 1
# <markdowncell>
# What are the singletons?
# ================================
#
# One can align the contigs to the canoincal E. coli K12 reference to evaluate the quality of the assembly. One can see from the alignment data below, it seems that some fragments have lower accuracy that does not pass the 98% identity threshold to be called "overlapped" with other fragments. Three of them does not have full alginments. In this case, one will need to examine the raw reads that corresponds to these three fragments to see whether they are minor variants or some lower quality reads are not filtered.
# <codecell>
#!cat singleton_tig_000* > singletons.fa
#!blasr singletons.fa ecoli_k12_MG1655.fasta -m 4 -noSplitSubreads -nCandidates 20 -bestn 1 | awk '{print $1" "$2" "$4" "$6" "$7" "$8" "$8-($7-$6)}'
# <markdowncell>
# How good is the main contig?
# ============================
#
# The main contig is 4650011bp. It seems that it should cover the whole genome. We can see the whole genome alignment using ``gepard`` ( http://www.helmholtz-muenchen.de/en/mips/services/analysis-tools/gepard/index.html ). We can see there is no larger scale mis-assembly. Promising!!
# (The strand of the K12 for this data is actuall slight different from the canonical one. We do expect to see some small structure variations.)
# <codecell>
Image(filename="tig_align.jpeg")
# <markdowncell>
# Accuracy?
# ---------
# We don't expect to get high accuracy from the "draft" assembly. The way we construct the contig is just to copy-and-paste from the pre-assembled reads which have mean identity 99.789% to the reference genome. For a 4.7Mb genome, we will expect 4700000 * (1-0.99787) ~ 10k differences. One can do another round of consensus using the pre-assembled reads. Or, we can align the raw reads on top of the draft contig and apply the "Quiver" algorithm to get the best accuracy from all raw data. Let's see how "Quiver" helps to remove the errors in the draft contig.
# <markdowncell>
# First, let's evalute the differece between the draft contigs to the canonical reference with the ``dnadiff`` from Mummer3 package ( http://mummer.sourceforge.net/ )
# <codecell>
#!dnadiff ecoli_k12_MG1655.fasta tig_00000.fa -p diff_tig0
#!echo the number of SNPs is `cat diff_tig0.snps | wc | awk '{print $1}'`
# <markdowncell>
# The draft contig has about 10k SNPs as what we expect from the estimation earlier. We apply ``Quvier`` consensus algorithm on the the draft contig using the following script::
#
# #!/bin/bash
# export SEYMOUR_HOME=/PathTo/PacBio/SMRTAnalysis
# . $SEYMOUR_HOME/etc/setup.sh
# cd /Path/To/WorkingDirectory
# cp tig_00000.fa asm.ctg.fasta
# referenceUploader -c -p $PWD -n assembly -f asm.ctg.fasta --skipIndexUpdate
#
# compareSequences.py --info --useGuidedAlign --algorithm=blasr --nproc=24 --noXML --h5mode=w --h5fn=out.cmp.h5 --minAccuracy=0.70 --minLength=200 \
# -x -nCandidates 50 -x -minMatch 12 -x -bestn 1 -x -minPctIdentity 70.0 /Path/To/WorkingDirectory/input.fofn assembly/
#
# loadPulses /Path/To/WorkingDirectory/input.fofn out.cmp.h5\
# -metrics DeletionQV,IPD,InsertionQV,PulseWidth,QualityValue,MergeQV,SubstitutionQV,DeletionTag -byread \
#
# cmph5tools.py sort out.cmp.h5 --tmp /scratch
#
# variantCaller.py --algorithm quiver -j 16 --referenceFilename assembly/sequence/assembly.fasta \
# --parameters best -o output.gff -o output.fasta -o output.fastq -q 0 -X 80 -x 5 --mapQvThreshold 0 out.cmp.h5
#
# This will generate the consensus sequence using Quiver. The consensus output is in ``output.fasta``. We find 26 SNPs in the final consensus results. This gives a lower bound of the final assembly phred QV about -10*log10(26/4639675) = 52.5. One can see that the ``Quiver`` algorithm which utilizes all important information from the raw trace signal is able to correct the ~10k errors in the
# draft assembly. This allows us more freedom while constructing the initial draft assembly. We do not need to construct perfect draft assembly from perfect reads. The goal for constructing the draft is to get the contiguity right. Given the long reads, one can expect that most mapping is unambiguous and that will lead to almost perfect consensus using ``Quiver``.
# <codecell>
#dnadiff ecoli_k12_MG1655.fasta output.fasta -p diff_quiver_tig0
#!echo the number of SNPs is `cat diff_quiver_tig0.snps | wc | awk '{print $1}'`
from math import log
print "The phred scale QV of the assembly is about %0.1f" % (-10*log(26.0/4639675)/log(10))
print "The concordnace is about %0.5f%%" % (100*(1-26.0/4639675))
# <markdowncell>
# Final Thought
# =============
#
# I had some fun with this exercise. I think this approach is general but it will
# need more tests for sure. We only have n=1 successful story for now. However,
# the reality is when we can generate longer and longer reads, the assembly
# problem will become easier and easier. DNA sequencing is about connecting
# different bases across distance. Some problems can be only solved by enough
# read length spans. I do like to see more fundamental theories published about
# genome assembly and haplotyping (e.g. the classical Lander-Waterman paper and
# the recent careful analysis done by Davie Tse's group
# http://arxiv.org/abs/1301.0068 ). Getting good consensus from long reads is
# probably a relative simpler problem than dealing with combinatorial explosion
# when one tries to infer long range information using short reads. Taking
# bacteria assembly as an example, it is likely impossible to assemble the E.
# coli K12 genome to single contig with single-end short read library. Without
# any long pair-end or jumping libraries, the single-end short read data simply
# does not the power to resolve various scales of repeat structure in the genome.
# Such constraints can not be solved by simply sequencing more (see Tse's argument).
# Actually, regardless how cheap DNA sequencing is, without proper understanding
# the fundamental theories and mathematical constraints, one could just waste
# money collecting tons of useless data. Cheap data still costs money. Cheap data
# that can not solve problems is actually expansive. One can sequence E coli. K12 genome
# to 10000x coverage with short read technologies cheaply. However, without
# proper libraries to get long range information, 10000x coverage from short fragments
# will still lead to fragmented assemblies. One the other hand, one can see from
# the example in this ``IPython`` Notebook. Once we can generate good quality long reads,
# even an assembly newbie like me can put together a prototype of an assembler with python
# in a few days. By the way, it only takes 20 seconds on my 2012 MacBook Air (Intel i7) to
# get the draft assembly from the ``m4`` file.
# <codecell>
#!time python Write_An_Assembler.py > /dev/null
# <codecell>