OSD assemblies

The OSD analysis group provides two different kind of assemblies.

One individual assembly per OSD metagenome
and one combined assembly with all OSD metagenomes.

Combined assemblies

For the combined assemblies we used the [OSD non-merged workable dataset] (http://mb3is.megx.net/osd-files?path=/2014/datasets/workable/metagenomes/non-merged) for all OSD samples. We combined all reads (paired-ends and single-ends) and ran three different assemblers IDBA-UD, RAY and MEGAHIT. After analyzing the results we chose the IDBA-UD assembly.

TODO: Add results page

IDBA-UD command used:

idba_ud --pre-correction --mink 21 --maxk 121 --step 22 --num_threads 64 -r OSD2014-combined_workable.I.fasta -l OSD2014-combined_workable.SR.fasta -o OSD2014-combined-assembly

After the assembly is done, we performed the following steps with the contigs > 500bp:

Map paired and single end reads using BWA-MEM short read aligner.

bwa index OSD2014-combined-assembly-contigs.fa
bwa mem -M -t 16 OSD2014-combined-assembly-contigs.fa <(zcat OSD2014-combined_workable.SR.fastq.gz) > OSD2014-combined-assembly.se.sam
bwa mem -M -t 16 OSD-combined-assembly-contigs.fa <(zcat OSD2014-combined_workable.R1.fastq.gz) <(zcat OSD2014-combined_workable.R2.fastq.gz) > OSD2014-combined-assembly.se.sam

Index contigs

samtools faidx OSD2014-combined-assembly-contigs.fa

Convert paired and single end SAM files to BAM and sort. We only keep mapped reads (-F 4) and with a mapquality of 10 (-q 10). Note: We use a fork from samtools to speed up the sorting.

samtools view -@ 16 -q 10 -F 4 -bt OSD2014-combined-assembly-contigs.fa.fai OSD2014-combined-assembly.se.sam | samtools rocksort -@ 16 -m 3 - OSD2014-combined-assembly.se
samtools view -@ 16 -q 10 -F 4 -bt OSD2014-combined-assembly-contigs.fa.fai OSD2014-combined-assembly.pe.sam |  samtools rocksort -@ 16 -m 3 - OSD2014-combined-assembly.pe

Merge BAM files. Note: Recommend to use samtools 0.1.19 or 1.1 for merging, newer versions of samtools has problems with files with a large amount of contigs.

samtools merge -f -@ 16 OSD2014-combined-assembly.bam OSD2014-combined-assembly.pe.bam OSD2014-combined-assembly.se.bam

Sort and index the merged BAM file

samtools rocksort -@ 16 -m 3 OSD2014-combined-assembly.bam OSD2014-combined-assembly.sorted
samtools index OSD2014-combined-assembly.sorted.bam

Mark duplicates with Picard tools.

JAVA_OPT="-Xms2g -Xmx32g -XX:ParallelGCThreads=4 -XX:MaxPermSize=2g -XX:+CMSClassUnloadingEnabled"
java ${JAVA_OPT} \
   -jar MarkDuplicates \
   INPUT=OSD2014-combined-assembly.sorted.bam \
   OUTPUT=OSD2014-combined-assembly.sorted.markdup.bam \ 
   METRICS_FILE=OSD2014-combined-assembly.sorted.markdup.metrics \ 
   AS=TRUE \
   VALIDATION_STRINGENCY=LENIENT \
   MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
   REMOVE_DUPLICATES=TRUE

Sort, index and get some statistics from the resulting BAM file

samtools rocksort -@ 16 -m 3 OSD2014-combined-assembly.sorted.markdup.bam OSD2014-combined-assembly.sorted.markdup.sorted
samtools index OSD2014-combined-assembly.sorted.markdup.sorted.bam
samtools flagstat OSD2014-combined-assembly.sorted.markdup.sorted.bam > OSD2014-combined-assembly.sorted.markdup.sorted.flagstat

Get mean coverage per contig with Bedtools and a script from Metassemble

genomeCoverageBed -ibam OSD2014-combined-assembly.sorted.markdup.sorted.bam > OSD2014-combined-assembly.sorted.markdup.sorted.coverage
python gen_contig_cov_per_bam_table.py --isbedfiles OSD2014-combined-assembly.fa OSD2014-combined-assembly.sorted.markdup.sorted.coverage > OSD2014-combined-assembly.sorted.markdup.sorted.coverage.percontig

Perform gene prediction with Prodigal

prodigal -q -i OSD2014-combined-assembly.fa -a OSD2014-combined-assembly.aa.fasta -d OSD2014-combined-assembly.nt.fasta -p meta -o OSD2014-combined-assembly.gff -f gff

Convert gff output from Prodigal to bed format with Bedops

gff2bed < OSD2014-combined-assembly.gff | sort --parallel=8 -k1,1 -k2,2n > OSD2014-combined-assembly.bed

Convert BAM file to bed format with Bedtools

bamToBed -i OSD2014-combined-assembly.sorted.markdup.sorted.bam | sort --parallel=8 -k1,1 -k2,2n > OSD2014-combined-assembly.sorted.markdup.sorted.bed

Count gene coverage using Bedtools

bedtools coverage -hist -sorted  -b OSD2014-combined-assembly.sorted.markdup.sorted.bed -a OSD2014-combined-assembly.bed > OSD2014-combined-assembly.coverage.hist

Calculate the mean coverage of each ORF with the following awk script

mawk -f file-get_orf_coverage_from_bam-coassembly.awk OSD2014-combined-assembly.coverage.hist > OSD2014-combined-assembly.orf-coverage.txt

NOTE: We filtered out all contigs with a mean coverage < 2.

Files available for download:

All results: Link
Contigs: Link
ORFs nucleotide: Link
ORFs aminoacid: Link
ORF mean coverage: Link

Individual assemblies

For the individual assemblies we used the [OSD non-merged workable dataset] (http://mb3is.megx.net/osd-files?path=/2014/datasets/workable/metagenomes/non-merged) for all OSD samples. We ran three different assemblers IDBA-UD, RAY and SPADES. After analyzing the results we chose the SPADES assembly.

TODO: Add results page

SPADES command used:

spades.py -1 OSD1_R1_shotgun_workable.fastq.gz -2 OSD1_R2_shotgun_workable.fastq.gz -s OSD1_SR_shotgun_workable.fastq.gz -t 16 -o OSD1-spades --careful -m 100

After the assembly is done we performed the same steps as described for the combined assembly.

Files per sample available for download:

All results: Link

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OSD assemblies

Combined assemblies

Individual assemblies

Clone this wiki locally