Skip to content
renzo edited this page Dec 4, 2015 · 9 revisions

The OSD analysis group provides two different kind of assemblies.

  1. One individual assembly per OSD metagenome
  2. and one combined assembly with all OSD metagenomes.

Combined assemblies

For the combined assemblies we used the [OSD non-merged workable dataset] (http://mb3is.megx.net/osd-files?path=/2014/datasets/workable/metagenomes/non-merged) for all OSD samples. We combined all reads (paired-ends and single-ends) and ran three different assemblers IDBA-UD, RAY and MEGAHIT. After analyzing the results we chose the IDBA-UD assembly.

TODO: Add results page

IDBA-UD command used:

idba_ud --pre-correction --mink 21 --maxk 121 --step 22 --num_threads 64 -r OSD2014-combined_workable.I.fasta -l OSD2014-combined_workable.SR.fasta -o OSD2014-combined-assembly

After the assembly is done, we performed the following steps with the contigs > 500bp:

  1. Map paired and single end reads using BWA-MEM short read aligner.
bwa index OSD2014-combined-assembly-contigs.fa
bwa mem -M -t 16 OSD2014-combined-assembly-contigs.fa <(zcat OSD2014-combined_workable.SR.fastq.gz) > OSD2014-combined-assembly.se.sam
bwa mem -M -t 16 OSD-combined-assembly-contigs.fa <(zcat OSD2014-combined_workable.R1.fastq.gz) <(zcat OSD2014-combined_workable.R2.fastq.gz) > OSD2014-combined-assembly.se.sam
  1. Index contigs
samtools faidx OSD2014-combined-assembly-contigs.fa
  1. Convert paired and single end SAM files to BAM and sort. We only keep mapped reads (-F 4) and with a mapquality of 10 (-q 10). Note: We use a fork from samtools to speed up the sorting.
samtools view -@ 16 -q 10 -F 4 -bt OSD2014-combined-assembly-contigs.fa.fai OSD2014-combined-assembly.se.sam | samtools rocksort -@ 16 -m 3 - OSD2014-combined-assembly.se
samtools view -@ 16 -q 10 -F 4 -bt OSD2014-combined-assembly-contigs.fa.fai OSD2014-combined-assembly.pe.sam |  samtools rocksort -@ 16 -m 3 - OSD2014-combined-assembly.pe
  1. Merge BAM files. Note: Recommend to use samtools 0.1.19 or 1.1 for merging, newer versions of samtools has problems with files with a large amount of contigs.
samtools merge -f -@ 16 OSD2014-combined-assembly.bam OSD2014-combined-assembly.pe.bam OSD2014-combined-assembly.se.bam
  1. Sort and index the merged BAM file
samtools rocksort -@ 16 -m 3 OSD2014-combined-assembly.bam OSD2014-combined-assembly.sorted
samtools index OSD2014-combined-assembly.sorted.bam
  1. Mark duplicates with Picard tools.
JAVA_OPT="-Xms2g -Xmx32g -XX:ParallelGCThreads=4 -XX:MaxPermSize=2g -XX:+CMSClassUnloadingEnabled"
java ${JAVA_OPT} \
   -jar MarkDuplicates \
   INPUT=OSD2014-combined-assembly.sorted.bam \
   OUTPUT=OSD2014-combined-assembly.sorted.markdup.bam \ 
   METRICS_FILE=OSD2014-combined-assembly.sorted.markdup.metrics \ 
   AS=TRUE \
   VALIDATION_STRINGENCY=LENIENT \
   MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
   REMOVE_DUPLICATES=TRUE
  1. Sort, index and get some statistics from the resulting BAM file
samtools rocksort -@ 16 -m 3 OSD2014-combined-assembly.sorted.markdup.bam OSD2014-combined-assembly.sorted.markdup.sorted
samtools index OSD2014-combined-assembly.sorted.markdup.sorted.bam
samtools flagstat OSD2014-combined-assembly.sorted.markdup.sorted.bam > OSD2014-combined-assembly.sorted.markdup.sorted.flagstat
  1. Get mean coverage per contig with Bedtools and a script from Metassemble
genomeCoverageBed -ibam OSD2014-combined-assembly.sorted.markdup.sorted.bam > OSD2014-combined-assembly.sorted.markdup.sorted.coverage
python gen_contig_cov_per_bam_table.py --isbedfiles OSD2014-combined-assembly.fa OSD2014-combined-assembly.sorted.markdup.sorted.coverage > OSD2014-combined-assembly.sorted.markdup.sorted.coverage.percontig
  1. Perform gene prediction with Prodigal
prodigal -q -i OSD2014-combined-assembly.fa -a OSD2014-combined-assembly.aa.fasta -d OSD2014-combined-assembly.nt.fasta -p meta -o OSD2014-combined-assembly.gff -f gff
  1. Convert gff output from Prodigal to bed format with Bedops
gff2bed < OSD2014-combined-assembly.gff | sort --parallel=8 -k1,1 -k2,2n > OSD2014-combined-assembly.bed
  1. Convert BAM file to bed format with Bedtools
bamToBed -i OSD2014-combined-assembly.sorted.markdup.sorted.bam | sort --parallel=8 -k1,1 -k2,2n > OSD2014-combined-assembly.sorted.markdup.sorted.bed
  1. Count gene coverage using Bedtools
bedtools coverage -hist -sorted  -b OSD2014-combined-assembly.sorted.markdup.sorted.bed -a OSD2014-combined-assembly.bed > OSD2014-combined-assembly.coverage.hist
  1. Calculate the mean coverage of each ORF with the following awk script
mawk -f file-get_orf_coverage_from_bam-coassembly.awk OSD2014-combined-assembly.coverage.hist > OSD2014-combined-assembly.orf-coverage.txt

NOTE: We filtered out all contigs with a mean coverage < 2.

Files available for download:

  • All results: Link
  • Contigs: Link
  • ORFs nucleotide: Link
  • ORFs aminoacid: Link
  • ORF mean coverage: Link

Individual assemblies

For the individual assemblies we used the [OSD non-merged workable dataset] (http://mb3is.megx.net/osd-files?path=/2014/datasets/workable/metagenomes/non-merged) for all OSD samples. We ran three different assemblers IDBA-UD, RAY and SPADES. After analyzing the results we chose the SPADES assembly.

TODO: Add results page

SPADES command used:

spades.py -1 OSD1_R1_shotgun_workable.fastq.gz -2 OSD1_R2_shotgun_workable.fastq.gz -s OSD1_SR_shotgun_workable.fastq.gz -t 16 -o OSD1-spades --careful -m 100

After the assembly is done we performed the same steps as described for the combined assembly.

Files per sample available for download: