OSD assemblies
The OSD analysis group provides two kinds of assemblies:
- one individual assembly per OSD metagenome, and
- one combined assembly of all OSD metagenomes.
For the combined assembly we used the [OSD non-merged workable dataset](http://mb3is.megx.net/osd-files?path=/2014/datasets/workable/metagenomes/non-merged) for all OSD samples. We combined all reads (paired-end and single-end) and ran three different assemblers: IDBA-UD, Ray and MEGAHIT. After analyzing the results, we chose the IDBA-UD assembly.
TODO: Add results page
IDBA-UD command used:
idba_ud --pre_correction --mink 21 --maxk 121 --step 22 --num_threads 64 -r OSD2014-combined_workable.I.fasta -l OSD2014-combined_workable.SR.fasta -o OSD2014-combined-assembly
After the assembly finished, we performed the following steps on the contigs longer than 500 bp:
- Map paired-end and single-end reads using the BWA-MEM short-read aligner.
bwa index OSD2014-combined-assembly-contigs.fa
bwa mem -M -t 16 OSD2014-combined-assembly-contigs.fa <(zcat OSD2014-combined_workable.SR.fastq.gz) > OSD2014-combined-assembly.se.sam
bwa mem -M -t 16 OSD2014-combined-assembly-contigs.fa <(zcat OSD2014-combined_workable.R1.fastq.gz) <(zcat OSD2014-combined_workable.R2.fastq.gz) > OSD2014-combined-assembly.pe.sam
- Index contigs
samtools faidx OSD2014-combined-assembly-contigs.fa
- Convert the paired-end and single-end SAM files to BAM and sort them. We keep only mapped reads (-F 4) with a mapping quality of at least 10 (-q 10). Note: we use a samtools fork that provides the rocksort command to speed up the sorting.
samtools view -@ 16 -q 10 -F 4 -bt OSD2014-combined-assembly-contigs.fa.fai OSD2014-combined-assembly.se.sam | samtools rocksort -@ 16 -m 3 - OSD2014-combined-assembly.se
samtools view -@ 16 -q 10 -F 4 -bt OSD2014-combined-assembly-contigs.fa.fai OSD2014-combined-assembly.pe.sam | samtools rocksort -@ 16 -m 3 - OSD2014-combined-assembly.pe
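To make the two filters concrete, the sketch below reproduces their selection logic in plain awk on SAM text: `-F 4` drops records whose FLAG has the 0x4 (unmapped) bit set, and `-q 10` drops records whose MAPQ is below 10. The function name `filter_sam` and the three demo records are made up for illustration; this is not a replacement for `samtools view`.

```shell
#!/usr/bin/env bash
# filter_sam: illustrative stand-in for the selection done by `samtools view -F 4 -q 10`.
# Reads SAM text on stdin, writes header lines and passing records to stdout.
filter_sam() {
    awk -F'\t' -v min_mapq=10 '
        /^@/ { print; next }                    # header lines pass through
        {
            unmapped = int($2 / 4) % 2          # bit 0x4 of FLAG, computed without bitwise operators
            if (!unmapped && $5 + 0 >= min_mapq + 0) print
        }'
}

# Demo with three made-up records: mapped with MAPQ 60, unmapped, mapped with MAPQ 3.
printf 'r1\t0\tc1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\nr3\t16\tc1\t5\t3\t4M\t*\t0\t0\tACGT\tIIII\n' | filter_sam
```

Only the first demo record passes both filters; the second is unmapped and the third has a mapping quality below the threshold.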
- Merge the BAM files. Note: we recommend samtools 0.1.19 or 1.1 for merging; newer versions of samtools have problems with files that contain a large number of contigs.
samtools merge -f -@ 16 OSD2014-combined-assembly.bam OSD2014-combined-assembly.pe.bam OSD2014-combined-assembly.se.bam
- Sort and index the merged BAM file
samtools rocksort -@ 16 -m 3 OSD2014-combined-assembly.bam OSD2014-combined-assembly.sorted
samtools index OSD2014-combined-assembly.sorted.bam
- Mark duplicates with Picard tools.
JAVA_OPT="-Xms2g -Xmx32g -XX:ParallelGCThreads=4 -XX:MaxPermSize=2g -XX:+CMSClassUnloadingEnabled"
java ${JAVA_OPT} \
-jar MarkDuplicates \
INPUT=OSD2014-combined-assembly.sorted.bam \
OUTPUT=OSD2014-combined-assembly.sorted.markdup.bam \
METRICS_FILE=OSD2014-combined-assembly.sorted.markdup.metrics \
AS=TRUE \
VALIDATION_STRINGENCY=LENIENT \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
REMOVE_DUPLICATES=TRUE
- Sort, index and get some statistics from the resulting BAM file
samtools rocksort -@ 16 -m 3 OSD2014-combined-assembly.sorted.markdup.bam OSD2014-combined-assembly.sorted.markdup.sorted
samtools index OSD2014-combined-assembly.sorted.markdup.sorted.bam
samtools flagstat OSD2014-combined-assembly.sorted.markdup.sorted.bam > OSD2014-combined-assembly.sorted.markdup.sorted.flagstat
- Get mean coverage per contig with Bedtools and a script from Metassemble
genomeCoverageBed -ibam OSD2014-combined-assembly.sorted.markdup.sorted.bam > OSD2014-combined-assembly.sorted.markdup.sorted.coverage
python gen_contig_cov_per_bam_table.py --isbedfiles OSD2014-combined-assembly.fa OSD2014-combined-assembly.sorted.markdup.sorted.coverage > OSD2014-combined-assembly.sorted.markdup.sorted.coverage.percontig
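As a rough illustration of how a per-contig mean can be derived from the histogram that `genomeCoverageBed -ibam` writes (contig, depth, bases at that depth, contig length, fraction), the sketch below sums depth times bases and divides by the contig length. It is not the Metassemble script; the function name `mean_cov_per_contig` and the demo rows are hypothetical.

```shell
#!/usr/bin/env bash
# mean_cov_per_contig: mean depth per contig from a genomeCoverageBed histogram on stdin.
# Input columns: contig, depth, bases at that depth, contig length, fraction.
mean_cov_per_contig() {
    awk -F'\t' '
        $1 == "genome" { next }                 # skip the genome-wide summary rows
        {
            if (!($1 in len)) order[++n] = $1   # remember first-seen order of contigs
            len[$1] = $4
            sum[$1] += $2 * $3                  # depth x number of bases at that depth
        }
        END {
            for (i = 1; i <= n; i++) {
                c = order[i]
                printf "%s\t%.4f\n", c, sum[c] / len[c]
            }
        }'
}

# Demo with made-up rows for two contigs plus one summary row.
printf 'c1\t0\t10\t100\t0.1\nc1\t1\t50\t100\t0.5\nc1\t3\t40\t100\t0.4\nc2\t2\t50\t50\t1\ngenome\t0\t10\t150\t0.0667\n' | mean_cov_per_contig
```

For the demo rows, contig `c1` has (0·10 + 1·50 + 3·40) / 100 = 1.7 and `c2` has (2·50) / 50 = 2.0.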
- Perform gene prediction with Prodigal
prodigal -q -i OSD2014-combined-assembly.fa -a OSD2014-combined-assembly.aa.fasta -d OSD2014-combined-assembly.nt.fasta -p meta -o OSD2014-combined-assembly.gff -f gff
- Convert gff output from Prodigal to bed format with Bedops
gff2bed < OSD2014-combined-assembly.gff | sort --parallel=8 -k1,1 -k2,2n > OSD2014-combined-assembly.bed
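The key detail in this conversion is the coordinate shift: GFF intervals are 1-based and inclusive, while BED intervals are 0-based and half-open, so the start is decremented by one and the end is left unchanged. The sketch below shows that shift in awk under simplifying assumptions: it emits only six BED columns (Bedops `gff2bed` writes more) and takes the name from the `ID=` attribute. The function name `gff_to_bed6` is hypothetical.

```shell
#!/usr/bin/env bash
# gff_to_bed6: simplified GFF -> BED6 conversion, sorted by contig and start.
# Not a replacement for Bedops gff2bed, which also carries over the remaining GFF columns.
gff_to_bed6() {
    awk -F'\t' -v OFS='\t' '
        /^#/ || NF < 9 { next }                 # skip comment lines and malformed rows
        {
            id = "."
            n = split($9, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^ID=/) id = substr(kv[i], 4)
            print $1, $4 - 1, $5, id, $6, $7    # contig, 0-based start, end, name, score, strand
        }' | sort -k1,1 -k2,2n
}

# Demo with two made-up Prodigal-style rows given out of order.
printf 'c1\tProdigal\tCDS\t200\t300\t5.0\t-\t0\tID=1_2;partial=00\nc1\tProdigal\tCDS\t3\t98\t10.5\t+\t0\tID=1_1;partial=00\n' | gff_to_bed6
```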
- Convert BAM file to bed format with Bedtools
bamToBed -i OSD2014-combined-assembly.sorted.markdup.sorted.bam | sort --parallel=8 -k1,1 -k2,2n > OSD2014-combined-assembly.sorted.markdup.sorted.bed
- Count gene coverage using Bedtools
bedtools coverage -hist -sorted -b OSD2014-combined-assembly.sorted.markdup.sorted.bed -a OSD2014-combined-assembly.bed > OSD2014-combined-assembly.coverage.hist
- Calculate the mean coverage of each ORF with the following awk script
mawk -f file-get_orf_coverage_from_bam-coassembly.awk OSD2014-combined-assembly.coverage.hist > OSD2014-combined-assembly.orf-coverage.txt
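The awk script itself is not reproduced here. As a hedged sketch of one way such a calculation can work, the function below reads `bedtools coverage -hist` output, where each ORF line ends with four histogram columns (depth, bases at that depth, ORF length, fraction), and computes the mean depth per ORF as the sum of depth times bases divided by the ORF length. The function name `orf_mean_cov` is hypothetical, and it assumes that the contig name plus the BED name column identifies an ORF uniquely.

```shell
#!/usr/bin/env bash
# orf_mean_cov: mean depth per ORF from `bedtools coverage -hist` output on stdin.
# The last four columns of each feature line are: depth, bases at depth, feature length, fraction.
orf_mean_cov() {
    awk -F'\t' '
        $1 == "all" { next }                    # skip the summary histogram over all features
        {
            key = $1 "\t" $4                    # contig and ORF name (assumed unique together)
            if (!(key in len)) order[++n] = key
            len[key] = $(NF - 1)
            sum[key] += $(NF - 3) * $(NF - 2)   # depth x bases at that depth
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s\t%.4f\n", order[i], sum[order[i]] / len[order[i]]
        }'
}

# Demo with one made-up ORF of length 96: 24 bases at depth 0 and 72 bases at depth 2.
printf 'c1\t2\t98\t1_1\t10.5\t+\t0\t24\t96\t0.25\nc1\t2\t98\t1_1\t10.5\t+\t2\t72\t96\t0.75\nall\t0\t24\t96\t0.25\n' | orf_mean_cov
```

Because the histogram columns are addressed relative to the end of the line, the sketch does not depend on how many columns the gene BED file carries.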
NOTE: We filtered out all contigs with a mean coverage < 2.
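How this filter was applied is not shown above. A minimal sketch, assuming a two-column table of contig name and mean coverage plus a FASTA file of contigs, could look as follows; the function name `filter_fasta_by_cov` and the input layout are assumptions, not the procedure actually used.

```shell
#!/usr/bin/env bash
# filter_fasta_by_cov: keep FASTA records whose contig has mean coverage >= a threshold.
# $1: table with contig name and mean coverage (assumed layout)
# $2: FASTA file of contigs
# $3: minimum mean coverage (defaults to 2)
filter_fasta_by_cov() {
    awk -v min_cov="${3:-2}" '
        NR == FNR {                             # first file: the coverage table
            if ($2 + 0 >= min_cov + 0) keep[$1] = 1
            next
        }
        /^>/ {                                  # FASTA header: decide once per record
            id = substr($1, 2)                  # first word after ">" is the contig name
            printing = (id in keep)
        }
        printing { print }
    ' "$1" "$2"
}
```

A call such as `filter_fasta_by_cov coverage.tsv contigs.fa 2 > contigs.cov2.fa` (all three file names are placeholders) would then write only the contigs at or above the threshold.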
Files available for download:
For the individual assemblies we used the [OSD non-merged workable dataset](http://mb3is.megx.net/osd-files?path=/2014/datasets/workable/metagenomes/non-merged) for all OSD samples. We ran three different assemblers: IDBA-UD, Ray and SPAdes. After analyzing the results, we chose the SPAdes assembly.
TODO: Add results page
SPAdes command used:
spades.py -1 OSD1_R1_shotgun_workable.fastq.gz -2 OSD1_R2_shotgun_workable.fastq.gz -s OSD1_SR_shotgun_workable.fastq.gz -t 16 -o OSD1-spades --careful -m 100
After the assembly finished, we performed the same steps as described for the combined assembly.
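Running the assembler over many samples amounts to substituting the sample label into the command shown for OSD1. The dry-run sketch below only prints the commands so they can be inspected before anything is executed. It assumes every sample follows the same file naming pattern as OSD1; the labels `OSD2` and `OSD3` and the function name `spades_cmd` are placeholders.

```shell
#!/usr/bin/env bash
# spades_cmd: print (do not run) the SPAdes command for one sample label.
# Assumes input files are named <label>_R1/_R2/_SR_shotgun_workable.fastq.gz, as for OSD1.
spades_cmd() {
    s="$1"
    printf 'spades.py -1 %s_R1_shotgun_workable.fastq.gz -2 %s_R2_shotgun_workable.fastq.gz -s %s_SR_shotgun_workable.fastq.gz -t 16 -o %s-spades --careful -m 100\n' "$s" "$s" "$s" "$s"
}

# Dry run over a few labels; OSD2 and OSD3 are placeholders, not a statement about the dataset.
for sample in OSD1 OSD2 OSD3; do
    spades_cmd "$sample"
done
```

Piping the printed lines into a job scheduler or into `sh` would execute them, which keeps the command generation separate from the actual run.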
Files per sample available for download:
- All results: Link