Merge pull request #2 from andrewjpage/master

Output scaffolds, cleanup SPAdes output, expand documentation
sanger-pathogens · Jan 18, 2017 · 043b709
2 parents 76537c2 + a2f88c8
commit 043b709
Showing 7 changed files with 148 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -3,16 +3,141 @@ You have a set of samples where you have a known phenotype, and a set of control
 
 [![Build Status](https://travis-ci.org/sanger-pathogens/plasmidtron.svg?branch=master)](https://travis-ci.org/sanger-pathogens/plasmidtron)
 
+#Usage
+```
+usage: plasmidtron [options] output_directory file_of_trait_fastqs file_of_nontrait_fastqs
+
+A tool to assemble parts of a genome responsible for a trait
+
+positional arguments:
+  output_directory      Output directory
+  file_of_trait_fastqs  File of filenames of trait (case) FASTQs
+  file_of_nontrait_fastqs
+                        File of filenames of nontrait (control) FASTQs
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --action {intersection,union}, -a {intersection,union}
+                        Control how the traits kmers are filtered for assembly
+                        [union]
+  --kmer KMER, -k KMER  Kmer to use, depends on read length [81]
+  --min_contig_len MIN_CONTIG_LEN, -l MIN_CONTIG_LEN
+                        Minimum contig length in final assembly [1000]
+  --min_kmers_threshold MIN_KMERS_THRESHOLD, -m MIN_KMERS_THRESHOLD
+                        Exclude k-mers occurring less than this [25]
+  --max_kmers_threshold MAX_KMERS_THRESHOLD, -x MAX_KMERS_THRESHOLD
+                        Exclude k-mers occurring more than this [254]
+  --threads THREADS, -t THREADS
+                        Number of threads [1]
+  --spades_exec SPADES_EXEC, -s SPADES_EXEC
+                        Set the SPAdes executable [spades.py]
+  --verbose, -v         Turn on debugging [0]
+  --version             show program's version number and exit
+```
+
+##Input parameters
+The following parameters change the results:
+
+__action__: There are two fundamental modes of operation. The default is 'union', where kmers which occur in ANY trait sample, but are absent from the nontrait samples, are used to filter the reads. In effect you are assembling the whole accessory genome of the trait samples. This leads to larger final assemblies and more false positives, but captures more of the accessory genome. It tolerates situations where a plasmid varies substantially, with different backbones or payloads. The alternative is 'intersection', where kmers must occur in ALL trait samples and in none of the nontrait samples. This leads to smaller final assemblies and more fragmentation, with fewer false positives. It is less tolerant of variation.
+
+__kmer__: Choosing a kmer size is not an exact science, and it can greatly influence the final results. This kmer size is used by KMC for counting and filtering, and by SPAdes for assembly. Ideally it should be about 60-90% of the read length, and it should be an odd number between 21 and 127 (SPAdes restriction). If you choose a kmer which is too small, you will get a lot more false positives. If you choose a kmer which is too big, you will use a lot more RAM and potentially get too little data returned. Quite often with Illumina data the beginning and end of the reads have higher sequencing error rates. Ideally you want a kmer size which sits nicely inside the good cycles of the read. Trimming with Trimmomatic can help if the quality collapses badly at the end of the read.
+
+__min_contig_len__: This should be at least about twice the mean fragment size (insert size) of your library, to reduce the impact of false positives. For example, if a single kmer occurs randomly in the genome, using it allows reads upstream and downstream, plus their mates, to be assembled. This parameter controls that noise. Setting it too high will lead to valuable information being lost (e.g. small plasmids).
+
+__min_kmers_threshold__: This value lets you set a minimum threshold for the occurrence of a kmer. Ideally you need about 30X depth of coverage to perform de novo assembly. This value defaults to 25, so excluding kmers below this level eliminates kmers where you won't get a good assembly, thus reducing false positives. The maximum value is 254, but the results are poor.
+
+__max_kmers_threshold__: This value lets you set a maximum threshold for the occurrence of a kmer. The occurrence of kmers forms a Poisson distribution, with a very long tail. With KMC, there is a catchall bin for occurrences of 255 and greater (so 255 is the maximum value). By default it is set to 254, which excludes this catchall bin and thus the long tail of very common kmers. This reduces false positives. Be careful when setting this lower, because you could exclude all of the interesting kmers.
+
+
+The following parameters have no impact on the results:
+
+__threads__: This sets the number of threads available to KMC and SPAdes. It should never be more than the number of CPUs available on the server. If you use a compute cluster, make sure to request the same number of threads on a single server. It defaults to 1 and you will get a reasonable speed increase by adding a few CPUs, but the benefit tails off quite rapidly since the I/O becomes the limiting factor (speed of reading files from a disk or network).
+
+__spades_exec__: By default SPAdes is assumed to be in your PATH and called spades.py. You can set this to point to a different executable, which might be required if you have multiple versions of SPAdes installed.
+
+__verbose__: By default the output is limited and all intermediate files are deleted. Setting this flag outputs more details of the software as it runs and keeps the intermediate files.
+
+##Required resources
+###RAM (memory)
+The largest consumer of RAM (memory) is SPAdes. Assembling a whole bacterial genome takes approximately 4GB of RAM. If the filtering allows everything through then this worst case will occur, but generally less than 1GB of RAM is required. Poor quality sequencing data will increase the amount of RAM required. In this instance running Trimmomatic first will help greatly.
+
+###Disk space
+By default all of the intermediate files are cleaned up at the end, so the overall disk space usage is quite low. As an example, an input of 800 Mbytes of compressed reads created 40 Mbytes of output data at the end. While the algorithm is running the disk usage will never exceed the size of the input reads. The intermediate files can be kept if you use the 'verbose' option. 
 
 #Outputs 
 For every trait sample you will get an assembly of nucleotide sequences in FASTA format. You will also get a text file describing the process, with versions of software, parameters used and references.
 
 #Installation
-kmc version 2.3
-spades 3.9.0
+There are a number of installation methods. Choosing the right one for the system you use will simplify the process.
+
+* Linux 
+  * Debian Testing/Ubuntu 16.04 (Xenial)
+  * Ubuntu 14.04 (Trusty)
+  * Ubuntu 12.04 (Precise)
+  * LinuxBrew
+* OSX
+  * HomeBrew
+* Linux/OSX/Windows/Cloud
+  * Docker
+
+##Linux
+The instructions for Linux assume you have root (sudo) on your machine.
+
+###Debian Testing/Ubuntu 16.04 (Xenial)
+
+```
+apt-get update -qq
+apt-get install -y kmc git python3 python3-setuptools python3-biopython python3-pip
+pip3 install git+git://github.com/sanger-pathogens/plasmidtron.git
+```
+
+###Ubuntu 14.04 (Trusty)
+You can either manually install [KMC](http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=about) and [SPAdes](http://bioinf.spbau.ru/spades), or use the install_dependancies script (you will need to add some paths to your PATH environment variable).
+
+```
+apt-get update -qq
+apt-get install -y wget git python3 python3-setuptools python3-biopython python3-pip
+wget https://raw.githubusercontent.com/sanger-pathogens/plasmidtron/master/install_dependancies.sh
+source ./install_dependancies.sh
+pip3 install git+git://github.com/sanger-pathogens/plasmidtron.git
+```
+
+###Ubuntu 12.04 (Precise)
+PlasmidTron uses BioPython, however the version of Python3 bundled with Precise is too old, so you will have to manually install Python3 (3.3+) along with pip3.
+Once you have done this you can proceed with the instructions below. You can either manually install [KMC](http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=about) and [SPAdes](http://bioinf.spbau.ru/spades), or use the install_dependancies script (you will need to add some paths to your PATH environment variable).
+
+```
+apt-get update -qq
+apt-get install -y git wget
+wget https://raw.githubusercontent.com/sanger-pathogens/plasmidtron/master/install_dependancies.sh
+source ./install_dependancies.sh
+pip3 install git+git://github.com/sanger-pathogens/plasmidtron.git
+```
+
+###Linuxbrew
+These instructions are untested. First install [LinuxBrew](http://linuxbrew.sh/), then follow the instructions below.
+
+```
+brew tap homebrew/science
+brew update
+brew install python3 kmc spades
+pip3 install git+git://github.com/sanger-pathogens/plasmidtron.git
+```
+
+##OSX
+###Homebrew
+These instructions are untested. First install [HomeBrew](http://brew.sh/), then follow the instructions below.
+
+```
+brew tap homebrew/science
+brew update
+brew install python3 kmc spades
+pip3 install git+git://github.com/sanger-pathogens/plasmidtron.git
+```
 
+#Linux/OSX/Windows/Cloud
 ##Docker 
-We have a docker container which gets automatically built from the latest version of PlasmidTron. To install it:
+Install [Docker](https://www.docker.com/).  We have a docker container which gets automatically built from the latest version of PlasmidTron. To install it:
 
 ```
 docker pull sangerpathogens/plasmidtron

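A minimal sketch of the two `--action` modes described in the README changes above, using plain Python sets. The function name `select_trait_kmers` and the in-memory sets are illustrative assumptions. PlasmidTron itself delegates kmer counting and filtering to KMC, so this is not its implementation, only the set logic the README describes.

```python
def select_trait_kmers(trait_kmer_sets, nontrait_kmer_sets, action="union"):
    """Return kmers found in the trait samples but in none of the nontrait samples.

    Hypothetical helper for illustration only.
    """
    trait_sets = [set(kmers) for kmers in trait_kmer_sets]
    if action == "union":
        # Kmers occurring in ANY trait sample.
        candidates = set().union(*trait_sets)
    elif action == "intersection":
        # Kmers occurring in ALL trait samples.
        candidates = set.intersection(*trait_sets) if trait_sets else set()
    else:
        raise ValueError("action must be 'union' or 'intersection'")

    # Discard anything seen in any nontrait (control) sample.
    nontrait = set().union(*(set(kmers) for kmers in nontrait_kmer_sets))
    return candidates - nontrait


trait = [{"AAA", "CCC", "GGG"}, {"AAA", "CCC", "TTT"}]
nontrait = [{"CCC"}]
union_kmers = select_trait_kmers(trait, nontrait, "union")
intersection_kmers = select_trait_kmers(trait, nontrait, "intersection")
```

With this toy input the union keeps every trait-only kmer, while the intersection keeps only the kmer shared by both trait samples, which mirrors the larger versus smaller assemblies the README describes.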
diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-0.0.2
+0.0.3
diff --git a/plasmidtron/Methods.py b/plasmidtron/Methods.py
@@ -78,7 +78,7 @@ def methods_paragraph(self, num_trait_samples, num_non_trait_samples, plasmidtro
 
 	def references_paragraph(self):
 		references_text =  'Anton Bankevich, Sergey Nurk, et. al. , and Pavel A. Pevzner. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19(5) (2012), 455-477. doi:10.1089/cmb.2012.0021\n'
-		references_text += 'Andrew J. Page, Alexander Wailan,Yan Shao, Nicholas R. Thomson, Jacqueline A. Keane, "PlasmidTron: kmer based de novo assembly of genome sequences based on phenotypes", in preparation (2017).\n'
+		references_text += 'Andrew J. Page, Alexander Wailan, Yan Shao, Martin G. Hunt, Nicholas R. Thomson, Jacqueline A. Keane, "PlasmidTron: kmer based de novo assembly of genome sequences based on phenotypes", in preparation (2017).\n'
 		references_text += 'Deorowicz, S., Kokot, M., Grabowski, Sz., Debudaj-Grabysz, A., KMC 2: Fast and resource-frugal k-mer counting, Bioinformatics, 2015; doi: 10.1093/bioinformatics/btv022.\n'
 		return references_text
 
diff --git a/plasmidtron/PlasmidTron.py b/plasmidtron/PlasmidTron.py
@@ -50,9 +50,11 @@ def run(self):
 			kmc_filter.filter_fastq_file_against_kmers()
 			kmc_filters.append(kmc_filter)
 
+		spades_assemblies = []
 		for sample in trait_samples:
 			spades_assembly = SpadesAssembly( sample, self.output_directory, self.threads, self.kmer, self.spades_exec, self.min_contig_len)
 			spades_assembly.run()
+			spades_assemblies.append(spades_assembly)
 			print(spades_assembly.filtered_spades_assembly_file() + '\n')
 			sample.cleanup()
 
@@ -69,4 +71,7 @@ def run(self):
 
 			for kmc_filter in kmc_filters:
 				kmc_filter.cleanup()
+
+			for spades_assembly in spades_assemblies:
+				spades_assembly.cleanup()
 
diff --git a/plasmidtron/SpadesAssembly.py b/plasmidtron/SpadesAssembly.py
@@ -2,6 +2,7 @@
 import logging
 import tempfile
 import subprocess
+import shutil
 from Bio import SeqIO
 
 '''Assemble a filtered sample with SPAdes'''
@@ -20,10 +21,10 @@ def spades_command(self):
 		return ' '.join([self.spades_exec, '--careful', '--only-assembler','-k', str(self.kmer), '-1', self.sample.filtered_forward_file, '-2', self.sample.filtered_reverse_file, '-o', self.spades_output_directory])
 
 	def spades_assembly_file(self):
-		return os.path.join(self.spades_output_directory,'contigs.fasta')
+		return os.path.join(self.spades_output_directory,'scaffolds.fasta')
 
 	def filtered_spades_assembly_file(self):
-		return os.path.join(self.spades_output_directory,'filtered_contigs.fasta')
+		return os.path.join(self.spades_output_directory,'filtered_scaffolds.fasta')
 
 	def remove_small_contigs(self,input_file, output_file):
 		with open(input_file, "r") as spades_input_file, open(output_file, "w") as spades_output_file:
@@ -38,3 +39,10 @@ def run(self):
 		self.logger.info("Assembling sample" )
 		subprocess.call(self.spades_command(), shell=True)
 		self.remove_small_contigs(self.spades_assembly_file(), self.filtered_spades_assembly_file())
+
+	def cleanup(self):
+		shutil.rmtree(os.path.join(self.spades_output_directory, 'tmp' ))
+		shutil.rmtree(os.path.join(self.spades_output_directory, 'mismatch_corrector' ))
+		shutil.rmtree(os.path.join(self.spades_output_directory, 'misc' ))
+		shutil.rmtree(os.path.join(self.spades_output_directory, 'K'+str(self.kmer) ))
+
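The cleanup step in this file could also be sketched as a standalone helper that tolerates missing directories. The directory names come from the diff above. The function name `remove_spades_intermediates` is hypothetical, and the use of `ignore_errors=True` is an assumption about desired behaviour (a SPAdes run that skips a stage would otherwise make `shutil.rmtree` raise), not something the commit does.

```python
import os
import shutil


def remove_spades_intermediates(spades_output_directory, kmer):
    """Delete SPAdes intermediate directories, leaving the assembly files in place.

    Hypothetical helper; directory names are taken from cleanup() in this diff.
    """
    # str(kmer) allows the kmer size to be passed as an int or a string.
    intermediate_dirs = ['tmp', 'mismatch_corrector', 'misc', 'K' + str(kmer)]
    for name in intermediate_dirs:
        path = os.path.join(spades_output_directory, name)
        # ignore_errors=True silently skips directories that do not exist.
        shutil.rmtree(path, ignore_errors=True)
```

Called as `remove_spades_intermediates(output_dir, 81)`, this removes `tmp`, `mismatch_corrector`, `misc` and `K81` if present and leaves files such as `scaffolds.fasta` untouched.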
diff --git a/plasmidtron/tests/PlasmidTron_test.py b/plasmidtron/tests/PlasmidTron_test.py
@@ -31,9 +31,9 @@ def test_small_valid_chrom_plasmid(self):
 		plasmid_tron = PlasmidTron(options)
 		plasmid_tron.run()
 
-		final_assembly = os.path.join(data_dir,'out/spades_S_typhi_CT18_chromosome_pHCM2/filtered_contigs.fasta')
+		final_assembly = os.path.join(data_dir,'out/spades_S_typhi_CT18_chromosome_pHCM2/filtered_scaffolds.fasta')
 
-		self.assertTrue(os.path.isfile(os.path.join(data_dir,'out/spades_S_typhi_CT18_chromosome_pHCM2/contigs.fasta')))
+		self.assertTrue(os.path.isfile(os.path.join(data_dir,'out/spades_S_typhi_CT18_chromosome_pHCM2/scaffolds.fasta')))
 		self.assertTrue(os.path.isfile(final_assembly))
 		'''The final assembly should be about 6k so leave some margin for variation in SPAdes'''
 		self.assertTrue(os.path.getsize(final_assembly) > 5000)

diff --git a/scripts/plasmidtron b/scripts/plasmidtron
@@ -23,7 +23,7 @@ parser.add_argument('file_of_trait_fastqs', help='File of filenames of trait (ca
 parser.add_argument('file_of_nontrait_fastqs', help='File of filenames of nontrait (control) FASTQs', type=InputTypes.is_file_of_nontrait_fastqs_valid)
 parser.add_argument('--action',	 '-a', help='Control how the traits kmers are filtered for assembly [%(default)s]', choices=['intersection','union'],  default = 'union')
 parser.add_argument('--kmer',	 '-k', help='Kmer to use, depends on read length [%(default)s]', type=InputTypes.is_kmer_valid,  default = 81)
-parser.add_argument('--min_contig_len',	 '-l', help='Minimum contig length in final assembly [%(default)s]', type=InputTypes.is_min_contig_len_valid, default = 600)
+parser.add_argument('--min_contig_len',	 '-l', help='Minimum contig length in final assembly [%(default)s]', type=InputTypes.is_min_contig_len_valid, default = 1000)
 parser.add_argument('--min_kmers_threshold',	 '-m', help='Exclude k-mers occurring less than this [%(default)s]', type=InputTypes.is_min_kmers_threshold_valid,  default = 25)
 parser.add_argument('--max_kmers_threshold',	 '-x', help='Exclude k-mers occurring more than this [%(default)s]', type=InputTypes.is_max_kmers_threshold_valid,  default = 254)
 parser.add_argument('--threads',  '-t', help='Number of threads [%(default)s]', type=InputTypes.is_threads_valid,  default = 1)
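The README guidance for `--kmer` and `--min_contig_len` can be turned into a small pre-flight check. This is a sketch under stated assumptions: `check_kmer` and `suggest_min_contig_len` are hypothetical names, they are not the `InputTypes` validators referenced in this script, and the 60-90% range and the factor of two are the README's rules of thumb rather than hard limits.

```python
def check_kmer(kmer, read_length):
    """Return a list of warnings for a kmer size, following the README guidance."""
    problems = []
    if kmer % 2 == 0:
        problems.append("kmer should be an odd number")
    if not 21 <= kmer <= 127:
        problems.append("kmer should be between 21 and 127 (SPAdes restriction)")
    fraction = kmer / read_length
    if not 0.6 <= fraction <= 0.9:
        problems.append("kmer should be about 60-90% of the read length")
    return problems


def suggest_min_contig_len(mean_insert_size):
    """About twice the mean fragment (insert) size, per the README."""
    return 2 * mean_insert_size


default_problems = check_kmer(81, 100)
even_problems = check_kmer(80, 100)
suggested_length = suggest_min_contig_len(500)
```

For 100 bp reads the default kmer of 81 passes all three checks, and a library with a mean insert size of 500 gives a suggested minimum contig length equal to the new default of 1000.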