Name	Type	Value Range	Default	Description	For Goals
`logLevel`	String	`all`, `trace`, `debug`, `info`, `warn`, `error`, `fatal`, `off`	`info`	Only the log levels `error`, `warn`, `info` and `trace` are used by Genestrip.	all
`logProgressUpdateCycle`	long	[0, 2147483647]	`1000000`	Affects the log level `trace`: defines after how many reads per fastq file information on the matching progress is logged. If less than 1, no progress information is logged.	`match`, `matchlr`, `filter`
`threads`	int	[-1, 64]	`-1`	The number of consumer threads n when processing data with respect to the goals `match` and `filter`, and also during the update phase of the `db` goal. There is always one additional thread that reads and uncompresses the corresponding fastq or fasta file (so there are n + 1 threads in total). When negative, the number of available processors - 1 is used as n. When 0, the corresponding goals run in single-threaded mode.	`db`, `match`, `matchlr`, `filter`
`httpBaseURL`	String		`https://ftp.ncbi.nlm.nih.gov`	This base URL will be extended by `/pub/taxonomy/` in order to download the taxonomy file `taxdmp.zip` and by `/genomes/genbank` for files from Genbank.	`db`
`ftpBaseURL`	String		`ftp.ncbi.nih.gov`		`db`
`refseqHttpBaseURL`	String		`https://ftp.ncbi.nlm.nih.gov/refseq`	This mirror might be considered as an alternative. (No other mirror sites are known.)	`db`
`refseqFTPBaseURL`	String		`ftp.ncbi.nih.gov`		`db`
`useHttp`	boolean		`true`	Use http(s) to download data from NCBI. If `false`, then Genestrip will do anonymous FTP instead (with login and password set to `anonymous`).	`db`
`ignoreMissingFastas`	boolean		`false`	If `true`, then a download of files from NCBI will not stop in case a file is missing on the server.	`db`
`maxDownloadTries`	int	[1, 1024]	`5`	The number of download attempts for a file before giving up.	`db`
`seqType`	nominal	`GENOMIC`, `RNA`, `BOTH`	`GENOMIC`	Which type of sequence files to include from the RefSeq. Possible values are `genomic`, `rna` or `both`. RNA files from the RefSeq end with `rna.fna.gz`, whereas genomes end with `genomic.fna.gz`.	`db`
`rankCompletionDepth`	nominal	`superkingdom`, `kingdom`, `phylum`, `subphylum`, `superclass`, `class`, `subclass`, `superorder`, `order`, `suborder`, `superfamily`, `family`, `subfamily`, `clade`, `genus`, `subgenus`, `species group`, `species`, `varietas`, `subspecies`, `serogroup`, `biotype`, `strain`, `serotype`, `genotype`, `forma`, `forma specialis`, `isolate`, `no rank`	`no rank`	The rank up to which tax ids from `taxids.txt` will be completed by descendants of the taxonomy tree (the set rank included). If not set, the completion will traverse down to the lowest possible levels of the taxonomy. Typical values could be `genus`, `species` or `strain`, but all values used for assigning ranks in the taxonomy are possible.	`db`
`maxGenomesPerTaxid`	int	[1, 2147483647]	`2147483647`	The maximum number of genomes per tax id from the RefSeq to be included in the database. Note that this is an important parameter for controlling database size, because in some cases there are millions of genomic entries for a single tax id, such as `573` (which does not even account for entries of its descendants).	`db`
`completeGenomesOnly`	boolean		`false`	If `true`, then only genomic accessions with the prefixes `AC`, `NC_`, `NZ_` will be considered when generating a database. Otherwise, all genomic accessions will be considered. See RefSeq accession numbers and molecule types for details.	`db`
`refSeqLimitForGenbankAccess`	int	[0, 2147483647]	`0`	Determines whether Genestrip should try to look up genomic fasta files from Genbank if the number of corresponding reference genomes from the RefSeq is below the given limit for a requested tax id. E.g. `refSeqLimitForGenbankAccess=1` would imply that Genbank is consulted if not a single reference genome is found in the RefSeq for a requested tax id. The default `refSeqLimitForGenbankAccess=0` essentially deactivates this feature. In addition, Genbank access is also influenced by the keys `fastaQualities` and `maxFromGenBank` (see below).	`db`
`maxFromGenBank`	int	[-1, 2147483647]	`1`	Determines the maximum number of fasta files used from Genbank per requested tax id. If the number of matching files exceeds `maxFromGenBank`, then only the best ones according to `fastaQualities` will be retained so that this maximum is still respected.	`db`
`fastaQualities`	list of nominals	`ADDITIONAL`, `COMPLETE_LATEST`, `COMPLETE`, `CHROMOSOME_LATEST`, `CHROMOSOME`, `SCAFFOLD_LATEST`, `SCAFFOLD`, `CONTIG_LATEST`, `CONTIG`, `LATEST`, `NONE`	`COMPLETE_LATEST,CHROMOSOME_LATEST`	Determines the allowed quality levels of fasta files from Genbank. The values must be comma-separated. If a value is included in the list, then a fasta file for a requested tax id on that quality level will be included, otherwise not (while also respecting the conditions exerted via the keys `refSeqLimitForGenbankAccess` and `maxFromGenBank`). The quality levels are based on Genbank's Assembly Summary File (columns `version_status` and `assembly_level`).	`db`
`kMerSize`	int	[15, 64]	`31`	The number of base pairs k for k-mers. Changes to this value do not affect the memory usage of the database. A value > 32 will cause collisions, i.e. it leads to false positives for the `match` goal.	`db`
`maxDust`	int	[-1, 2147483647]	`-1`	When generating a database via the goal `db`, any low-complexity k-mer with too many repetitive sequences of base pairs may be omitted from storage. To do so, Genestrip employs a simple genetic dust-filter for k-mers: It assigns a dust value d to each k-mer, and if d > `maxDust`, then the k-mer will not be stored. Given a k-mer with n repeating base pairs of repeat length k(1), ... k(n) with k(i) > 1, then d = fib(k(1)) + ... + fib(k(n)), where fib(k(i)) is the Fibonacci number of k(i). E.g., for the 8-mer `TTTCGGTC`, we have n = 2 with k(1) = 3, k(2) = 2 and d = fib(3) + fib(2) = 2 + 1 = 3. For practical purposes, `maxDust = 20` may be suitable. In this case, if 31-mers were generated uniformly at random, then about 0.2 % of them would be omitted. If `maxDust = -1`, then dust-filtering is inactive.	`db`
`classifyReads`	boolean		`true`	Whether to do read classification in the style of Kraken and KrakenUniq. Matching is faster without read classification, and the columns `kmers`, `unique kmers` and `max contig length` in resulting CSV files are usually more conclusive anyway, in particular with respect to long reads. When read classification is off, the columns `reads` and `kmers from reads` will be 0 in resulting CSV files.	`match`
`countUniqueKMers`	boolean		`true`	If `true`, the number of unique k-mers will be counted and reported. This requires less than 5% of additional main memory.	`match`, `matchlr`
`writeFilteredFastq`	boolean		`false`	If `true`, then the goal `match` writes filtered fastq files in the same way that the goal `filter` does.	`match`, `matchlr`
`writeKrakenStyleOut`	boolean		`false`	If `true`, Genestrip will write output files with suffix `.out` in the Kraken output format under `<base dir>/projects/<project_name>/krakenout` covering all reads with at least one matching k-mer.	`match`, `matchlr`
`normalizedKMersFactor`	long	[1, 9223372036854775807]	`1000000000`	A factor used to compute `normalized kmers` at read analysis time.	`match`, `matchlr`
`useBloomFilterForMatch`	boolean		`true`	If `true`, a bloom filter will be loaded and used during fastq file analysis (i.e. matching). Using the bloom filter tends to shorten matching time if most of the reads cannot be classified because they contain no k-mers from the database. Otherwise, using the bloom filter might increase matching time by up to 30%. It also requires more main memory.	`match`, `matchlr`
`maxReadTaxErrorCount`	double	[0.0, 1.7976931348623157E308]	`0.5`	The absolute or relative maximum number of k-mers that do not have to be in the database for a read to be classified. If the number is above `maxReadTaxErrorCount`, then the read will not be classified. Otherwise, the read will be classified in the same way as done by Kraken. If `maxReadTaxErrorCount` is >= 1, then it is interpreted as an absolute number of k-mers. Otherwise (i.e. if >= 0 and < 1), it is interpreted as the ratio between the k-mers not in the database and all k-mers of the read.	`match`, `matchlr`
`maxKMerResCounts`	int	[0, 65536]	`0`	If > 0, the corresponding number of frequencies of the most frequent k-mers per tax id will be reported.	`match`, `matchlr`
`writeDumpedFastq`	boolean		`false`	If `true`, then `filter` will also generate a fastq file `dumped_...` with all reads not written to the corresponding filtered fastq file.	`filter`
`minPosCountFilter`	int	[0, 1024]	`1`	The minimum number of a read's k-mers to be found in the bloom index such that the read is added to the filtered fastq file. If `minPosCountFilter=0`, then `posRatioFilter` becomes effective.	`filter`
`posRatioFilter`	double	[0.0, 1.0]	`0.2`	Only effective if `minPosCountFilter=0`: The minimum ratio of a read's k-mers to be found in the bloom index such that the read is added to the filtered fastq file.	`filter`
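
The dust value described for `maxDust` can be illustrated with a short sketch. It follows only the formula and the worked example in the table (runs of identical bases longer than 1, Fibonacci numbers with fib(1) = fib(2) = 1) and is not taken from Genestrip's source code. The function names `fib`, `dust_value` and `is_stored` are hypothetical and chosen for illustration.

```python
from itertools import groupby


def fib(n: int) -> int:
    """Fibonacci number with fib(1) = fib(2) = 1, as implied by the table's example."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def dust_value(kmer: str) -> int:
    """Sum fib(length) over all runs of identical bases whose length is > 1."""
    run_lengths = (sum(1 for _ in run) for _, run in groupby(kmer))
    return sum(fib(length) for length in run_lengths if length > 1)


def is_stored(kmer: str, max_dust: int) -> bool:
    """Assumed reading of the rule: a k-mer is dropped only if d > max_dust.

    A max_dust of -1 means that dust filtering is inactive.
    """
    if max_dust == -1:
        return True
    return dust_value(kmer) <= max_dust


# The 8-mer from the table: runs TTT (length 3) and GG (length 2).
print(dust_value("TTTCGGTC"))
```

Under this reading, `is_stored("TTTCGGTC", 20)` is true, whereas a long homopolymer such as 31 consecutive `A` bases has a dust value far above 20 and would be omitted.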
