--match-per-kmer increasing indefinitely #97

Open
sean-workman opened this issue Oct 18, 2024 · 4 comments

Comments

@sean-workman

Hello Jaebom,

I am running into a strange problem using the Refseq Prokaryote/Viral DB to classify paired-end short reads. As the run goes on, Metabuli keeps increasing the --match-per-kmer parameter and never gets past the comparison step:

classify fastp_def/CMAB71413_OST_fastp_R1.fastq.gz fastp_def/CMAB71413_OST_fastp_R2.fastq.gz /home/sdwork/scratch/metabuli_dbs/refseq_prokaryote_virus metabuli_prokaryote_viral CMAB71413_OST --threads 32 --max-ram 150

Metabuli Version (commit):                            	77f39b659ec041699f72f15636655ee9500f7642
Threads                                               	32
Sequencing type                                       	2
Min. sequence similarity score                        	0
Min. query coverage                                   	0
Min. num. of cons. matches for non-euk. classification	4
Min. num. of cons. matches for euk. classification    	9
Min. score for species- or lower-level classification.	0
Allowed extra Hamming distance                        	0
Directory where the taxonomy dump files are stored
Mask residues                                         	0
Mask residues probability                             	0.9
RAM usage in GiB                                      	150
Number of matches per query k-mer.                    	4
Accession-level DB build/search                       	0
Best * --tie-ratio is considered as a tie             	0.95

DB name: 1287220998
DB creation date: 2024-4-1
Loading the list for taxonomy IDs ... Done
Indexing query file ...Done
Total number of sequences: 57358948
Total read length: 12407825166nt
Extracting query metamers ...
Time spent for metamer extraction: 19
Sorting query metamer list ...
Time spent for sorting query metamer list: 16
Comparing query and reference metamers...
--match-per-kmer was increased to 8 and searching again...
Extracting query metamers ...
Time spent for metamer extraction: 9
Sorting query metamer list ...
Time spent for sorting query metamer list: 9
Comparing query and reference metamers...
--match-per-kmer was increased to 12 and searching again...
Extracting query metamers ...
Time spent for metamer extraction: 6
Sorting query metamer list ...
Time spent for sorting query metamer list: 6
Comparing query and reference metamers...
--match-per-kmer was increased to 16 and searching again...
Extracting query metamers ...
Time spent for metamer extraction: 5
Sorting query metamer list ...
Time spent for sorting query metamer list: 4
Comparing query and reference metamers...
--match-per-kmer was increased to 20 and searching again...

.....

--match-per-kmer was increased to 144 and searching again...
Extracting query metamers ...
Time spent for metamer extraction: 1
Sorting query metamer list ...
Time spent for sorting query metamer list: 0
Comparing query and reference metamers...

These reads can be classified with the Refseq Viral DB, as well as with the ICTV DB using your personal fork of Metabuli (#96). The time I had requested from our cluster ran out while the parameter was at 144. Any thoughts on what could be causing this?

Thanks!
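
For anyone triaging a similar run, the escalation in the log above can be summarized with a small script. This is a minimal sketch and not part of Metabuli: the function names `match_per_kmer_history` and `step_sizes` are hypothetical, and the regular expression assumes the exact log wording quoted in this issue.

```python
import re

# Assumption: the log line has the wording shown in this issue, e.g.
# "--match-per-kmer was increased to 8 and searching again..."
INCREASE_RE = re.compile(r"--match-per-kmer was increased to (\d+)")


def match_per_kmer_history(log_text):
    """Return every value --match-per-kmer was raised to, in log order."""
    return [int(m.group(1)) for m in INCREASE_RE.finditer(log_text)]


def step_sizes(values):
    """Return the difference between each pair of consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


# Excerpt copied from the log in this issue.
sample_log = """\
Comparing query and reference metamers...
--match-per-kmer was increased to 8 and searching again...
Extracting query metamers ...
Comparing query and reference metamers...
--match-per-kmer was increased to 12 and searching again...
Extracting query metamers ...
Comparing query and reference metamers...
--match-per-kmer was increased to 16 and searching again...
"""

history = match_per_kmer_history(sample_log)
print(history)              # [8, 12, 16]
print(step_sizes(history))  # [4, 4]
```

Running this over the full Slurm output file would show how many retries happened before the job timed out, which may be useful to include in a bug report.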

@jaebeom-kim
Member

Hi Sean!
Thank you for reaching out!
My personal fork mentioned here is currently compatible only with the ICTV DB.
I think the result from using the refseq-viral DB must be erroneous.
Please try the latest release in the main repository :)

@sean-workman
Author

Oh yikes haha I assumed that working with the ICTV database was an additional feature. Will go give that a go, thanks!

@sean-workman
Author

Oh - it would seem I am using the appropriate Metabuli.

metabuli Version: dc985814afbcbd2fda50878138165ea3006ce4f1

My Slurm submission script is identical to the one I successfully used with the Refseq Viral DB, except of course for changing which DB I am using.

@jaebeom-kim
Member

Hmm.. The version written in the first comment of this issue is 77f39b659ec041699f72f15636655ee9500f7642.

Metabuli Version (commit): 77f39b659ec041699f72f15636655ee9500f7642.

If you still encounter an error with dc985814afbcbd2fda50878138165ea3006ce4f1, could you provide the sample to reproduce the error?
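
Since the question here is which build actually produced a given log, one way to check is to pull the commit hash straight out of the log header. The following is a minimal sketch, assuming only the two header formats quoted in this thread; `reported_commit` is a hypothetical helper, not a Metabuli feature.

```python
import re

# Assumption: the header looks like one of the two lines quoted in this thread:
#   "Metabuli Version (commit):    <40-char hash>"
#   "metabuli Version: <40-char hash>"
VERSION_RE = re.compile(
    r"^\s*metabuli version(?: \(commit\))?:\s*([0-9a-f]{40})\b",
    re.IGNORECASE | re.MULTILINE,
)


def reported_commit(log_text):
    """Return the commit hash printed in a log header, or None if absent."""
    match = VERSION_RE.search(log_text)
    return match.group(1).lower() if match else None


first_log = "Metabuli Version (commit):      \t77f39b659ec041699f72f15636655ee9500f7642\nThreads\t32\n"
later_output = "metabuli Version: dc985814afbcbd2fda50878138165ea3006ce4f1\n"

print(reported_commit(first_log))     # 77f39b659ec041699f72f15636655ee9500f7642
print(reported_commit(later_output))  # dc985814afbcbd2fda50878138165ea3006ce4f1
```

If the hash from the job's own log differs from the one printed by the binary on the login node, the job is probably picking up a different installation (for example via a different module or PATH inside the Slurm environment).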
