Empty (len:0) sequences in plass output #33
Comments
Thanks for the quick response! Then, what is the minimum reported length by default, and why are there empty sequences? Shouldn't these be filtered out before the final assembly is written?
Does it report a FASTA header and then no residues after the header? We shouldn't be able to produce 0-length sequences, and I am not sure why that would happen. Do you have a small(er) set to reproduce this behavior? We should probably also add a parameter to filter out sequences under a given length.
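To check for the case asked about here (a FASTA header with no residues after it), a minimal sketch like the following could be run over the output file. It assumes plain FASTA text; the function name `find_empty_records` and the example headers are hypothetical and not part of plass.

```python
def find_empty_records(lines):
    """Return the headers of FASTA records that contain no residues."""
    empty = []
    header = None   # header of the record currently being read
    length = 0      # residues seen so far for that record
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: report the previous one if it was empty.
            if header is not None and length == 0:
                empty.append(header)
            header = line[1:]
            length = 0
        elif header is not None:
            length += len(line)
    # Handle the last record in the file.
    if header is not None and length == 0:
        empty.append(header)
    return empty


# Hypothetical example input, not real plass output.
example = [">seq1 len:3", "MKV", ">seq2 len:0", "", ">seq3 len:2", "MA", ">seq4 len:0"]
print(find_empty_records(example))  # ['seq2 len:0', 'seq4 len:0']
```

For a real file, the same function can be called on an open file handle, e.g. `find_empty_records(open("assembly.fas"))`, where the file name is a placeholder.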
I already introduced a parameter for that in the nucleassemble command: --min-contig-len. We could simply adopt it, couldn't we?

I also ran a small example where I could reproduce the error of getting sequences of length 0 (header: len:0). I have only had a short look at the code so far, but it seems that something goes wrong with the db output. filtercoding can produce sequences of length 0 when it finds non-coding sequences, but this is not the actual problem, because the ‘only assembled’ filter should throw such sequences away afterwards anyway. However, there seems to be a problem with newlines in the sequence db file written by the current version: if there is more than one sequence per line, this confuses the awk command that subsequently checks for start and stop codons. As a result, the wrong sequences are sometimes sorted out, and we sometimes keep the sequences of length 0.

I did not figure out the reason for the missing newlines in the db output. @martin-steinegger @milot-mirdita, can one of you check that? It affects only some sequences, not all. Or is that even intended? In that case it still does not fit with the awk command for finding start and stop codons.
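Until such a length filter exists in plass itself, the final FASTA could be filtered after the fact. The following is only a rough sketch of that idea, not how plass or --min-contig-len is implemented. The function name `filter_fasta` is hypothetical, the default of 45 simply mirrors the --min-length default quoted in this issue, and every character on a sequence line (including a trailing `*`, if present) is counted as a residue.

```python
def filter_fasta(lines, min_len=45):
    """Yield (header, sequence) pairs with at least min_len residues."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Emit the previous record if it is long enough.
            if header is not None:
                seq = "".join(chunks)
                if len(seq) >= min_len:
                    yield header, seq
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    # Emit the last record if it is long enough.
    if header is not None:
        seq = "".join(chunks)
        if len(seq) >= min_len:
            yield header, seq


# Hypothetical example input; min_len is lowered to keep the example short.
records = [">a", "MKV", ">b len:0", "", ">c", "MKVLAA"]
for header, seq in filter_fasta(records, min_len=4):
    print(f">{header}\n{seq}")
```

With `min_len=4`, only record `c` is printed; the 3-residue record and the empty `len:0` record are dropped.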
Expected Behavior
Do not write sequences in the output shorter than --min-length, which is 45 aa by default.
Current Behavior
Sequences shorter than --min-length are being written in the output, even empty ones (len:0).
Steps to Reproduce (for bugs)
Plass Output (for bugs)
General output: https://gist.github.com/aleixop/76bd8e2fc4e9a88ba7072f470abbc600
Context
Co-assembly of ~300M PE reads with default parameters that runs smoothly without errors.
250 GB of RAM and 48 CPUs.
Your Environment
Include as many relevant details about the environment you experienced the bug in.
5d03cce371dc51c23652a251550c33fd0358690d