List of accession numbers for contamination in NR #2
Comments
Thank you @pmenzel
Sorry for the delay. I have added the NR files to the FTP. There are two files (1)
Great, thanks! NB, the nr.ids.gz file has 14149 entries, while the manuscript mentions 14132 predicted contaminant proteins.
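An entry count like the one quoted above can be checked with a short script. This is only a sketch: it assumes the ID file is gzip-compressed plain text with one accession per line, and the function names `summarize_ids` and `summarize_id_file` are made up for illustration. It reports total and unique entries and any duplicates; it does not by itself explain a difference from the number in the manuscript.

```python
import gzip
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_ids(lines: Iterable[str]) -> Tuple[int, int, List[str]]:
    """Return (total entries, unique entries, sorted duplicate IDs).

    Assumes one accession per line; blank lines are ignored.
    """
    ids = [line.strip() for line in lines if line.strip()]
    counts = Counter(ids)
    duplicates = sorted(acc for acc, n in counts.items() if n > 1)
    return len(ids), len(counts), duplicates


def summarize_id_file(path: str) -> Tuple[int, int, List[str]]:
    """Same summary for a gzip-compressed ID list on disk."""
    with gzip.open(path, "rt") as handle:
        return summarize_ids(handle)
```

For example, `summarize_id_file("nr.ids.gz")` would give the counts for a locally downloaded copy of the list.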
Thank you for catching this! The number reported in the paper is from the kraken report. I lost some entries while converting the conterminator result to a kraken output using
You have to be a bit careful with the protein predictions. In the protein case it is harder for conterminator to predict the directionality of the contamination correctly, because it can only use abundance information (for nucleotide databases it uses the length). But there is a way to increase the precision by trading sensitivity if you only select
I will also upload a list of protein IDs encoded on the short contaminated nucleotide contigs. I assume it might be useful for you to also remove these.
Thanks for this! Do you happen to have the UniProt identifiers handy (7359 entries mentioned in the preprint)? This would be useful to me. Also, as an aside, have you considered hosting these accession numbers and maybe a little metadata in a simple HTML table (say with jquery-datatables loaded)? Maybe with gh-pages. It would make the data a bit more discoverable -- and also accessible to non-computational folks.
When I use it, it just stops at "Download taxdump.tar.gz". I downloaded taxdump.tar.gz from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz myself, but I don't know where to put it. Can you please tell me what I should do?
@xgz-98 Conterminator automatically downloads the taxdump from the NCBI site. You only need to provide a FASTA file and the respective mapping from identifier to taxid.
When I use the command "conterminator dna example/dna.fna example/dna.mapping ${RESULT_PREFIX} tmp", it always stops at the createtaxdb step with this output:
Download taxdump.tar.gz
gzip: stdin: invalid compressed data--crc error
gzip: stdin: invalid compressed data--length error
I don't know how to solve it.
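gzip CRC and length errors like the ones above typically point to a corrupted or incomplete download. A manually downloaded copy of taxdump.tar.gz can be checked before use with a sketch like the following. It is not part of conterminator; the function name `gzip_is_intact` is hypothetical, and it needs Python 3.8 or later for `gzip.BadGzipFile`.

```python
import gzip
import zlib


def gzip_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Decompress the whole file once and report whether it is valid gzip.

    Reading to the end forces the CRC and length checks stored in the
    gzip trailer. A missing file still raises FileNotFoundError.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass  # discard the data; only the integrity checks matter
    except (gzip.BadGzipFile, EOFError, zlib.error):
        # BadGzipFile: bad header or CRC/length mismatch
        # EOFError: file ends before the end-of-stream marker (truncated)
        # zlib.error: the compressed stream itself is invalid
        return False
    return True
```

If `gzip_is_intact("taxdump.tar.gz")` returns `False`, downloading the file again is the obvious first step.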
@xgz-98 I have opened a separate issue, #5.
Thanks for addressing the contamination issue (again)!
I am interested in cleaning up the NR database before using it for my application.
Would it be possible to also make available the list of the problematic accession numbers that you found in your contamination screen of the NR database?
thanks!
Peter
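For the cleanup described in this request, removing listed accessions from an NR FASTA file could look roughly like the sketch below. It rests on several assumptions that are not confirmed in this thread: the ID list holds one accession per line, written exactly as in the FASTA headers (including any version suffix); NR headers for identical sequences are joined with Ctrl-A (`\x01`); and a record is dropped if any of its accessions is listed. The function names and the file paths passed to `clean_nr` are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, List, Set


def load_ids(lines: Iterable[str]) -> Set[str]:
    """Collect accession IDs, one per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def header_accessions(header: str) -> List[str]:
    """Extract every accession from a FASTA header line.

    Assumption: merged NR headers are separated by Ctrl-A (\\x01) and each
    entry starts with its accession, followed by whitespace.
    """
    entries = header[1:].split("\x01")
    return [entry.split(None, 1)[0] for entry in entries if entry.strip()]


def filter_fasta(fasta: Iterable[str], blocked: Set[str]) -> Iterator[str]:
    """Yield the lines of all records that carry no blocked accession."""
    keep = True
    for line in fasta:
        if line.startswith(">"):
            # Decide once per record; sequence lines follow the header's fate.
            keep = not any(acc in blocked for acc in header_accessions(line))
        if keep:
            yield line


def clean_nr(ids_path: str, fasta_path: str, out_path: str) -> None:
    """Stream a gzip-compressed FASTA file through the filter."""
    with gzip.open(ids_path, "rt") as ids:
        blocked = load_ids(ids)
    with gzip.open(fasta_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(filter_fasta(src, blocked))
```

A call such as `clean_nr("nr.ids.gz", "nr.gz", "nr.clean.gz")` streams the file record by record, so the whole database never has to fit in memory. Dropping a merged record when only one of its accessions is listed is a conservative choice; keeping the record and deleting just the listed header entry would be an alternative.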