
List of accession numbers for contamination in NR #2

Open
pmenzel opened this issue Jan 27, 2020 · 10 comments

Comments

@pmenzel

pmenzel commented Jan 27, 2020

Thanks for addressing the contamination issue (again)!

I am interested in cleaning up the NR database before using it for my application.
Would it be possible to also make available the list of the problematic accession numbers that you found in your contamination screen of the NR database?

thanks!
Peter

@martin-steinegger
Collaborator

Thank you @pmenzel
I will upload the results from the NR to the FTP tomorrow.

@martin-steinegger
Collaborator

Sorry for the delay. I have added the NR files to the FTP server: ftp://ftp.ccb.jhu.edu/pub/data/conterminator

There are two files: (1) nr.ids.gz, which contains only the identifiers, and (2) nr.gz, which shows where each protein originates from.

@pmenzel
Author

pmenzel commented Jan 31, 2020

Great, thanks! NB: the nr.ids.gz file has 14149 entries, while the manuscript mentions 14132 predicted contaminant proteins.

@martin-steinegger
Collaborator

martin-steinegger commented Jan 31, 2020

Thank you for catching this! The reported number in the paper is from the Kraken report. I lost some entries while converting the conterminator result to a Kraken output using krakenuniq-report (report: nr.krakenreport.zip). This is due to an inconsistency between the NCBI tax dumps used by conterminator and the krakenuniq database. I am sorry for the confusion.

You have to be a bit careful with the protein predictions. In the protein case it is harder for conterminator to predict the directionality of the contamination correctly, because it can only use abundance information (for nucleotide databases it uses the length). But there is a way to increase the precision by trading sensitivity: only select 1-to-N kingdom mappings. The contamination is then most likely the protein that occurs only once. You can select this case with the following awk command.

awk '(($2==1)+($3==1)+($4==1)+($5==1)+($6==1))==1 && (($2>1)+($3>1)+($4>1)+($5>1)+($6>1))==1 '  <(zcat nr.gz)
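The same 1-to-N filter can also be sketched in Python. This is a minimal illustration, not part of conterminator itself: it assumes nr.gz is whitespace-delimited with per-kingdom sequence counts in columns 2 through 6, mirroring the awk fields $2..$6 (the exact column layout of the conterminator output is an assumption here).

```python
import gzip

def is_one_to_n_contaminant(fields):
    """True if exactly one kingdom count equals 1 and exactly one is > 1.

    fields[1:6] are assumed to hold the per-kingdom counts, like the
    awk fields $2..$6 in the command above.
    """
    counts = [int(x) for x in fields[1:6]]
    return sum(c == 1 for c in counts) == 1 and sum(c > 1 for c in counts) == 1

def filter_contaminants(path):
    """Yield lines of a gzipped conterminator result passing the 1-to-N filter."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 6 and is_one_to_n_contaminant(fields):
                yield line.rstrip("\n")
```

A row with counts like `1 5 0 0 0` passes (the singleton protein is the likely contaminant), while `1 1 5 0 0` or `2 3 0 0 0` do not.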

@martin-steinegger
Collaborator

I will also upload a list of protein IDs encoded on the short contaminated nucleotide contigs. I assume it might be useful for you to also remove these.

@smsaladi

Thanks for this! Do you happen to have the UniProt identifiers handy (the 7359 entries mentioned in the preprint)? They would be useful to me.

Also, as an aside, have you considered hosting these accession numbers and maybe a little metadata in a simple HTML table (say with jquery-datatables loaded)? Maybe with gh-pages. It would make the data a bit more discoverable, and also accessible to non-computational folks.

@XiongGZ

XiongGZ commented Apr 22, 2020

When I use it, it just stops at "Download taxdump.tar.gz". I have downloaded taxdump.tar.gz from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz, but I don't know where to put it. Can you tell me what I should do, please?

@martin-steinegger
Collaborator

@xgz-98 Conterminator automatically downloads the taxdump from the NCBI site. You only need to provide a FASTA file and the respective mapping from identifier to taxid.

@XiongGZ

XiongGZ commented Apr 24, 2020

When I use the command "conterminator dna example/dna.fna example/dna.mapping ${RESULT_PREFIX} tmp", it always stops at the createtaxdb step.

Download taxdump.tar.gz
tar: Skipping to next header
2020-04-18 14:36:42 URL:https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz [51859296/51859296] -> "-" [7]

gzip: stdin: invalid compressed data--crc error

gzip: stdin: invalid compressed data--length error
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Error: createtaxdb step died

I don't know how to solve it.

@martin-steinegger
Collaborator

@xgz-98 I have opened a separate issue for this: #5
