
List of accession numbers for contamination in NR #2

Open
pmenzel opened this issue Jan 27, 2020 · 10 comments

Comments

@pmenzel

pmenzel commented Jan 27, 2020

Thanks for addressing the contamination issue (again)!

I am interested in cleaning up the NR database before using it for my application.
Would it be possible to also make available the list of the problematic accession numbers that you found in your contamination screen of the NR database?

thanks!
Peter

@martin-steinegger
Collaborator

Thank you @pmenzel
I will upload the results from the NR to the FTP tomorrow.

@martin-steinegger
Collaborator

Sorry for the delay. I have added the NR files to the FTP server: ftp://ftp.ccb.jhu.edu/pub/data/conterminator

There are two files: (1) nr.ids.gz, which contains only the identifiers, and (2) nr.gz, which shows where each protein originates from.

@pmenzel
Author

pmenzel commented Jan 31, 2020

Great, thanks! NB: the nr.ids.gz file has 14149 entries, while the manuscript mentions 14132 predicted contaminant proteins.

@martin-steinegger
Collaborator

martin-steinegger commented Jan 31, 2020

Thank you for catching this! The reported number in the paper is from the Kraken report. I lost some entries while converting the conterminator result to a Kraken output using krakenuniq-report (report: nr.krakenreport.zip). This is due to an inconsistency between the NCBI tax dumps used by conterminator and the krakenuniq database. I am sorry for the confusion.

You have to be a bit careful with the protein predictions. In the protein case it is harder for conterminator to predict the directionality of the contamination correctly, because it can only use abundance information (for nucleotide databases it uses the length). But there is a way to increase the precision by trading sensitivity: only select 1-to-N kingdom mappings. The contamination is then most likely the protein that occurs only once. You can select this case with the following awk command.

awk '(($2==1)+($3==1)+($4==1)+($5==1)+($6==1))==1 && (($2>1)+($3>1)+($4>1)+($5>1)+($6>1))==1 '  <(zcat nr.gz)
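The same 1-to-N filter can also be sketched in Python. This is a minimal illustration, not part of conterminator itself: it assumes nr.gz is whitespace-delimited with per-kingdom sequence counts in columns 2 through 6, mirroring the awk fields $2..$6 (the exact column layout of the conterminator output is an assumption here).

```python
import gzip

def is_one_to_n_contaminant(fields):
    """True if exactly one kingdom count equals 1 and exactly one is > 1.

    fields[1:6] are assumed to hold the per-kingdom counts, like the
    awk fields $2..$6 in the command above.
    """
    counts = [int(x) for x in fields[1:6]]
    return sum(c == 1 for c in counts) == 1 and sum(c > 1 for c in counts) == 1

def filter_contaminants(path):
    """Yield lines of a gzipped conterminator result passing the 1-to-N filter."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 6 and is_one_to_n_contaminant(fields):
                yield line.rstrip("\n")
```

A row with counts like `1 5 0 0 0` passes (the singleton protein is the likely contaminant), while `1 1 5 0 0` or `2 3 0 0 0` do not.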

@martin-steinegger
Collaborator

I will also upload a list of protein IDs encoded on the short contaminated nucleotide contigs. I assume it might be useful for you to also remove these.

@smsaladi

Thanks for this! Do you happen to have the UniProt identifiers handy (the 7359 entries mentioned in the preprint)? They would be useful to me.

Also, as an aside, have you considered hosting these accession numbers and maybe a little metadata in a simple HTML table (say with jquery-datatables loaded)? Maybe with gh-pages. It would make the data a bit more discoverable, and also accessible to non-computational folks.

@XiongGZ

XiongGZ commented Apr 22, 2020

When I use it, it just stops at "Download taxdump.tar.gz". I have downloaded taxdump.tar.gz from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz, but I don't know where to put it. Can you tell me what I should do, please?

@martin-steinegger
Collaborator

@xgz-98 Conterminator automatically downloads the taxdump from the NCBI site. You only need to provide a FASTA file and the respective mapping from identifier to taxid.

@XiongGZ

XiongGZ commented Apr 24, 2020

When I use the command "conterminator dna example/dna.fna example/dna.mapping ${RESULT_PREFIX} tmp", it always stops at the createtaxdb step.

Download taxdump.tar.gz
tar: Skipping to next header
2020-04-18 14:36:42 URL:https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz [51859296/51859296] -> "-" [7]

gzip: stdin: invalid compressed data--crc error

gzip: stdin: invalid compressed data--length error
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Error: createtaxdb step died

I don't know how to solve it.

@martin-steinegger
Collaborator

@xgz-98 I have opened a separate issue for this: #5
