problems downloading representative genome #99

Open
jotech opened this issue Oct 1, 2019 · 4 comments

@jotech

jotech commented Oct 1, 2019

I'm trying to download the RefSeq representative genome for Vibrio lentus, as it is listed here
https://www.ncbi.nlm.nih.gov/genome/?term=vibrio+lentus%5Borgn%5D
and it also has the corresponding RefSeq category
http://tiny.cc/hncrdz

But when I try to download the genomes

ncbi-genome-download --dry-run -R representative --taxid 136468 bacteria
ncbi-genome-download --dry-run -R representative --genus "Vibrio lentus" bacteria
ERROR: No downloads matched your filter. Please check your options.

Besides this, ncbi-genome-download --dry-run --taxid 136468 bacteria shows me all 87 available genomes, but I'm looking for the representative only.
What am I missing?

@jrjhealey
Contributor

jrjhealey commented Oct 1, 2019

This appears to be a problem with the use of -R representative, but more specifically with the actual data for that entry...

If you query the assembly_summary file for that genome (wget ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt):

$ grep "GCF_001691195.1" assembly_summary_refseq.txt
GCF_001691195.1 PRJNA224116     SAMN04867935    MAKA00000000.1  na      136468  136468  Vibrio lentus   strain=5F79             latest  Scaffold        Major   Full    2016/07/21      ASM169119v1     Massachusetts Institute of Technology     GCA_001691195.1 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/195/GCF_001691195.1_ASM169119v1

If we compare this with an entry that is tagged representative genome, there's an na where representative would be expected. I was able to download the genome successfully by providing the accession directly with -A GCA_001691195.1.

$ grep -i "representative" assembly_summary_refseq.txt | head -1
GCF_000001765.3 PRJNA18793      SAMN00779672    AADE00000000.1  representative genome   46245   7237    Drosophila pseudoobscura pseudoobscura  strain=MV2-25           latest  Chromosome      Major   Full    2013/04/11      Dpse_3.0  Baylor College of Medicine      GCA_000001765.2 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/765/GCF_000001765.3_Dpse_3.0

Perhaps Kai knows differently, but this appears to be an issue with the actual NCBI records?

Closer inspection of the summary file shows that all 87 genomes have na for that column. In that case, I don't think this is something this tool will be able to help you with.
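
The column check described above could be sketched in Python roughly as follows. This is only an illustration, not part of ncbi-genome-download: it assumes the tab-separated layout of assembly_summary_refseq.txt in which refseq_category is the fifth column and species_taxid the seventh, the function name is made up, and the sample rows are the two grep results above cut down to their first nine columns.

```python
from collections import Counter

# 0-based column positions, assumed from the header line of
# assembly_summary_refseq.txt (refseq_category = column 5, species_taxid = column 7).
REFSEQ_CATEGORY = 4
SPECIES_TAXID = 6


def tally_refseq_categories(lines, species_taxid):
    """Count refseq_category values over all rows matching a species taxid."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= SPECIES_TAXID:
            continue  # skip truncated rows
        if fields[SPECIES_TAXID] == str(species_taxid):
            counts[fields[REFSEQ_CATEGORY]] += 1
    return counts


# Sample rows abbreviated from the grep output above (first nine columns only).
sample = [
    "# assembly_accession\tbioproject\tbiosample\twgs_master\trefseq_category"
    "\ttaxid\tspecies_taxid\torganism_name\tinfraspecific_name",
    "GCF_001691195.1\tPRJNA224116\tSAMN04867935\tMAKA00000000.1\tna"
    "\t136468\t136468\tVibrio lentus\tstrain=5F79",
    "GCF_000001765.3\tPRJNA18793\tSAMN00779672\tAADE00000000.1\trepresentative genome"
    "\t46245\t7237\tDrosophila pseudoobscura pseudoobscura\tstrain=MV2-25",
]

print(tally_refseq_categories(sample, 136468))
```

To run it against the downloaded file, pass an open file handle instead of the sample list, e.g. `with open("assembly_summary_refseq.txt") as fh: print(tally_refseq_categories(fh, 136468))`. If the inspection above holds, every Vibrio lentus row would be counted under na.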

@jotech
Author

jotech commented Oct 2, 2019

Thanks for your answer!

The missing representative tag is really strange because it is actually there in the source assembly report:

curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/195/GCF_001691195.1_ASM169119v1/GCF_001691195.1_ASM169119v1_assembly_report.txt | grep "RefSeq category"
# RefSeq category: Representative Genome

It seems the assembly report and the assembly summary are inconsistent?
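
A minimal sketch of that comparison in Python, assuming the report header uses `# Key: value` lines as in the curl output above. The function names are hypothetical, the report text is reduced to the single line shown above, and "na" is the refseq_category value from the summary row quoted earlier in this thread.

```python
def report_refseq_category(report_text):
    """Return the value of the '# RefSeq category:' header line, or None if absent."""
    prefix = "# RefSeq category:"
    for line in report_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def categories_agree(report_value, summary_value):
    """Compare both sources case-insensitively; a missing report value counts as 'na'."""
    return (report_value or "na").lower() == summary_value.lower()


# Single header line taken from the curl output above.
report_text = "# RefSeq category: Representative Genome\n"
# refseq_category value from the assembly summary row for GCF_001691195.1.
summary_value = "na"

report_value = report_refseq_category(report_text)
print(report_value, "|", summary_value, "|", categories_agree(report_value, summary_value))
```

For this assembly the two values differ, which is the inconsistency in question; the same check could be looped over other accessions to see how widespread it is.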

@jrjhealey
Contributor

That’s certainly how I would interpret that. There may be a good reason for the na values in the assembly summary, but if there is, I don’t know what it is!

I think it would be worth contacting NCBI about this, though, in case it is a mistake.

@kblin
Owner

kblin commented Oct 3, 2019

I concur with @jrjhealey; I think this is just an issue on the NCBI side of things.
