Metadata and download #126
Comments
Hi Cecile, I am quite interested in the possibility of getting BioSample attributes (such as host, collection date, geographic location, ...) for prokaryotic genomes downloaded using ncbi-genome-download. Thanks.
Hi, I don't think you can get this information directly with ncbi-genome-download; it will probably require querying other databases. In general, this kind of information is not well standardized, so it can be tricky to automate the queries.
My first idea would be to extract the BioProject id and/or BioSample id from the metadata file provided by ncbi-genome-download and use the NCBI Entrez API to search the NCBI BioProject and BioSample databases. For Python, I know this can be done with the Biopython library (http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec138), but maybe simpler libraries exist.
But from a quick check, there doesn't seem to be much information (not to say none at all) in the BioProject and BioSample databases.
A sample record can be fetched as XML from ENA, for example:
    import requests
    r = requests.get("https://www.ebi.ac.uk/ena/data/view/SAMEA1705929&display=xml")
Then you need to parse the XML. I know the beautifulsoup4 or xml.etree Python libraries can do that, but I don't know exactly how to use them.
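For the two steps mentioned above (pulling BioSample ids out of the metadata file, then parsing the XML), here is a minimal sketch using only the standard library. It rests on assumptions not confirmed in this thread: that the metadata file is tab-separated with a column named `biosample`, and that the XML stores attributes as `SAMPLE_ATTRIBUTE` elements with `TAG` and `VALUE` children. The function names and the example record are made up for illustration, so check them against a real file and a real response.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up record for illustration only. The SAMPLE_ATTRIBUTE / TAG / VALUE
# layout is an assumption about the XML returned for a sample.
EXAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE accession="SAMEA0000000">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>host</TAG><VALUE>Homo sapiens</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>collection_date</TAG><VALUE>2010</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def biosample_ids(metadata_tsv, column="biosample"):
    """Collect non-empty ids from one column of a tab-separated table.

    `metadata_tsv` is any open text file; the column name is an assumption.
    """
    reader = csv.DictReader(metadata_tsv, delimiter="\t")
    return [row[column] for row in reader if row.get(column)]


def sample_attributes(xml_text):
    """Return {tag: value} for every SAMPLE_ATTRIBUTE element in the XML."""
    root = ET.fromstring(xml_text)
    attributes = {}
    for attribute in root.iter("SAMPLE_ATTRIBUTE"):
        tag = attribute.findtext("TAG")
        if tag:
            value = attribute.findtext("VALUE") or ""
            attributes[tag.strip()] = value.strip()
    return attributes


if __name__ == "__main__":
    # Tiny in-memory table standing in for the real metadata file.
    table = io.StringIO("assembly_accession\tbiosample\nGCF_1\tSAMN01\n")
    print(biosample_ids(table))
    print(sample_attributes(EXAMPLE_XML))
```

With a real response, something like `sample_attributes(r.text)` should work in the same way, provided the XML layout matches the assumption above.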
Excellent, thanks.
Hello,
First, thanks for your tool; it's really useful and works great.
I have a remark about the metadata file. I don't know if you're aware of this, but when we download several file formats (for example: -F fasta,genbank), only the last downloaded file is written in the local_file column of the metadata. It's a detail, but maybe it would be better to have all files listed? (I don't really know how, maybe one column per file format?)
Also, for my own needs, I wanted information to be written to the metadata file even if the genomes already exist in my output directory. To do that, I made some dirty modifications to the code, but if you're interested in this functionality I can make a proper version and propose a pull request.