dealing with updates to GBIF #23

Open
LevanBokeria opened this issue Oct 31, 2023 · 1 comment
LevanBokeria (Contributor) commented Oct 31, 2023

As mentioned in our meeting, I discovered that some URLs to images get deleted from the GBIF database.

On Baskerville, in the folder /bask/projects/v/vjgo8416-amber/data/gbif_download_standalone/dwca_files/, you will find two dwca files for Sesiidae: one downloaded in August 2023 and an updated one downloaded in October 2023.

Those files are also uploaded here on our shared drive, for those without Baskerville access:

Also attached is a CSV file for one UK species, "Pyropteron chrysidiformis", for which I noticed (by chance) that images are no longer downloaded if I point to the October dwca file instead of the August one. I presume a similar issue may have occurred with other species too.

@KatrionaGoldmann the result can be easily reproduced by using the 03_download_images/fetch_images_whole_dwca_wrapper.ipynb notebook, and changing the dwca_dir argument to point to the folder containing extracted files from either the October or the August Sesiidae dwca file. The results will show that when pointing to the August file we get some images downloaded, but not when pointing to the October file.

This cannot be caused by broken URLs, because the URLs in the August dwca file still work. So either the URL entries themselves have been deleted from the October file, or the whole occurrence records, including their URLs, have been deleted.
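To tell those two cases apart, the extracted August and October folders could be compared directly. The following is only a rough sketch, not part of the existing pipeline: it assumes each extracted folder contains tab-separated `multimedia.txt` and `occurrence.txt` files with a `gbifID` column, and that image URLs sit in the `identifier` column of `multimedia.txt`. The function names are hypothetical.

```python
import csv
from pathlib import Path


def read_media_urls(dwca_dir):
    """Map each image URL in multimedia.txt to the gbifID of its occurrence.

    Assumes a tab-separated file with 'gbifID' and 'identifier' columns.
    """
    urls = {}
    path = Path(dwca_dir) / "multimedia.txt"
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row.get("identifier"):
                urls[row["identifier"]] = row["gbifID"]
    return urls


def read_occurrence_ids(dwca_dir):
    """Return the set of gbifID values present in occurrence.txt."""
    path = Path(dwca_dir) / "occurrence.txt"
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row["gbifID"] for row in reader}


def compare_snapshots(old_dir, new_dir):
    """Classify image URLs that exist in the old snapshot but not the new one."""
    old_urls = read_media_urls(old_dir)
    new_urls = read_media_urls(new_dir)
    new_ids = read_occurrence_ids(new_dir)

    # URLs listed in the old multimedia.txt that the new one no longer lists.
    missing = {url: gid for url, gid in old_urls.items() if url not in new_urls}

    return {
        # The whole occurrence record is gone from the new snapshot.
        "occurrence_deleted": sorted(u for u, g in missing.items() if g not in new_ids),
        # The occurrence record survives, but its image URL was dropped.
        "url_only_deleted": sorted(u for u, g in missing.items() if g in new_ids),
    }
```

Calling `compare_snapshots` with the extracted August folder as `old_dir` and the October folder as `new_dir` would return two sorted lists of URLs, one per hypothesis.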

uksi-moths-keys-nodup-small-Pyropteron-chrysidiformis.csv

@LevanBokeria LevanBokeria converted this from a draft issue Oct 31, 2023
KatrionaGoldmann (Member) commented

Notes from the tech team meeting discussion:

  • Updates to the GBIF database can result in image deletion and/or addition.
  • The download pipeline must therefore be rerun from scratch following every significant update, so that models are built from consistent database snapshots.
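One possible way to tie each pipeline run to a specific snapshot is to record a small manifest next to the downloaded archive. The sketch below is an assumption on my part rather than anything agreed in the meeting: the manifest layout and the `snapshot_manifest` and `needs_rerun` names are hypothetical. A checksum only shows that the archive file changed; it does not say whether the update is significant.

```python
import hashlib
import json
from pathlib import Path


def snapshot_manifest(dwca_path, download_date):
    """Describe one dwca archive by file name, download date and SHA-256."""
    digest = hashlib.sha256()
    with open(dwca_path, "rb") as fh:
        # Read in 1 MiB chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "file": Path(dwca_path).name,
        "download_date": download_date,
        "sha256": digest.hexdigest(),
    }


def needs_rerun(manifest_path, dwca_path, download_date):
    """Return True if the archive differs from the one recorded in the manifest.

    When it differs (or no manifest exists yet), the manifest is overwritten
    so that the next run is compared against the new snapshot.
    """
    current = snapshot_manifest(dwca_path, download_date)
    manifest_path = Path(manifest_path)
    if manifest_path.exists():
        previous = json.loads(manifest_path.read_text(encoding="utf-8"))
        if previous.get("sha256") == current["sha256"]:
            return False
    manifest_path.write_text(json.dumps(current, indent=2), encoding="utf-8")
    return True
```

Storing the manifest alongside the downloaded images and any trained model would make it possible to say later which snapshot a model was built from.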

@KatrionaGoldmann KatrionaGoldmann moved this from 🥚 Todo to 🫛 For discussion or future consideration in AMBER Nov 2, 2023