Cleaning fishbase scinames / territories #2

Open
2 tasks
quir1869 opened this issue Jan 16, 2025 · 1 comment
Labels
🧼 data cleaning wrangle, organize, and clean data

Comments

@quir1869
Collaborator

quir1869 commented Jan 16, 2025

  • Scientific name cleaning

  • There are currently 93 ARTIS scientific names that are not being matched to fishbase or sealifebase. One name in the list, litopenaeus vannamei, is an updated sciname for whiteleg shrimps, whereas fishbase/sealifebase lists whiteleg shrimps as penaeus vannamei. What process does the ARTIS model use to clean fb/slb scientific names, and can it be applied here so that the cleaned names can potentially match the unmatched ARTIS species?

  • EEZ cleaning

  • When using countrycode() to match the reported countries / regions in fishbase/sealifebase to iso3c, there are unmatched territories. It would be great to determine the EEZ regions for these territories, while including the territories as additional rows so that the original data are kept.

! Some values were not matched unambiguously: Adelaide I., Admiralty Is., Alaska, Amsterdam I., Andaman Is., Ascension I., Azores Is., Balleny Is., Bon. Eust. Saba, Br Antarctic Tr, Canary Is., Cargados Carajos, Caroline I., Central Afr. Rp, Chagos Is., Channel Is., Chatham Is., Clipperton I., Crozet Is., Desventuradas Is, Dominican Rp, Easter I., Elephant I., Europa I., French South Tr, Galapagos Is., Glorieuses Is., Hawaii, Heard McDon Is., Jan Mayen I., Johnston I., Juan de Nova I., Juan Fernández, Kerguelen Is., Kermadec Is., Kosovo, Kuril Is., Lord Howe I., Macquarie Is., Madeira Is., Marquesas Is., Micronesia, Midway Is., Neth Antilles, Ogasawara Is., Pac Is Trust Tr, Peter I I., Prince Edward Is, Revillagigedo A., Rodriguez I., Ryukyu Is., S. Georg. Sandw., Scott I., Socotra Arch., South Orkney Is., South Shetland, St Martin (FR), St Paul's Rocks, St Paul I., St Pierre Mique., Terre Adélie, Trind. M.Vaz Is., Tristan da Cunha, Tuamotu Is., UK Engld Wal, UK No Ireld, UK Scotland, Wake I., West Sahara

unmatched_scinames.txt

@quir1869 quir1869 added the 🧼 data cleaning wrangle, organize, and clean data label Jan 16, 2025
@theamarks
Member

theamarks commented Jan 17, 2025

@quir1869 Nice repo issue! I was wondering if my GitHub notifications were set up properly, and they are! I'll respond to your bullets here.


> There are currently 93 ARTIS scientific names that are not being matched to fishbase or sealifebase. One name in the list, litopenaeus vannamei, is an updated sciname for whiteleg shrimps, whereas fishbase/sealifebase lists whiteleg shrimps as penaeus vannamei. What process does the ARTIS model use to clean fb/slb scientific names, and can it be applied here so that the cleaned names can potentially match the unmatched ARTIS species?

A bit of background/orientation: the first step in running ARTIS (besides setting up the run environment) is to clean the raw data inputs. If you take a look at the project root, you will see a sequence of numbered .R scripts. When we run the full model, they run in order. 01-clean-input-data.R (see script here) is where the scientific name matching to fishbase and sealifebase happens, along with manual fixes for taxa that remain unmatched. 01-clean-input-data.R writes out several files to the ./model_inputs/ directory, which are then used in the later numbered scripts that contain the ARTIS model proper.

Couple of Notes:

Thoughts on your specific question

The SAU data I shared with you a while back may have been run with an older snapshot of fishbase and sealifebase, possibly 2024_07_11, which may introduce some weirdness. In this case the production data reports litopenaeus vannamei; according to WoRMS this is actually an "unaccepted" name (record page here) and penaeus vannamei is the accepted name (record page here).

I don't have an immediate fix for this, but I'll keep an eye out to see if this species also shows up in my unmatched and no-synonym list. You could also try the rfishbase::synonyms() function to check whether there are synonyms, but I suppose the production data you have was already run through that when it was created.

I was playing around with the worrms package yesterday to pull the AphiaID (the unique persistent identifiers we talked about) and the full classification info. This might be of some interest to you. It can also pull accepted names if we are dealing with an outdated name.
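Whichever source is used (rfishbase::synonyms() or a WoRMS lookup), the end product is a table that maps outdated names to accepted ones. Below is a minimal base R sketch of applying such a table to a list of unmatched names. The table contents, the column names (`synonym`, `accepted`), and the helper `resolve_names()` are hypothetical and only illustrate the idea; this is not how 01-clean-input-data.R does it.

```r
# Hypothetical synonym table. In practice this would be built from
# rfishbase::synonyms() or a WoRMS lookup rather than typed by hand.
synonym_table <- data.frame(
  synonym  = c("litopenaeus vannamei"),
  accepted = c("penaeus vannamei"),
  stringsAsFactors = FALSE
)

# Names that failed to match (illustrative values only).
unmatched <- c("litopenaeus vannamei", "some unresolved name")

# Replace each name with its accepted form when the table has one;
# otherwise keep the (lowercased, trimmed) name unchanged.
resolve_names <- function(names, table) {
  key <- tolower(trimws(names))
  idx <- match(key, tolower(table$synonym))
  data.frame(
    original    = names,
    resolved    = ifelse(is.na(idx), key, table$accepted[idx]),
    was_synonym = !is.na(idx),
    stringsAsFactors = FALSE
  )
}

resolved <- resolve_names(unmatched, synonym_table)
print(resolved)
```

Keeping the `original` column alongside `resolved` makes it easy to audit which names were changed before re-attempting the fishbase/sealifebase match.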

> When using countrycode() to match the reported countries / regions in fishbase/sealifebase to iso3c, there are unmatched territories. It would be great to determine the EEZ regions for these territories, while including the territories as additional rows so that the original data are kept.

I think this is an issue with territories. I don't know at what level fishbase/sealifebase reports territories. They may aggregate territories to the "sovereign" country level, leaving the territories in your production data unmatched. There is a part of the ARTIS model that deals with assigning territories, and honestly I haven't ventured into that corner yet, so I can't speak to it in much detail. There is, however, an open issue to create a model option to not aggregate territories to the country level: Seafood-Globalization-Lab/artis-model#14
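In the meantime, one possible stopgap is a hand-built lookup from the unmatched territory labels to the ISO3 code of the sovereign country, stored in a new column so the original territory label is preserved. The sketch below uses base R only; the data frame, its column names, and the made-up label "Atlantis" are assumptions for illustration, and the lookup covers just a handful of the labels from the warning above. countrycode() also accepts a named vector like this through its `custom_match` argument.

```r
# Illustrative records; the column names are assumptions, and
# "Atlantis" is a made-up label standing in for an unresolved value.
records <- data.frame(
  sciname   = c("species a", "species b", "species c"),
  territory = c("Hawaii", "Canary Is.", "Atlantis"),
  stringsAsFactors = FALSE
)

# Hand-built lookup: territory label -> ISO3 code of the sovereign country.
# Only a few labels from the countrycode() warning are filled in here.
territory_to_iso3c <- c(
  "Alaska"        = "USA",
  "Hawaii"        = "USA",
  "Azores Is."    = "PRT",
  "Madeira Is."   = "PRT",
  "Canary Is."    = "ESP",
  "Galapagos Is." = "ECU",
  "Easter I."     = "CHL"
)

# Add the sovereign code as a new column; the territory column is untouched.
records$sovereign_iso3c <- unname(territory_to_iso3c[records$territory])

# Labels that still need a manual decision.
still_unmatched <- unique(records$territory[is.na(records$sovereign_iso3c)])
print(records)
print(still_unmatched)
```

Whether a territory should roll up to its sovereign country or stay separate is a modeling decision, so a lookup like this is best kept in one reviewable place.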

But honestly, I think this is probably mostly stemming from the example SAU data you are working with. Once we have our next full model run, 🤞 hopefully you will have a cleaner, clearer dataset to work with. However, messy data is a great way to learn some data wrangling 🤠.

@theamarks theamarks removed their assignment Jan 17, 2025