Ingest FAO 2024 data #15

Open
theamarks opened this issue Aug 23, 2024 · 7 comments
@theamarks
Member

theamarks commented Aug 23, 2024

Species appeared to be missing from the data test run done during the close-out meeting with Rahul. However, that run appears to have used the 2023 release, not the 2024 release.

Note: FishBase and SeaLifeBase have been updated (e.g., some scientific names and common names), so we will want these updates integrated as part of the new run. -- Done

@theamarks theamarks added 🧼 data cleaning wrangle, organize, and clean data 🪄 enhancement New functionality or feature request labels Aug 23, 2024
@theamarks theamarks changed the title Integration of FAO 2024 data #46 Integration of FAO 2024 data Aug 23, 2024
@theamarks
Member Author

theamarks commented Nov 21, 2024

Most of the relevant code is in the model's 01-clean-input-data.R script. Fairly sure the new HS code list comes from the Comtrade API subscription.

To-do

  • remove hard coded filename in classify_prod_dat()
    filename = "GlobalProduction_2022.1.1.zip", # "GlobalProduction_2023.1.1.zip"
  • Add variable assignment and conditional chunk to determine which rebuild_fao_[yyyy]() function version to run. (There is no scenario where we would need to run a release model with a different FAO year function.)
  • write new function rebuild_fao_2024() 06cd35d
    • add arrange() to DSD and time series to ensure consistent column order (not the same for 2024 version). (Not needed: the order is relative to the current working data version and is taken care of with select(all_of(ds$Concept_id)).)
      # IMPORTANT: row ORDER (ABCDEF) in DSD file should match columns ABCDEF in time series for looping to work below
    • tidy formatting - section titles, concise indents, add comments 299790f
    • replace all read.csv() and write.csv() with fread() and fwrite()
    • append "path" to the names of objects that contain file path text
    • replace paste() with paste0() where appropriate (no changes needed)
    • Test that rebuild_fao_2024.R is what we expect (no data manipulation changes made)
# Run the function for both versions
test_2024 <- rebuild_fao_2024_dat(datadir = datadir, filename = filename)
test_2022 <- rebuild_fao_2024_dat(datadir = datadir, filename = filename)
identical(test_2022, test_2024)
[1] TRUE
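
One way to remove the hard-coded filename is to build it from a single release tag. The following is a minimal sketch, assuming the FAO archive keeps the GlobalProduction_<release>.zip naming seen in the to-do item above. The names fao_release and fao_prod_filename() are hypothetical and are not part of the model code.

```r
# Hypothetical helper: build the FAO production archive name from a release tag.
# Assumes the "GlobalProduction_<release>.zip" pattern from the existing code.
fao_prod_filename <- function(fao_release) {
  stopifnot(is.character(fao_release), length(fao_release) == 1)
  # Release tags in the existing code look like "2022.1.1"
  if (!grepl("^[0-9]{4}\\.[0-9]+\\.[0-9]+$", fao_release)) {
    stop("Unexpected release tag: ", fao_release)
  }
  paste0("GlobalProduction_", fao_release, ".zip")
}

# Set once near the top of the cleaning script, then pass `filename` along.
fao_release <- "2023.1.1"
filename <- fao_prod_filename(fao_release)
```

Keeping the tag in one variable means a new release only requires changing a single line, and a malformed tag fails early instead of producing a missing-file error later.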

@theamarks theamarks added this to the ARTIS v2.0 milestone Nov 22, 2024
@theamarks theamarks moved this from 🏷 Ready to 🏗 In Progress in ARTIS Maintence & Analysis Nov 22, 2024
@theamarks theamarks changed the title Integration of FAO 2024 data Ingest FAO 2024 data Nov 25, 2024
@theamarks
Member Author

theamarks commented Dec 16, 2024

2024-12-16 Update & Questions

@theamarks
Member Author

theamarks commented Dec 18, 2024

2024-12-18 Comments & Questions

  • @jagephart Do we keep the old fixes and continue to append new ones to this list of manual fixes? Or do we run without the manual-fix part of the pipe and start a fresh "fix" list from running the downstream SeaLifeBase and FishBase code? This would perhaps allow for SLB/FB package fixes. Not sure what the protocol is.
    # List of fixes comes from finding SciNames that do not match to either the fishbase classification database or fishbase synonyms function in downstream code
  • Add an argument to classify_prod_dat() (e.g., new_data = TRUE or get_missing_sciname = TRUE) to create a dataframe of SciNames with missing classifications that need to be researched and manually corrected in the lines above. Thoughts? Probably a good idea for explicitly documenting data ingestion steps (dev_mode = FALSE).
    # IMPORTANT: in the process, find SciNames with no classification info and replace synonyms with accepted names in prod_ts
  • Check whether the explicit SLB/FB file paths are still appropriate given the conditional "get new SLB version" code in 01-clean-input-data.R; may need to feed in a flexible object holding the most recent SLB version. Looks good: the most recent SLB file path object is the input for the classify_prod_dat(fb_slb_dir = current_fb_slb_dir) argument.
  fishbase <- read.csv(file.path(fb_slb_dir, "fb_taxa_info.csv"))
  sealifebase <- read.csv(file.path(fb_slb_dir, "slb_taxa_info.csv"))
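
The proposed development-mode switch could look roughly like the sketch below. It is only an illustration under stated assumptions: find_missing_scinames() is a hypothetical helper (not an existing ARTIS function), and the name vectors stand in for whatever the FishBase, SeaLifeBase, and synonym lookups return.

```r
# Hypothetical helper: list production SciNames with no classification match.
# fb_names, slb_names, and synonym_names are assumed to be character vectors of
# names already known to FishBase, SeaLifeBase, and the synonym lookup.
find_missing_scinames <- function(prod_names, fb_names, slb_names,
                                  synonym_names = character(0),
                                  dev_mode = FALSE) {
  # Release runs skip the report entirely
  if (!dev_mode) return(invisible(NULL))

  clean <- function(x) tolower(trimws(x))
  known <- unique(clean(c(fb_names, slb_names, synonym_names)))
  prod_unique <- unique(prod_names)

  # Keep only names that match none of the known sources
  missing <- prod_unique[!(clean(prod_unique) %in% known)]
  data.frame(SciName = sort(missing), stringsAsFactors = FALSE)
}

# Toy input; "Fakeus speciesus" is a made-up name with no match
missing_df <- find_missing_scinames(
  prod_names = c("Thunnus albacares", "Gadus morhua", "Fakeus speciesus"),
  fb_names   = c("thunnus albacares", "gadus morhua"),
  slb_names  = c("homarus americanus"),
  dev_mode   = TRUE
)
```

The returned dataframe is the list of names that would need to be researched and added to the manual fixes.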

@theamarks
Member Author

theamarks commented Jan 2, 2025

2025-01-02 Questions

  • Is it correct that we expect "many-to-many" joins here? Would be good to make explicit in code with relationship = "many-to-many" arguments. Fewer warnings mean smoother runs.

    inner_join((fishbase %>% select(-Species)),

    One row of production should match only one row in FishBase; otherwise, it seems like there are more matches than there should be. Take the fishbase dataframe and compare the number of rows with the number of species, or do a group by and tally to see what shows up more than once. The only case where a name would match more than once is if the same species name shows up for multiple habitat types, which is correct, but then we need to make sure the join is on habitat too. Each species can join onto multiple rows in production, but each row of production should join to only one species name.

  • Why are order, class, and superclass not coded into the prod dataframe like species, genus, and family? Is this because text patterns are harder to identify and code? Match to Fishbase columns is more reliable?

    prod_fb_order <- prod_taxa_names %>%

    Could take another approach, such as using only the Other01 column and matching it to the appropriate taxa-level column in FishBase. Could not remember why it was developed like this. No particular problem here, just clarifying.

  • Do we intentionally want to prefer fishbase taxonomic information over sealifebase? Generation of "no match" names indicates this.

    nomatch_fb <- prod_taxa_names$SciName[prod_taxa_names$SciName %in% prod_fb_full$SciName==FALSE] # 613

    We were both unsure of the overlap between FishBase and SeaLifeBase. According to the documentation, FishBase covers finfish and SeaLifeBase covers other marine species. However, because of taxonomic gray areas, there may be overlap. Checked: there is no species overlap between fb_taxa_info.csv and slb_taxa_info.csv from the 20240722 download:

    > intersect(prod_fb_full$SciName, prod_slb_full$SciName)
    character(0)
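
The duplicate check described for the join can be sketched in base R as follows. The toy fishbase table and its Habitat column are assumptions for illustration, and key_counts() is a hypothetical helper. In dplyr 1.1.0 or later, passing relationship = "many-to-one" to inner_join() would enforce the same expectation by raising an error when a production row matches more than one taxa row.

```r
# Toy stand-in for the fishbase taxa table; column names are assumptions.
fishbase <- data.frame(
  SciName = c("gadus morhua", "salmo salar", "salmo salar"),
  Habitat = c("marine", "marine", "inland"),
  stringsAsFactors = FALSE
)

# Count rows per candidate join key. Any key with n > 1 would multiply
# production rows in a join on that key.
key_counts <- function(df, keys) {
  counts <- aggregate(data.frame(n = rep(1L, nrow(df))),
                      by = df[keys], FUN = sum)
  counts[counts$n > 1, , drop = FALSE]
}

# Joining on SciName alone is ambiguous for names listed under several habitats
dups_by_name <- key_counts(fishbase, "SciName")

# Adding Habitat to the key removes the ambiguity in this toy table
dups_by_name_habitat <- key_counts(fishbase, c("SciName", "Habitat"))
```

If the first result has rows and the second does not, the extra matches come from habitat, and the join key should include it.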

Warning

The SpecCode field serves as a unique identifier for species within each database—FishBase and SeaLifeBase.
However, these codes are not standardized across both databases. This means that the same SpecCode value in FishBase could correspond to a different species in SeaLifeBase, and vice versa. Therefore, when integrating data from both sources, it's essential to use additional identifiers, such as scientific names, to accurately match species across the two databases.
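
A small sketch of the point above: when the two taxa tables are stacked, tagging each row with its source keeps SpecCode unambiguous. The SpecCode values below are made up for illustration, and the source and taxa_id columns are hypothetical additions, not existing ARTIS columns.

```r
# Toy tables; SpecCode values are invented, and 69 deliberately collides.
fb <- data.frame(
  SpecCode = c(69L, 143L),
  SciName  = c("gadus morhua", "thunnus albacares"),
  stringsAsFactors = FALSE
)
slb <- data.frame(
  SpecCode = c(69L, 500L),
  SciName  = c("homarus americanus", "octopus vulgaris"),
  stringsAsFactors = FALSE
)

# Record where each row came from before stacking the tables
fb$source  <- "fishbase"
slb$source <- "sealifebase"
taxa <- rbind(fb, slb)

# SpecCode alone is not unique across sources, but (source, SpecCode) is
taxa$taxa_id <- paste(taxa$source, taxa$SpecCode, sep = "_")
speccode_collides <- anyDuplicated(taxa$SpecCode) > 0
taxa_id_unique    <- anyDuplicated(taxa$taxa_id) == 0
```

Matching production data on SciName, and carrying source alongside SpecCode, avoids attaching attributes from the wrong database.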

@theamarks
Member Author

theamarks commented Jan 15, 2025

2025-01-15 Questions / Notes

  • Thoughts on switching current fishbase & sealifebase local versioning protocol to one that uses stable "snapshot" captured by rfishbase? Brought up in ARTIS discussion 👉 Taxonomic data - versioning, sources, and classification structure #44

  • I emailed the WoRMS data manager - asked if there was an avenue in the API to query historical versions of WoRMS.

    Database versions - in the strict sense - cannot be queried.
    We have over 200k edits/year, so it’s a very dynamic system.
    We do have monthly snapshots/dumps.

    Need to think about whether ARTIS taxa matching would be OK using the dynamic WoRMS database through the API or worrms::. Maintaining a local copy of WoRMS sounds like a big lift, especially if we transition away from maintaining FishBase and SeaLifeBase locally. Ideally, the worrms:: package would develop a strategy similar to rfishbase's. Both are developed and maintained by the rOpenSci group. Could float the idea in a GitHub repo issue.

submitted request for quarterly download of WoRMS database 2025_01_16

@theamarks
Member Author

theamarks commented Jan 29, 2025

2025-01-29 Questions

  • @jagephart Where does this All_HS_Codes.csv file come from? I can't find any code in the entire model repo that looks like it creates this file, only code that reads it in. clean_hs() manually cleans up the All_HS_Codes$description text and outputs the object hs_data_clean. I see in the raw_baci/ that each HS version directory has a file product_codes_[HS_version]_[baci_version].csv with code and description columns. My guess is that All_HS_Codes.csv compiles these and the script is lost somewhere? Does this file need updating with each new BACI version release?

  • Do we know why standardize_prod.R and standardize_baci.R both exist? Both are inside standardize_countries.R and have nearly identical country cleaning. Do we want the country cleaning to be the same across FAO and BACI data? The differences reported below come from using ChatGPT to analyze the scripts.

The country standardization should be the same across the board. This should be pulled out of the code into a table so we can make this more available and transparent to users.

Differences Between `standardize_prod()` and `standardize_baci()`

1. Variable Naming

  • standardize_prod(): Uses artis_iso3c and artis_country_name
  • standardize_baci(): Uses artis_iso3 and artis_country_name

2. Handling of Country Codes

  • Both: Standardize territories (U.S., U.K., France, China, Netherlands, etc.).
  • Differences:
    • baci(): Adds mappings for Taiwan (Other Asia, nes → TWN).
    • baci(): Groups Luxembourg with Belgium (LUX → BEL).
    • prod(): Handles Southern African Customs Union (pre-2000 adjustments).
    • prod(): Handles independence transitions (Timor-Leste, Sudan, Serbia & Montenegro).

3. Filtering of Certain Countries

  • prod(): Removes historical countries (CSK, SUN, YUG).
  • baci(): No such filtering.

4. Handling "Other nei"

  • prod(): Assigns NEI to "Other nei" in col_country_name.
  • baci(): Also maps SMR, AND, and "US Misc. Pacific Isds" to NEI.

5. Country Name Standardization

  • Both: Use countrycode() to map ISO3 to country names.
  • Differences:
    • prod(): Special handling for Sudan (SDN, pre-2012) and South African Customs Union.
    • baci(): Handles Serbia and Montenegro (SCG).

Recommendations for Streamlining

  1. Extract Shared Logic: Most case_when() statements are identical. Create a table to contain classifications
  2. Unify Column Naming: Standardize on artis_iso3 or artis_iso3c.

Proposed Countries Classification Table Structure

| raw_iso3c | raw_country_name | artis_iso3c | artis_country_name | grouping_category | start_year | end_year | notes |
|---|---|---|---|---|---|---|---|
| ASM | American Samoa | USA | United States | U.S. Territory | 1900 | Present | Maps U.S. territories to USA |
| GUM | Guam | USA | United States | U.S. Territory | 1900 | Present | |
| MNP | Northern Mariana Islands | USA | United States | U.S. Territory | 1900 | Present | |
| SRB | Serbia | SCG | Serbia and Montenegro | Former Yugoslavia | 1992 | 2006 | Before 2006, reported as SCG |
| MNE | Montenegro | SCG | Serbia and Montenegro | Former Yugoslavia | 1992 | 2006 | Before 2006, reported as SCG |
| MNE | Montenegro | MNE | Montenegro | Independent Country | 2006 | Present | Independent from 2006 onward |
| TLS | Timor-Leste | IDN | Indonesia | Pre-Independence | 1900 | 2002 | Gained independence from Indonesia in 2002 |
| TLS | Timor-Leste | TLS | Timor-Leste | Independent Country | 2002 | Present | |
| SSD | South Sudan | SDN | Sudan | Pre-Independence | 1900 | 2011 | Gained independence from Sudan in 2011 |
| SSD | South Sudan | SSD | South Sudan | Independent Country | 2011 | Present | |
| BWA | Botswana | ZAF | So. African Customs Union | Trade Reporting | 1900 | 2000 | Before 2000, reported under SACU |
| BWA | Botswana | BWA | Botswana | Independent Trade Reporting | 2000 | Present | |
| AIA | Anguilla | GBR | Anguilla | U.K. Territory | 1900 | Present | |
| BMU | Bermuda | GBR | Bermuda | U.K. Territory | 1900 | Present | |
| IOT | British Indian Ocean Territory | GBR | British Indian Ocean Territory | U.K. Territory | 1900 | Present | |
| VGB | British Virgin Islands | GBR | British Virgin Islands | U.K. Territory | 1900 | Present | |
| CYM | Cayman Islands | GBR | Cayman Islands | U.K. Territory | 1900 | Present | |
| GIB | Gibraltar | GBR | Gibraltar | U.K. Territory | 1900 | Present | |
| PCN | Pitcairn Islands | GBR | Pitcairn Islands | U.K. Territory | 1900 | Present | |
| SHN | St Helena | GBR | St Helena | U.K. Territory | 1900 | Present | |
| TCA | Turks and Caicos | GBR | Turks and Caicos | U.K. Territory | 1900 | Present | |
| FLK | Falkland Islands | GBR | Falkland Islands | U.K. Territory | 1900 | Present | |
| IMN | Isle of Man | GBR | Isle of Man | U.K. Territory | 1900 | Present | |
| PYF | French Polynesia | FRA | French Polynesia | France Territory | 1900 | Present | |
| MYT | Mayotte | FRA | Mayotte | France Territory | 1900 | Present | |
| NCL | New Caledonia | FRA | New Caledonia | France Territory | 1900 | Present | |
| SPM | St. Pierre & Miquelon | FRA | St. Pierre & Miquelon | France Territory | 1900 | Present | |
| WLF | Wallis and Futuna | FRA | Wallis and Futuna | France Territory | 1900 | Present | |
| HKG | Hong Kong | CHN | Hong Kong | China Territory | 1900 | Present | |
| MAC | Macao | CHN | Macao | China Territory | 1900 | Present | |
| ABW | Aruba | NLD | Aruba | Netherlands Territory | 1900 | Present | |
| COK | Cook Islands | NZL | Cook Islands | New Zealand Territory | 1900 | Present | |
| NIU | Niue | NZL | Niue | New Zealand Territory | 1900 | Present | |
| NFK | Norfolk Island | AUS | Norfolk Island | Australia Territory | 1900 | Present | |
| GRL | Greenland | DNK | Greenland | Denmark Territory | 1900 | Present | |
| FRO | Faroe Islands | DNK | Faroe Islands | Denmark Territory | 1900 | Present | |
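
A table like this could replace the duplicated case_when() blocks with a single year-aware lookup. The sketch below uses a few rows from the proposed table and makes two assumptions that the table itself does not settle: "Present" is stored as Inf, and each rule covers the half-open interval from start_year up to but not including end_year, so that a transition year such as 2002 for Timor-Leste matches only one rule. The function name standardize_iso3c() is hypothetical.

```r
# Subset of the proposed classification table ("Present" stored as Inf).
country_map <- data.frame(
  raw_iso3c   = c("ASM", "TLS", "TLS", "SSD", "SSD"),
  artis_iso3c = c("USA", "IDN", "TLS", "SDN", "SSD"),
  start_year  = c(1900, 1900, 2002, 1900, 2011),
  end_year    = c(Inf, 2002, Inf, 2011, Inf),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: map a raw ISO3 code and year to the ARTIS code.
# Assumes rules apply for start_year <= year < end_year.
standardize_iso3c <- function(iso3c, year, map = country_map) {
  mapply(function(code, yr) {
    hit <- map$raw_iso3c == code & map$start_year <= yr & yr < map$end_year
    if (sum(hit) > 1) stop("Overlapping rules for ", code, " in ", yr)
    # Codes without a rule pass through unchanged
    if (any(hit)) map$artis_iso3c[hit] else code
  }, iso3c, year, USE.NAMES = FALSE)
}

artis_codes <- standardize_iso3c(
  iso3c = c("TLS", "TLS", "ASM", "NOR"),
  year  = c(1999, 2005, 2010, 2010)
)
```

Because the same table would feed both the FAO and BACI cleaning, differences between the two would have to be expressed as explicit rows rather than diverging code.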

@theamarks
Member Author

theamarks commented Jan 31, 2025

2025-01-31 Questions/Notes - Manual fixes vs. WoRMS matches

  • Big question: If we transition to WoRMS classification in the future, or even now to patch in missing sciname values, will this make it more difficult to use/join FishBase attribute data for analysis? Do we want to prioritize this over a possibly "more accurate" taxonomic table? For example, WoRMS and FishBase have different classification schemas for Osteichthyes. This would affect how the genus Thunnus matches in FishBase and how Actinopterygii matches using WoRMS.

    My plan (short of switching to a parent-child rankless classification system informed by WoRMS) was to expand the taxa table to include all ranks and fill in missing taxa by querying the WoRMS API. However, this seems like it would introduce several divergent classification schemas for similar taxa. Need to think about how to handle this.

  • WoRMS API query loop via worrms package now checks matched record status, and points to "accepted" record.

  • Hybrid scinames are not matching in WoRMS - manual corrections still required.

  • Check through all existing manual edits to understand if they are still required. Search prod_ts$SciName, fb/slb, and synonym files.

  • Notes on missing scinames matched to WoRMS are in the artis-model/AM_local/outputs/nomatch_nosynonym_manfixes_worms_compare.csv file or on Google Drive
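
The "check status, then point to the accepted record" step can be sketched offline as below. This is not the actual query loop: the API call is omitted, resolve_accepted() is a hypothetical helper, and the toy records only mimic the status, valid_AphiaID, and valid_name fields that WoRMS records carry. All names and IDs in the example are made up.

```r
# Toy records shaped like WoRMS Aphia records; values are invented.
records <- data.frame(
  AphiaID        = c(101L, 102L, 103L),
  scientificname = c("Aus bus", "Aus oldus", "Cus dus"),
  status         = c("accepted", "unaccepted", "accepted"),
  valid_AphiaID  = c(101L, 101L, 103L),
  valid_name     = c("Aus bus", "Aus bus", "Cus dus"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the accepted name for a queried sciname.
resolve_accepted <- function(sciname, records) {
  hit <- records[records$scientificname == sciname, , drop = FALSE]
  # No match at all: leave for manual correction (e.g., hybrid names)
  if (nrow(hit) == 0) return(NA_character_)
  accepted <- hit[hit$status == "accepted", , drop = FALSE]
  # Prefer a record that is itself accepted
  if (nrow(accepted) > 0) return(accepted$scientificname[1])
  # Otherwise follow the pointer to the accepted record
  hit$valid_name[1]
}

accepted_names <- vapply(c("Aus bus", "Aus oldus", "Zus zus"),
                         resolve_accepted, character(1), records = records)
```

Names that come back as NA are the ones that would still need a manual fix.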
