lucy-schick committed Jul 11, 2024
1 parent 3bb4f4b commit 4b1dc06
Showing 6 changed files with 25,782 additions and 0 deletions.
89 changes: 89 additions & 0 deletions R/cdc.Rmd
@@ -0,0 +1,89 @@


1. Determine which columns in `data-raw/cdc/cdc.csv` are needed to tie Species Code to Element Code. Make a new csv called `xref_sp_element_codes.csv` (or something similar) and burn just those columns to it.

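```{r select-columns}
## read the existing cdc.csv (path from step 1) so that `cdc` is defined
cdc <- readr::read_csv("data-raw/cdc/cdc.csv")
## select species codes and element codes from cdc.csv
xref_sp_element_codes <- cdc |>
  dplyr::select("Species Code", "Element Code")
## burn to csv
readr::write_csv(xref_sp_element_codes, "data-raw/cdc/xref_sp_element_codes.csv")
```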

2. Download both exports from the cdc website into data-raw. Keep their names as is if they are descriptive (can't remember).

3. In a new `data-raw/cdc.R` file read in the xl files with `readxl` (I think you need to open and close them first, or some weird thing, to avoid an error; link to the help url in your .R file if you run into it) and export both of those as csvs with their original exported names in data-raw.

```{r import-data}
## Read in results from cdc website
results_raw <- readr::read_csv("data-raw/cdc/resultsExport.csv")
## Read in conservation status info from cdc website
constat_raw <- readr::read_csv("data-raw/cdc/ConsStatusRptExport.csv")
```


4. In `data-raw/cdc.R` join both exports from the cdc website together by Element Code, excluding any duplicated columns, then join to `xref_sp_element_codes`. If there are missing Species Code entries in any rows we have new codes to find (don't know how yet), and if there are fewer unique Species Code values than in `cdc` before, we lost some. Can document that in data

```{r join-data}
## join the two data frames on the columns they share
cdc_prep1 <- dplyr::left_join(results_raw, constat_raw,
                              by = c("Element Code",
                                     "Scientific Name",
                                     "English Name")) |>
  dplyr::select(-Provincial) ## remove duplicated column
```
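
The two checks described in step 4 can be sketched on toy data as follows. This is only a minimal sketch, not the project's code: the tibbles `cdc_old`, `joined_raw` and `xref`, and every value in them, are made up for illustration. Only the column names `Species Code` and `Element Code` come from the notes above.

```r
## hypothetical stand-in for the existing cdc.csv (values are made up)
cdc_old <- tibble::tibble(
  `Species Code` = c("BT", "CO", "RB"),
  `Element Code` = c("E1", "E2", "E3")
)
## hypothetical stand-in for the joined website exports: E3 is gone, E4 is new
joined_raw <- tibble::tibble(
  `Element Code` = c("E1", "E2", "E4"),
  `English Name` = c("Bull Trout", "Coho Salmon", "New Fish")
)
xref <- dplyr::select(cdc_old, "Species Code", "Element Code")

joined <- dplyr::left_join(joined_raw, xref, by = "Element Code")

## rows with no Species Code: new codes we still need to find
new_codes <- dplyr::filter(joined, is.na(`Species Code`))

## Species Codes present before but absent after the join: codes we lost
lost_codes <- base::setdiff(unique(cdc_old$`Species Code`),
                            unique(joined$`Species Code`))
```

Here `new_codes` holds the rows that need a Species Code assigned and `lost_codes` lists the codes that dropped out, which is the information the notes suggest documenting.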

## Some issues:
- We need to remove all columns that are not present in cdc.csv
- Then we need to rename all columns to match those in cdc.csv
- We need to separate the Global review date in parentheses from the global ranking
```{r}
## let's compare column names to see what we need to remove
dplyr::setdiff(names(cdc_prep1), names(cdc))
dplyr::setdiff(names(cdc), names(cdc_prep1))
## Dates were not being extracted because "Global Status" in quotes is a
## literal string inside mutate(); backticks refer to the column instead.
cdc_prep2 <- cdc_prep1 |>
  ## rename the columns to match those in cdc.csv
  dplyr::rename("Prov Status" = "Provincial Status",
                "Prov Status Review Date" = "Date Status Last Reviewed",
                "Global Status" = "Global") |>
  ## separate the Global review date in parentheses from the global ranking
  dplyr::mutate(
    ## str_extract() returns NA when there is no match
    "Global Status Review Date" = stringr::str_extract(`Global Status`, "\\(\\d{4}\\)"),
    "Global Status" = stringr::str_replace(`Global Status`, "\\s*\\(\\d{4}\\)", "")
  ) |>
  dplyr::relocate("Global Status Review Date", .after = "Global Status") |>
  ## remove all columns that are not present in cdc.csv
  dplyr::select(dplyr::any_of(names(cdc)))
```
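
Why the dates were not being extracted can be seen on toy data: inside `mutate()`, a double-quoted `"Global Status"` is a literal string, while backticks refer to the column. The following is a minimal sketch, assuming the Global values look like `G5 (2016)`. That format, the tibble `toy` and the helper column `quoted` are assumptions for illustration, not taken from the export.

```r
## made-up values; the real export format is an assumption here
toy <- tibble::tibble(`Global Status` = c("G5 (2016)", "G4", "G3G4 (2011)"))

out <- toy |>
  dplyr::mutate(
    ## quoted: the pattern is matched against the text "Global Status" itself
    quoted = stringr::str_extract("Global Status", "\\(\\d{4}\\)"),
    ## backticks: the pattern is matched against the column values
    `Global Status Review Date` = stringr::str_extract(`Global Status`, "\\(\\d{4}\\)"),
    ## drop the parenthesised year from the rank itself
    `Global Status` = stringr::str_replace(`Global Status`, "\\s*\\(\\d{4}\\)", "")
  )
```

The `quoted` column comes back all `NA`, while the backticked version keeps the year (still in parentheses) and leaves `NA` where no year was present.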


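```{r}
## join the species codes back on, then keep only the columns in cdc.csv
cdc_updated <- dplyr::left_join(cdc_prep2, xref_sp_element_codes,
                                by = "Element Code") |>
  dplyr::select(dplyr::all_of(names(cdc)))

## compare the old data to the updated data
waldo::compare(cdc, cdc_updated)
```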
5. Burn over `data-raw/cdc/cdc.csv`.
6. Run through `fishbc/data-raw/data-raw.R` and run `usethis::use_data(cdc, overwrite = TRUE)`.
7. Build the repo locally (just like fpr).

Worst that can happen is we redo using a new branch.