getDrugIngredientCodes and non UTF-8 characters #233

tleht · 2024-12-12T12:12:58Z

Describe the bug
Calling getDrugIngredientCodes with the "name" argument specified returns the following error:

Error in `dplyr::filter()`:
ℹ In argument: `tidyWords(.data$concept_name) %in% tidyWords(.env$name)`.
Caused by error in `sub()`:
! input string 32262 is invalid UTF-8

The error comes from the following lines:
    ingredientConcepts <- cdm$concept %>%
        dplyr::filter(.data$standard_concept == "S") %>%
        dplyr::filter(.data$concept_class_id == "Ingredient") %>%
        dplyr::select("concept_id", "concept_name", "concept_code") %>%
        dplyr::collect()
    if (!is.null(name)) {
        ingredientConcepts <- ingredientConcepts %>%
            dplyr::filter(tidyWords(.data$concept_name) %in% tidyWords(.env$name))
    }

The error is caused by the standard RxNorm Extension drug ingredient concept 1253507 "[ ¹⁸ F]AlF-NOTA-FAPI-04" present in our concept-table.
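
One way to locate such rows before they reach tidyWords is to flag concept names that contain anything outside ASCII. Below is a minimal base R sketch using a toy data frame in place of the collected concept table. The helper name hasNonAscii and the second row (including its id) are made up for illustration.

```r
# Toy stand-in for the collected concept table (the second row is a placeholder)
concepts <- data.frame(
  concept_id = c(1253507L, 1L),
  concept_name = c("[ \u00b9\u2078 F]AlF-NOTA-FAPI-04", "adalimumab"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when a string has any code point above 127,
# or when it cannot be decoded as UTF-8 at all (utf8ToInt() then returns NA)
hasNonAscii <- function(x) {
  vapply(x, function(s) {
    codePoints <- utf8ToInt(s)
    anyNA(codePoints) || any(codePoints > 127L)
  }, logical(1), USE.NAMES = FALSE)
}

# Rows whose names would need attention
concepts[hasNonAscii(concepts$concept_name), ]
```

On the toy data this keeps only the row for concept 1253507.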

To Reproduce
getDrugIngredientCodes(cdm = cdm, name = "Adalimumab")


tleht · 2024-12-12T12:49:47Z

To be more exact, this seems to be an issue with the helper function tidyWords:

> tidyWords("[ ¹⁸ F]AlF-NOTA-FAPI-04")
Error in sub(re, "", x, perl = TRUE) : input string 1 is invalid UTF-8
In addition: Warning message:
In sub(re, "", x, perl = TRUE) :
  unable to translate '[ Â¹â<81>¸ F]AlF-NOTA-FAPI-04' to UTF-8

More specifically the following lines:

    Encoding(words) <- "latin1"
    
    # some generic formatting
    workingWords <- trimws(words)
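
For context on what that assignment does: `Encoding<-` only changes the declared encoding of a string, not its bytes. The base R sketch below shows this for the two superscript characters. It does not reproduce the error itself, and whether the later translation back to UTF-8 succeeds appears to depend on the platform.

```r
# Write the superscripts as \u escapes so the example does not depend on file encoding
x <- "[ \u00b9\u2078 F]AlF-NOTA-FAPI-04"
Encoding(x)                    # declared encoding of the literal: "UTF-8"
charToRaw("\u00b9\u2078")      # UTF-8 bytes of the two superscripts: c2 b9 e2 81 b8

# Re-declare the same bytes as latin1, as the tidyWords line above does
y <- x
Encoding(y) <- "latin1"
Encoding(y)                            # now "latin1"
identical(charToRaw(x), charToRaw(y))  # TRUE: the bytes are unchanged
```

Read as latin1, those five bytes become the characters seen in the warning message (Â, ¹, â, byte 81, ¸).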

edward-burn · 2025-01-28T12:51:01Z

Hi @tleht I was looking into this, but the above worked fine on my machine so it could quite possibly be related to locale. I tweaked the code a little in the new release - can you please see if it is now working for you?

library(CodelistGenerator)
packageVersion("CodelistGenerator")
#> [1] '3.3.2'
CodelistGenerator:::tidyWords("[ ¹⁸ F]AlF-NOTA-FAPI-04")
#> [1] "falf nota fapi 04"

Created on 2025-01-28 with reprex v2.0.2

If you are still having problems could you please share the output from Sys.getlocale()?

tleht · 2025-01-29T07:15:21Z

I tried running this with the latest release, but it still keeps running into the same issue with the character ⁸:

library(CodelistGenerator)
packageVersion("CodelistGenerator")
[1] ‘3.3.2’
CodelistGenerator:::tidyWords("[ ¹⁸ F]AlF-NOTA-FAPI-04")
Error in sub(re, "", x, perl = TRUE) : input string 1 is invalid UTF-8
In addition: Warning message:
In sub(re, "", x, perl = TRUE) :
  unable to translate '[ Â¹â<81>¸ F]AlF-NOTA-FAPI-04' to UTF-8

Sounds likely that this might have to do with the locale or generally our system environment as apparently none of the other nodes running the script in our project ran into this specific issue. Here are the Sys.getlocale and sessionInfo calls:

Sys.getlocale()
[1] "LC_CTYPE=fi_FI.UTF-8;LC_NUMERIC=C;LC_TIME=fi_FI.UTF-8;LC_COLLATE=fi_FI.UTF-8;LC_MONETARY=fi_FI.UTF-8;LC_MESSAGES=fi_FI.UTF-8;LC_PAPER=fi_FI.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fi_FI.UTF-8;LC_IDENTIFICATION=C"

sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=fi_FI.UTF-8       LC_NUMERIC=C               LC_TIME=fi_FI.UTF-8        LC_COLLATE=fi_FI.UTF-8    
 [5] LC_MONETARY=fi_FI.UTF-8    LC_MESSAGES=fi_FI.UTF-8    LC_PAPER=fi_FI.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Helsinki
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] CodelistGenerator_3.3.2

loaded via a namespace (and not attached):
[1] compiler_4.4.2    cli_3.6.3         tools_4.4.2       rstudioapi_0.17.1 rlang_1.1.4       renv_1.0.11

edward-burn · 2025-01-29T08:35:45Z

Ah sorry that hasn't worked @tleht, would using iconv like below work for you?

concept_name <- "[ ¹⁸ F]AlF-NOTA-FAPI-04"
concept_name <- iconv(concept_name,
                   from = "UTF-8",
                   to = "UTF-8",
                   sub = "byte")
CodelistGenerator:::tidyWords(concept_name)
#> [1] "falf nota fapi 04"

Created on 2025-01-29 with reprex v2.1.0

tleht · 2025-01-29T09:32:16Z

Already tried that back in December without any success.

concept_name <- "[ ¹⁸ F]AlF-NOTA-FAPI-04"
concept_name <- iconv(concept_name,
                      from = "UTF-8",
                      to = "UTF-8",
                      sub = "byte")
packageVersion("CodelistGenerator")
[1] ‘3.3.2’
CodelistGenerator:::tidyWords(concept_name)
Error in sub(re, "", x, perl = TRUE) : input string 1 is invalid UTF-8
In addition: Warning message:
In sub(re, "", x, perl = TRUE) :
  unable to translate '[ Â¹â<81>¸ F]AlF-NOTA-FAPI-04' to UTF-8

edward-burn · 2025-01-29T09:49:51Z

hmmm @tleht how about

library(stringi)
library(stringr)

concept_name <- "[ ¹⁸ F]AlF-NOTA-FAPI-04"
concept_name <- str_replace_all(concept_name, "[^\\x20-\\x7E]", "")
concept_name
#> [1] "[  F]AlF-NOTA-FAPI-04"
CodelistGenerator:::tidyWords(concept_name)
#> [1] "falf nota fapi 04"

Created on 2025-01-29 with reprex v2.1.0

tleht · 2025-01-29T09:52:41Z

Thanks @edward-burn , that did the trick.

> library(stringi)
> library(stringr)
> 
> concept_name <- "[ ¹⁸ F]AlF-NOTA-FAPI-04"
> concept_name <- str_replace_all(concept_name, "[^\\x20-\\x7E]", "")
> concept_name
[1] "[  F]AlF-NOTA-FAPI-04"
> #> [1] "[  F]AlF-NOTA-FAPI-04"
> CodelistGenerator:::tidyWords(concept_name)
[1] "falf nota fapi 04"
> #> [1] "falf nota fapi 04"
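
The same stripping can also be done without stringr. Below is a base R sketch; stripNonAscii is a hypothetical helper name, not a CodelistGenerator function. Note that dropping characters changes the name (here the ¹⁸ is lost), so this suits matching better than display.

```r
# Hypothetical helper: drop every character outside printable ASCII (0x20 to 0x7E)
stripNonAscii <- function(x) {
  gsub("[^\\x20-\\x7E]", "", x, perl = TRUE)
}

# Superscripts written as \u escapes so the example does not depend on file encoding
stripNonAscii("[ \u00b9\u2078 F]AlF-NOTA-FAPI-04")
#> [1] "[  F]AlF-NOTA-FAPI-04"
```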

edward-burn · 2025-01-29T10:38:43Z

Fantastic, to go back to your original issue - will installing the branch below now get things working?

# install branch
# remotes::install_github("darwin-eu/CodelistGenerator@tidy_words_2")

library(CDMConnector)
library(CodelistGenerator)
packageVersion("CodelistGenerator")

cdm <- mockVocabRef() # replace with a cdm reference to your data
getDrugIngredientCodes(cdm = cdm, name = "Adalimumab")

tleht · 2025-01-29T11:01:58Z

Yes, we now get the expected results when using the updated version of the function.

> library(CDMConnector)
> library(CodelistGenerator)
> packageVersion("CodelistGenerator")
[1] ‘3.3.2.900’

> cdm <- ...
> getDrugIngredientCodes(cdm = cdm, name = "Adalimumab")

── 1 codelist ───────────────────────────────────────────────────────────────────────────────────────────────────────

- 327361_adalimumab (1063 codes)

edward-burn · 2025-01-29T11:12:42Z

Wonderful, I will incorporate that in the next release (but leave the branch there until then so you can use it if needed in the meantime)

I will close this issue once this is included in the next CRAN release.
