Add function to flag similar strings #75

Open
allaway opened this issue Oct 21, 2022 · 3 comments
Collaborator

allaway commented Oct 21, 2022

We currently do not standardize PI or institution names. It would be helpful to do this on a semi-regular basis.

It would be great to have a function that flags similar strings in the Studies table and to run it as a scheduled job, say weekly or quarterly. Actually fixing the data would probably still require manual intervention.

Collaborator Author

allaway commented Oct 21, 2022

Quick and dirty example:

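library(stringdist)
library(dplyr)
library(tibble)
library(tidyr)
library(synapser)
synLogin()

# Distinct PI names from the Studies table
foo <- synTableQuery('select distinct unnest(studyLeads) as pi from syn16787123')$asDataFrame()

# Pairwise string distances (0 = identical). With the default p = 0,
# method = "jw" is the plain Jaro distance, without the Winkler prefix bonus.
dist_mat <- stringdist::stringdistmatrix(foo$pi, foo$pi, method = "jw")

pheatmap::pheatmap(dist_mat)

# Label rows and columns with the names they refer to
dimnames(dist_mat) <- list(foo$pi, foo$pi)

# Long format: one row per pair of names, closest pairs first
tidy_names <- as_tibble(dist_mat, rownames = "pi_1") %>%
  tidyr::pivot_longer(-pi_1, names_to = "pi_2", values_to = "dist") %>%
  filter(dist != 0) %>%   # drop each name's comparison with itself
  arrange(dist)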


Which yields:
[Screenshot: Screen Shot 2022-10-21 at 2 23 37 PM]

Interestingly, one of the more prevalent issues appears to be trailing/leading whitespace, probably from older manual copy-pasting...
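
A minimal sketch of how whitespace-only variants could be caught before any distance calculation. The names below are made up for illustration, not taken from the Studies table, and `whitespace_dupes` is just a hypothetical variable name.

```r
# Toy data standing in for the PI names; these values are invented.
pi_names <- c("Jane Doe", "Jane Doe ", " John Smith", "John Smith", "Ana Lopez")

# Trim leading/trailing whitespace and collapse internal runs of whitespace
cleaned <- gsub("\\s+", " ", trimws(pi_names))

# Keep only raw spellings whose cleaned form is shared with another raw spelling
whitespace_dupes <- data.frame(raw = pi_names, cleaned = cleaned,
                               stringsAsFactors = FALSE)
whitespace_dupes <- whitespace_dupes[cleaned %in% cleaned[duplicated(cleaned)], ]
```

Anything left in `whitespace_dupes` differs from another entry only by whitespace, so it can likely be fixed without manual review.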

Pairs with a `jw` distance above 0.2 seem to be truly distinct, whereas pairs below 0.2 deserve closer inspection.
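
The 0.2 cutoff could be wrapped into the kind of flagging function this issue asks for. This is only a sketch: `flag_similar_strings` and its arguments are hypothetical names, the input values are made up, and the default threshold simply reuses the rough 0.2 observation above.

```r
library(stringdist)

# Hypothetical helper: return pairs of distinct strings whose distance
# falls below `threshold`, closest pairs first.
flag_similar_strings <- function(x, threshold = 0.2, method = "jw") {
  x <- unique(x[!is.na(x)])
  d <- stringdist::stringdistmatrix(x, x, method = method)
  # Upper triangle only: each pair is reported once and the diagonal is skipped
  idx <- which(upper.tri(d) & d < threshold, arr.ind = TRUE)
  flagged <- data.frame(
    string_1 = x[idx[, "row"]],
    string_2 = x[idx[, "col"]],
    dist = d[idx],
    stringsAsFactors = FALSE
  )
  flagged[order(flagged$dist), ]
}

# Invented example values, not taken from the Studies table
pi_names <- c("Jane Doe", "Jane Doe ", "Jane  Doe", "John Smith", "Maria Garcia")
flagged <- flag_similar_strings(pi_names)
```

With this toy input, only the whitespace variants of the same name should fall under the threshold. A scheduled job could call the function on each column of interest and post the result for manual review.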

Collaborator Author

allaway commented Oct 21, 2022

Similar for institutions:

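# Distinct institution names from the Studies table
foo <- synTableQuery('select distinct unnest(institutions) as inst from syn16787123')$asDataFrame()

# Pairwise string distances (0 = identical), same method as for the PI names
dist_mat <- stringdist::stringdistmatrix(foo$inst, foo$inst, method = "jw")

pheatmap::pheatmap(dist_mat)

# Label rows and columns with the names they refer to
dimnames(dist_mat) <- list(foo$inst, foo$inst)

# Long format: one row per pair of institutions, closest pairs first
tidy_names <- as_tibble(dist_mat, rownames = "inst_1") %>%
  tidyr::pivot_longer(-inst_1, names_to = "inst_2", values_to = "dist") %>%
  filter(dist != 0) %>%   # drop each name's comparison with itself
  arrange(dist)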

Which yields:

[Screenshot: Screen Shot 2022-10-21 at 2 30 22 PM]

However, this isn't as easy to scan manually: the many high-similarity "University of ..." matches hide the true matches, i.e. the values that actually need correction. Can you spot them here? ;)
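
One possible way to cut down this noise, not something tried in this thread, is to strip boilerplate words before comparing, so that the distance reflects only the distinctive part of each name. The sketch below uses made-up institution names, and both `strip_common` and the `stop_words` list are hypothetical.

```r
library(stringdist)

# Invented institution names for illustration only
inst <- c("University of Washington", "University of Washingtn",
          "University of Michigan", "University of Minnesota")

# Hypothetical list of boilerplate words to ignore when comparing
stop_words <- c("university", "of", "the", "institute", "college")

# Lower-case, remove the boilerplate words, and tidy up leftover whitespace
strip_common <- function(x, words) {
  x <- tolower(x)
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))
}

core <- strip_common(inst, stop_words)

# Distances on the raw names versus on the distinctive part only
raw_dist  <- stringdist::stringdistmatrix(inst, inst, method = "jw")
core_dist <- stringdist::stringdistmatrix(core, core, method = "jw")
```

Comparing `raw_dist` with `core_dist` shows how much of the apparent similarity comes from the shared prefix. The trade-off is that stripping words can merge names that are genuinely different (for example two institutions that differ only in word order), so flagged pairs would still need manual review.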

[Screenshot: Screen Shot 2022-10-21 at 2 30 39 PM]

Collaborator Author

allaway commented Oct 31, 2022

The PI names in the screenshot above have been standardized. I picked whichever variant was more recent as the "standard."
