08 -- String manipulations
We are going to use the package stringr to learn some common, basic string manipulations. In biology, these are often needed to clean messy datasets, to work with sequence data, or to extract data from unformatted text.
Base R has functions equivalent to the ones we will cover today. They are a little more challenging to use, but may provide additional flexibility if you have very specific needs.
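As a rough orientation, here is a minimal sketch of base R counterparts to some of the stringr functions covered in this lecture. The example strings are made up, and the pairing of functions is approximate rather than exact.

```r
## base R counterparts of a few stringr functions (no package needed)
s <- "a very cute kitten"

n_chars <- nchar(s)                        # like str_length(s)

## substr() has no negative indices, so the last 6 characters
## have to be located with nchar() arithmetic
last_six <- substr(s, nchar(s) - 5, nchar(s))   # like str_sub(s, start = -6)

joined   <- paste0("a ", "cat")            # like str_c("a ", "cat")
trimmed  <- trimws("  kitten  ")           # like str_trim() (R >= 3.2.0)
repeated <- strrep("wow ", 3)              # like str_dup() (R >= 3.3.0)

last_six
## [1] "kitten"
```

The nchar() arithmetic needed to count from the end of the string is one example of base R being a little more work than stringr.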
To learn the functions we'll cover today, we will work again with the list of sea cucumber specimens downloaded from iDigBio.
Let's get set up:
## if you need to download the data
## download.file("http://r-bio.github.io/data/holothuriidae-specimens.csv",
## "data/holothuriidae-specimens.csv")
hol <- read.csv(file="data/holothuriidae-specimens.csv", stringsAsFactors=FALSE)
library(stringr)
The function str_length() gives the number of characters in a string:
str_length(c("cat", "dog", "giraffe", "cute dog", "very cute kitten"))
## [1] 3 3 7 8 16
Which country where sea cucumbers have been collected has the most letters in its name?
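One possible approach to this kind of question can be sketched on a toy vector. The country names below are made up for illustration and stand in for whichever column of the dataset holds the countries.

```r
library(stringr)

## toy vector standing in for the country column (illustrative values only)
countries <- c("Fiji", "Papua New Guinea", "Madagascar", "Palau")

## str_length() counts every character, spaces included
name_lengths <- str_length(countries)
longest <- countries[which.max(name_lengths)]

## to count letters only, remove everything that is not a letter first
letters_only <- str_replace_all(countries, "[^[:alpha:]]", "")
letter_counts <- str_length(letters_only)

longest
## [1] "Papua New Guinea"
```

Note that which.max() returns only the first maximum, so ties would need to be checked separately.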
The function str_sub() can take 3 arguments: a vector of class "character", a start position and an end position. Negative numbers indicate positions counted from the end of the string (the last character being -1).
str_sub("a very cute kitten")
## [1] "a very cute kitten"
str_sub("a very cute kitten", start=1L, end=-1L)
## [1] "a very cute kitten"
str_sub("a very cute kitten", start=8)
## [1] "cute kitten"
str_sub("a very cute kitten", start=-6)
## [1] "kitten"
str_sub("a very cute kitten", start=3, end=6)
## [1] "very"
A common mistake in taxonomic data is that the wrong suffixes are used in order or class names. Check that the last 4 letters are the same for all of these names in this dataset.
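A suffix check of this kind can be sketched as follows. The vector is a made-up set of family-level names (not taken from the dataset), and the fourth entry contains a deliberate typo.

```r
library(stringr)

## made-up family-level names; the last one has a deliberate typo
families <- c("Holothuriidae", "Stichopodidae", "Synaptidae", "Holothuriidea")

## extract the last 4 characters of each name
suffixes <- str_sub(families, start = -4)

## tabulate the suffixes: more than one distinct value signals a problem
suffix_counts <- table(suffixes)

## show the names whose suffix is not the expected one
odd_ones <- families[suffixes != "idae"]
odd_ones
## [1] "Holothuriidea"
```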
str_sub() can also be used to replace parts of a string:
cutest <- "The cutest animals are puppies"
## replace the last 7 characters ("puppies")
str_sub(cutest, -7) <- "kittens"
cutest
## [1] "The cutest animals are kittens"
str_c() is equivalent to paste(), but by default uses the empty string as separator:
str_c("the cutest are ", c("cats", "dogs"), collapse=" and not ")
## [1] "the cutest are cats and not the cutest are dogs"
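The difference between the sep and collapse arguments is easy to mix up, so here is a small sketch that uses two species names as illustrative input.

```r
library(stringr)

genus   <- c("Holothuria", "Actinopyga")
species <- c("atra", "mauritiana")

## sep joins the arguments element by element: one string per pair
binomials <- str_c(genus, species, sep = " ")

## collapse merges the resulting vector into a single string
one_string <- str_c(binomials, collapse = "; ")
one_string
## [1] "Holothuria atra; Actinopyga mauritiana"

## with the default separator, str_c() behaves like paste0() ...
same_default <- identical(str_c("a", "b"), paste0("a", "b"))

## ... except for missing values: str_c() propagates NA,
## while paste0() turns it into the text "NA"
with_na_stringr <- str_c(c("a", NA), "-d")
with_na_base    <- paste0(c("a", NA), "-d")
```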
str_dup() replicates a string as many times as specified:
str_dup(c("wow ", "amazing "), 3)
## [1] "wow wow wow " "amazing amazing amazing "
str_trim() removes leading and trailing whitespace. When importing data (or cleaning up data entered in a form), it is very common to end up with spaces that you don't want to have to deal with; this function removes them.
str_pad() adds whitespace (or other characters) on the right, left or both sides to make a string a given length (width):
str_trim(str_dup(str_pad(c("wow ", "amazing "), width = str_length("amazing"), side="right", pad = "!"), 3))
## [1] "wow !!!wow !!!wow !!!" "amazing amazing amazing"
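The nested call above is compact but hard to read. A simpler sketch of the two functions used one at a time, on made-up catalog numbers, might look like this:

```r
library(stringr)

## catalog numbers as they might come out of a spreadsheet (made-up values)
raw_ids <- c(" 7", "42 ", "  123  ")

## remove the stray spaces on both sides
clean_ids <- str_trim(raw_ids)

## pad with zeros to a fixed width; the default side is "left"
padded_ids <- str_pad(clean_ids, width = 5, pad = "0")
padded_ids
## [1] "00007" "00042" "00123"

## side = "left" trims only the leading whitespace
left_trimmed <- str_trim("  kitten  ", side = "left")
left_trimmed
## [1] "kitten  "
```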
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. -- Jamie Zawinski
## count the scientific names that contain "holothuria", regardless of case
## (regex(x, ignore_case = TRUE) replaces the deprecated ignore.case(x))
sum(str_detect(hol$dwc.scientificName, regex("holothuria", ignore_case = TRUE)))
## [1] 2899
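To see what case-insensitive matching changes, here is a minimal sketch on a toy vector. The names and their inconsistent capitalisation are made up for illustration.

```r
library(stringr)

## made-up scientific names with inconsistent capitalisation
sci_names <- c("Holothuria atra", "holothuria edulis",
               "Actinopyga mauritiana", "HOLOTHURIA sp.")

## a plain pattern is case sensitive: only the all-lowercase name matches
strict <- str_detect(sci_names, "holothuria")

## wrapping the pattern in regex(..., ignore_case = TRUE) matches all spellings
relaxed <- str_detect(sci_names, regex("holothuria", ignore_case = TRUE))

## str_detect() returns a logical vector, so sum() counts the matches
n_matches <- sum(relaxed)
n_matches
## [1] 3
```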
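library(dplyr) ## provides the pipe operator %>%
authors <- hol$dwc.scientificNameAuthorship %>%
    ## transliterate accented characters (first match in each string only)
    str_replace(pattern = "å", replacement = "aa") %>%
    str_replace(pattern = "ä", replacement = "ae") %>%
    str_replace(pattern = "é", replacement = "e") %>%
    ## expand initials and drop a leading "Krauss in"
    str_replace("^HL", "Hubert Lyman") %>%
    str_replace("^Krauss in", "") %>%
    ## split multi-author strings on "&", then on ", "
    str_split("&") %>% unlist() %>%
    str_split(", ") %>% unlist() %>%
    ## keep the text from the first letter to the last letter
    str_extract("[[:alpha:]]+(.+[[:alpha:]])?") %>%
    unique() %>% sort()
authors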
## [1] "Augustin" "Bell" "Brandt"
## [4] "Caso" "Caycedo" "Cherbonnier"
## [7] "Chiaje" "Clark" "Deichmann"
## [10] "Delle Chiaje" "Domantay" "Erwe"
## [13] "Feral" "Fisher" "Forskaal"
## [16] "Gaimard" "Gmelin" "Hubert Lyman Clark"
## [19] "Jaeger" "Kerr" "Laguarda-Figueras"
## [22] "Lampert" "Lesson" "Linnaeus"
## [25] "Ludwig" "Marenzeller" "Massin"
## [28] "Miller" "Mitsukuri" "Pawson"
## [31] "Pourtales" "Pourtalès" "Purcell"
## [34] "Quoy" "Rowe" "Samyn"
## [37] "Selenka" "Semper" "Sluiter"
## [40] "Solís-Marín" "Tan Tiu" "Theel"
## [43] "Tomascik" "Uthicke" "von Marenzeller"
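To make the splitting and extraction steps easier to follow, here is a sketch of the same kind of cleanup on three made-up authorship strings. It uses intermediate variables instead of the pipe; the variable names are only for illustration.

```r
library(stringr)

## made-up authorship strings, in the style found in taxonomic data
raw <- c("(Selenka, 1867)", "Quoy & Gaimard, 1833", "HL Clark, 1921")

## expand the initials at the start of a string
step1 <- str_replace(raw, "^HL", "Hubert Lyman")

## split on "&", then on ", "; str_split() returns a list, hence unlist()
step2 <- unlist(str_split(step1, "&"))
step3 <- unlist(str_split(step2, ", "))

## keep the text from the first letter to the last letter;
## pieces without any letter (the years) become NA
step4 <- str_extract(step3, "[[:alpha:]]+(.+[[:alpha:]])?")

## unique() removes duplicates; sort() drops the NA values by default
toy_authors <- sort(unique(step4))

## result: Gaimard, Hubert Lyman Clark, Quoy, Selenka
toy_authors
```

Printing each intermediate step is a convenient way to check what a regular expression actually matched before running it on the full column.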
- The package stringr has a good vignette (on which this lecture is based).
- The package rex provides an easier way of writing regular expressions.
- This website provides reference and tutorials for regular expressions.
- This tool allows you to check what your regular expression will match on text you provide.