OUHSC Statistical Computing User Group
Will Beasley, Dept of Pediatrics,
Biomedical and Behavioral Methodology Core (BBMC)
- Review of Regex Part 1 from two meetings ago.
- Introduce a few more language-agnostic techniques
- Apply in a few languages
- Text editors
- Notepad++, Atom, or anything else halway-serious
- Languages
- R, Python, SAS, & most others.
- Databases
- First-class support in Postgres with succinct
~
and in MySQL withREGEXP
. And in Oracle withREGEXP_SUBSTR
andREGEXP_LIKE
, and evenREGEXP_REPLACE
. - It's tricky, but possible with SQLite and SQL Server.
- The standard/portable
LIKE
SQL operator might do everything you need anyway.
- First-class support in Postgres with succinct
A 'regex' is typically a carefully crafted string that describes a pattern of text. It can:
- Extract components of the text,
- Substitute components of the text, or
- Determine if the pattern simply appears in the text.
It's like the big brother of wildcards you match filenames with
(eg, "*.R"
).
Pattern | Matches |
---|---|
mike |
"mike", "smike", "miked", etc. |
mike4 |
"mike4", "smike4", etc. |
mike\d |
"mike" followed by any single digit (eg "mike8", "smike8") |
mike\d+ |
"mike" followed by one or more digits (eg "mike1234", "smike8") |
^mike$ |
only "mike" |
\b19(?=(1|2))(\d{2})\b
and 20\2
converts years in the 1910s and 1920s to the 2010s and 2020s
(but leaves later years as they are).
-
An online regex tester, regex101 (https://regex101.com/).
-
Example "subject" text in SCUG repo.
(Google "OU scug github".) -
A local text editor, choose one of the following:
- Atom (https://atom.io/).
- Notepad++ (https://notepad-plus-plus.org/).
-
Today's as language agnostic as possible. SAS, R, and Python examples in Part 2.
-
Later, consider RegexBuddy for $40.
- There's no single "regular expression" specification. Each language (eg, Python, R, Java) have slightly different flavors.
- There are two main branches of the specification. We'll concentrate on
- the "Perl" branch (eg,
"\d\w"
) instead of - the "Posix" branch (eg,
"[:digit:][:alnum:]"
)
- the "Perl" branch (eg,
- Don't forget the "g" option in regex101.com
grep()
andgrep(..., value=T)
grepl()
sub()
andgsub()
regexpr()
,gregexpr()
,regexec()
- Like most R functions, it's designed for vectors.
- Careful with subtle differences from documentation for other languages.
- Usually, the differences work out for the best.
-
Start with
`import re`
-
search()
--the first match.m = re.search('(?<=a)d', 'ad') m.group(0)
-
findall()
--all non-overlapping matches. -
match()
--careful, needs to be at the start. -
sub()
- Identify years in the 1990s, or
- Identify years ending in 7, 8, or 9, or
- Identify lines starting with a year, or
- "Capture" the century of each year, or
- Think of something else.
1916-1918 subscales for a subject
1998-1914 subscales for a subject
subscales for a subject 1998-1920
s <- c("1916-1918 subscales for a subject", "1898-2003 subscales for a subject", "subscales for a subject 1998-1920")
g <- sub("19(1|2)(\\d)", "20\\1\\2", s)
cat(g, sep="\n")
2016-1918 subscales for a subject
1898-2003 subscales for a subject
subscales for a subject 1998-2020
grep("19(1|2)(\\d)", s)
[1] 1 3
grep("19(1|2)(\\d)", s, value=T)
[1] "1916-1918 subscales for a subject" "subscales for a subject 1998-1920"
grepl("19(1|2)(\\d)", s)
[1] TRUE FALSE TRUE
- Start in a regex tester.
- Start small/simple, then slowly build complexity & generality.
- In the regex.
- In the example/subject text.
- Leave a breadcrumb trail
(ie, progression of simpler regexes, commented out). - Include comments to help others and you later.
- Short-term: use the right-panels of regex101.com.
- Blacklist: Flag the bad numeric values. (Sometimes "bad" is a subjective decision.)
- Whitelist: Permit only the good values.
1234
23
14a
1a3
234
1.39
In the right panel of bottom right panel of regex101.com. Especially these first:
. versus \.
\w and \d and \s (versus \W and \D and \S)
^ and \A
$ and \Z
? and * and + and things like {3,6}
Capturing
Character classes
- Swap the columns.
"CL_ID" = "ClientID"
, "RMSEQ" = "RemovedSequence"
, "RMVL_BEGDT" = "RemovalBeginDate"
, "RMVL_ENDDT" = "RemovalEndDate"
, "END_TYP_CDE" = "EndTypeID"
, "REMOVED_FROM" = "RemovedFromTypeID"
, "CURR_RMVL_TYP" = "RemovalTypeCurrentID"
, "ORIG_RMVL_TYP" = "RemovalTypeOriginalID"
, "FMLY_STRUCTURE" = "FamilyStructureTypeID"
- Anticipate misbehaving subject that you haven't seen yet.
- Try variations of the solutions. There are usually 10 solutions, each with strengths and weaknesses.
- Robustness.
- Readability & maintainability.
- Generality.
- Create very selective regexes that loudly fail when they encounter subject text that you haven't anticipated.
- Long-term: read and rereadRegular Expressions Cookbook, esp Ch 2.
- Erase the "quietly" parameters.
- In a single file
- In many files
requireNamespace("dplyr", quietly=TRUE) #hadley/dplyr
requireNamespace("lubridate")
requireNamespace("OuhscMunge",quietly=T) #OuhscBbmc/OuhscMunge
- Pad single digits with zeros (eg, "4" becomes "04")
9
4
34
3
62
43
1
- Extract the cage ID
- Extract the mouse ID (within the cage)
Time,Gender,Genetype,Treatment,MouseID,OR-Recognition Index,FC-t-F %,FC-b-F %,FC-a-F %
4M,Male,WILD,Control,c9-1,0.32,11.9,0,25.7
4M,Male,WILD,Control,c13-2,0.47,23.7,0,11.
4M,Male,WILD,Prozac,c10-2,0.62,40.7,11.4,51.4
4M,Male,WILD,Prozac,c14-3,0.63,10.2,0,28.6
4M,Male,YFP,Control,c9-2,0.42,42.4,11.4,22.9
4M,Male,YFP,Control,c13-1,0.5,15.3,0,54.1
4M,Male,YFP,Control,c13-nm,1,27.1,0,31.4
4M,Male,YFP,Prozac,c10-1,0.65,20.3,17.1,54.3
(In some cases, you'd have to parse only the cell, not the entire line. But this is good practice.)
- Work by yourself or in pairs.
- After you're done with these 6 exercises,
- Invent new challenges
- Help someone else
- Check the solutions I thought of.
Look for an email invitation to a REDCap survey.
Example 3
(,*\s*)"(\w+)"\s+=\s+"(\w+)" and
$1"$3" = "$2"
Example 4
library\((\w+),\s*quietly=(T|TRUE)\) and
library($1)
Example 5
\b(\d)\b and
0$1
Example 6
,c(\d{1,2})-(\d|nm),