diff --git a/regex/.ipynb_checkpoints/regex-checkpoint.ipynb b/regex/.ipynb_checkpoints/regex-checkpoint.ipynb
index 939620f2..b5ec0429 100644
--- a/regex/.ipynb_checkpoints/regex-checkpoint.ipynb
+++ b/regex/.ipynb_checkpoints/regex-checkpoint.ipynb
@@ -20,33 +20,37 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
- "::: {.callout-note collapse=\"true\"}\n",
+ "::: {.callout-note collapse=\"false\"}\n",
"## Learning Outcomes\n",
- "- Understand Python string manipulation, Pandas Series methods\n",
+ "- Understand Python string manipulation, `pandas` `Series` methods\n",
"- Parse and create regex, with a reference table\n",
- "- Use vocabulary (closure, metacharater, groups, etc.) to describe regex metacharacters\n",
+ "- Use vocabulary (closure, metacharacters, groups, etc.) to describe regex metacharacters\n",
":::\n",
"\n",
+ "**This content is covered in lectures 6 and 7.**\n",
+ "\n",
"## Why Work with Text?\n",
"\n",
- "Last lecture, we learned of the difference between quantitative and qualitative variable types. The latter includes string data - the primary focus of today's lecture. In this note, we'll discuss the necessary tools to manipulate text: Python string manipulation and regular expressions. \n",
+ "Last lecture, we learned of the difference between quantitative and qualitative variable types. The latter includes string data — the primary focus of lecture 6. In this note, we'll discuss the necessary tools to manipulate text: Python string manipulation and regular expressions. \n",
"\n",
"There are two main reasons for working with text.\n",
"\n",
"1. Canonicalization: Convert data that has multiple formats into a standard form.\n",
- " - By manipulating text, we can join tables with mismatched string labels\n",
+ " - By manipulating text, we can join tables with mismatched string labels.\n",
+ "\n",
"2. Extract information into a new feature.\n",
- " - For example, we can extract date and time features from text\n",
+ " - For example, we can extract date and time features from text.\n",
"\n",
"## Python String Methods\n",
"\n",
- "First, we'll introduce a few methods useful for string manipulation. The following table includes a number of string operations supported by Python and `pandas`. The Python functions operate on a single string, while their equivalent in `pandas` are **vectorized** - they operate on a Series of string data.\n",
+ "First, we'll introduce a few methods useful for string manipulation. The following table includes a number of string operations supported by Python and `pandas`. The Python functions operate on a single string, while their equivalent in `pandas` are **vectorized** — they operate on a `Series` of string data.\n",
"\n",
"+-----------------------+-----------------+---------------------------+\n",
- "| Operation | Python | Pandas (Series) |\n",
+ "| Operation | Python | `Pandas` (`Series`) |\n",
"+=======================+=================+===========================+\n",
"| Transformation | - `s.lower(_)` | - `ser.str.lower(_)` |\n",
"| | - `s.upper(_)` | - `ser.str.upper(_)` |\n",
@@ -67,7 +71,7 @@
"| | | |\n",
"+-----------------------+-----------------+---------------------------+\n",
"\n",
- "We'll discuss the differences between Python string functions and `pandas` Series methods in the following section on canonicalization.\n",
+ "We'll discuss the differences between Python string functions and `pandas` `Series` methods in the following section on canonicalization.\n",
"\n",
"### Canonicalization\n",
"Assume we want to merge the given tables."
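The single-string vs. vectorized distinction described above can be sketched in a few lines (an illustrative aside; the example strings are assumptions modeled on the county tables):

```python
import pandas as pd

# Plain Python: one string at a time
s = "De Witt County"
print(s.lower())  # -> de witt county

# pandas: the .str accessor applies the same operation across a whole Series
ser = pd.Series(["De Witt County", "Lac qui Parle County"])
print(ser.str.lower().tolist())  # -> ['de witt county', 'lac qui parle county']
```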
@@ -75,7 +79,7 @@
},
{
"cell_type": "code",
- "execution_count": 2,
+ "execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -91,7 +95,7 @@
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": 2,
"metadata": {},
"outputs": [
{
@@ -225,7 +229,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Last time, we used a **primary key** and **foreign key** to join two tables. While neither of these keys exist in our DataFrames, the `County` columns look similar enough. Can we convert these columns into one standard, canonical form to merge the two tables? \n",
+ "Last time, we used a **primary key** and **foreign key** to join two tables. While neither of these keys exist in our `DataFrame`s, the `\"County\"` columns look similar enough. Can we convert these columns into one standard, canonical form to merge the two tables? \n",
"\n",
"#### Canonicalization with Python String Manipulation\n",
"\n",
@@ -234,7 +238,7 @@
},
{
"cell_type": "code",
- "execution_count": 4,
+ "execution_count": 3,
"metadata": {},
"outputs": [
{
@@ -243,7 +247,7 @@
"'stjohnthebaptist'"
]
},
- "execution_count": 4,
+ "execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@@ -267,12 +271,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We will use the `pandas` `map` function to apply the `canonicalize_county` function to every row in both DataFrames. In doing so, we'll create a new column in each called `clean_county_python` with the canonical form."
+ "We will use the `pandas` `map` function to apply the `canonicalize_county` function to every row in both `DataFrame`s. In doing so, we'll create a new column in each called `clean_county_python` with the canonical form."
]
},
{
"cell_type": "code",
- "execution_count": 5,
+ "execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
@@ -282,7 +286,7 @@
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 5,
"metadata": {},
"outputs": [
{
@@ -428,16 +432,25 @@
"source": [
"#### Canonicalization with Pandas Series Methods\n",
"\n",
- "Alternatively, we can use `pandas` Series methods to create this standardized column. To do so, we must call the `.str` attribute of our Series object prior to calling any methods, like `.lower` and `.replace`. Notice how these method names match their equivalent built-in Python string functions.\n",
+ "Alternatively, we can use `pandas` `Series` methods to create this standardized column. To do so, we must call the `.str` attribute of our `Series` object prior to calling any methods, like `.lower` and `.replace`. Notice how these method names match their equivalent built-in Python string functions.\n",
"\n",
- "Chaining multiple Series methods in this manner eliminates the need to use the `map` function (as this code is vectorized)."
+ "Chaining multiple `Series` methods in this manner eliminates the need to use the `map` function (as this code is vectorized)."
]
},
{
"cell_type": "code",
- "execution_count": 7,
+ "execution_count": 6,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "C:\\Users\\yashd\\AppData\\Local\\Temp\\ipykernel_22720\\837021704.py:3: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n",
+ " county_series\n"
+ ]
+ }
+ ],
"source": [
"def canonicalize_county_series(county_series):\n",
" return (\n",
@@ -456,7 +469,7 @@
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 7,
"metadata": {},
"outputs": [
{
@@ -619,7 +632,7 @@
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
@@ -629,7 +642,7 @@
},
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 9,
"metadata": {},
"outputs": [
{
@@ -640,7 +653,7 @@
" '169.237.46.240 - \"\" [3/Feb/2006:10:18:37 -0800] \"GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1\"\\n']"
]
},
- "execution_count": 10,
+ "execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
@@ -653,14 +666,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Suppose we want to extract the day, month, year, hour, minutes, seconds, and timezone. Unfortunately, these items are not in a fixed position from the beginning of the string, so slicing by some fixed offset won't work.\n",
+ "Suppose we want to extract the day, month, year, hour, minutes, seconds, and time zone. Unfortunately, these items are not in a fixed position from the beginning of the string, so slicing by some fixed offset won't work.\n",
"\n",
"Instead, we can use some clever thinking. Notice how the relevant information is contained within a set of brackets, further separated by `/` and `:`. We can home in on this region of text and split the data on these characters. Python's built-in `.split` function makes this easy."
]
},
{
"cell_type": "code",
- "execution_count": 11,
+ "execution_count": 10,
"metadata": {},
"outputs": [
{
@@ -669,7 +682,7 @@
"('26', 'Jan', '2014', '10', '47', '58', '-0800')"
]
},
- "execution_count": 11,
+ "execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
@@ -690,9 +703,9 @@
"source": [
"There are two problems with this code:\n",
"\n",
- "1. Python's built-in functions limit us to extract data one record at a time\n",
- " - This can be resolved using a map function or Pandas Series methods.\n",
- "2. The code is quite verbose\n",
+ "1. Python's built-in functions limit us to extracting data one record at a time.\n",
+ " - This can be resolved using the `map` function or `pandas` `Series` methods.\n",
+ "2. The code is quite verbose.\n",
" - This is a larger issue that is trickier to solve.\n",
"\n",
"In the next section, we'll introduce regular expressions - a tool that solves problem 2.\n",
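As a preview of how a regular expression collapses the verbose split logic (a sketch, not from the notebook; the log line is the one shown earlier in this diff):

```python
import re

line = '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1"'

# One pattern extracts all seven fields at once: digits, month word,
# and the trailing time zone inside the literal brackets.
day, month, year, hour, minute, second, tz = re.findall(
    r"\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) (.+)\]", line
)[0]
print(day, month, year, hour, minute, second, tz)  # -> 26 Jan 2014 10 47 58 -0800
```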
@@ -706,9 +719,20 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 11,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'[0-9]{3}-[0-9]{2}-[0-9]{4}'"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"r\"[0-9]{3}-[0-9]{2}-[0-9]{4}\" # Regular Expression Syntax\n",
"\n",
@@ -740,12 +764,12 @@
"+-----------------------+-----------------+----------------+-------------+-------------------+\n",
"| Operation | Order | Syntax Example | Matches | Doesn't Match | \n",
"+=======================+=================+================+=============+===================+\n",
- "| `Concatenation` | 3 | AABAAB | AABAAB | every other string|\n",
- "| | | | | |\n",
- "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
"| `Or`: `|` | 4 | AA|BAAB | AA | every other string|\n",
"| | | | BAAB | |\n",
"+-----------------------+-----------------+----------------+-------------+-------------------+\n",
+ "| `Concatenation` | 3 | AABAAB | AABAAB | every other string|\n",
+ "| | | | | |\n",
+ "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
"| `Closure`: `*` | 2 | AB*A | AA | AB |\n",
"| (zero or more) | | | ABBBBBBA | ABABA |\n",
"+-----------------------+-----------------+----------------+-------------+-------------------+\n",
@@ -776,7 +800,7 @@
"- Notice how the outer capture group surrounds the `|`. \n",
" - Consider the regex `m(uu(uu)*)|(oo(oo)*)n`. This incorrectly matches `muu` and `oooon`. \n",
" - Each OR clause is everything to the left and right of `|`. The incorrect solution matches only half of the string, and ignores either the beginning `m` or trailing `n`.\n",
- " - A set of paranthesis must surround `|`. That way, each OR clause is everything to the left and right of `|` **within** the group. This ensures both the beginning `m` *and* trailing `n` are matched.\n",
+ " - A set of parentheses must surround `|`. That way, each OR clause is everything to the left and right of `|` **within** the group. This ensures both the beginning `m` *and* trailing `n` are matched.\n",
"\n",
"## Regex Expanded\n",
"\n",
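The grouping discussion above can be sanity-checked with Python's `re` module (a minimal sketch using the moon patterns from the text):

```python
import re

# Correct: parentheses around | keep the leading m and trailing n in both clauses
good = r"m((uu(uu)*)|(oo(oo)*))n"
assert re.fullmatch(good, "muun")
assert re.fullmatch(good, "moooon")
assert not re.fullmatch(good, "muuun")  # odd number of u's

# Incorrect: | splits the entire pattern, so half-strings match
bad = r"m(uu(uu)*)|(oo(oo)*)n"
assert re.fullmatch(bad, "muu")    # missing the trailing n, yet it matches
assert re.fullmatch(bad, "oooon")  # missing the leading m, yet it matches
```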
@@ -845,10 +869,10 @@
"\n",
"1. `.*SPB.*` only matches strings that contain the substring `SPB`.\n",
" - The `.*` metacharacter matches any non-negative number of characters. Newlines do not count. \n",
- "2. This regular expression matches 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit\n",
- " - You'll recognize this as the familiar Social Security Number regular expression\n",
+ "2. This regular expression matches 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.\n",
+ " - You'll recognize this as the familiar Social Security Number regular expression.\n",
"3. Matches any email with a `com` or `edu` domain, where all characters of the email are letters.\n",
- " - At least one `.` must preceed the domain name. Including a backslash `\\` before any metacharacter (in this case, the `.`) tells regex to match that character exactly.\n",
+ " - At least one `.` must precede the domain name. Including a backslash `\\` before any metacharacter (in this case, the `.`) tells RegEx to match that character exactly.\n",
"\n",
"## Convenient Regex\n",
"\n",
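The numbered examples above can be tried directly in Python (the email pattern below is an assumption matching the description in example 3; the course's exact regex is not shown in this diff):

```python
import re

# Example 2: the Social Security Number pattern
ssn = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
assert re.fullmatch(ssn, "123-45-6789")
assert not re.fullmatch(ssn, "123-456-789")

# Example 3: letters only, an escaped literal dot, and a com or edu domain
email = r"[A-Za-z]+@[A-Za-z]+\.(com|edu)"
assert re.fullmatch(email, "data@berkeley.edu")
assert not re.fullmatch(email, "data@berkeley.org")
```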
@@ -880,16 +904,44 @@
"| | | | | \n",
"+------------------------------------------------+-----------------+----------------+------------------+\n",
"\n",
- "#### Examples\n",
+ "### Greediness\n",
+ "\n",
+ "In order to fully understand the last operation in the table, we have to discuss greediness. RegEx is greedy – it will look for the longest possible match in a string. To motivate this with an example, consider the pattern `<div>.*</div>`. Given the sentence below, we would hope that the bolded portions would be matched:\n",
+ "\n",
+ "\"This is a **\\<div>example\\</div>** of greediness \\<div>in\\</div> regular expressions.\"\n",
+ "\n",
+ "In actuality, the way RegEx processes the text given that pattern is as follows:\n",
+ "\n",
+ "1. \"Look for the exact string `<div>`\",\n",
+ "\n",
+ "2. then, \"look for any character 0 or more times\",\n",
+ "\n",
+ "3. then, \"look for the exact string `</div>`\"\n",
+ "\n",
+ "The result would be all the characters from the leftmost `<div>` to the rightmost `</div>` (inclusive). We can fix this by making the pattern non-greedy: `<div>.*?</div>`. You can read more in the documentation [here](https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy).\n",
+ "\n",
+ "### Examples\n",
"\n",
"Let's revisit our earlier problem of extracting date/time data from the given `.txt` files. Here is how the data looked."
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 12,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"\\n'"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"log_lines[0]"
]
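The greediness behavior described above can be demonstrated with the sentence from that section (a sketch using Python's `re` module):

```python
import re

text = "This is a <div>example</div> of greediness <div>in</div> regular expressions."

# Greedy: .* runs all the way to the LAST </div>
print(re.findall(r"<div>.*</div>", text))
# -> ['<div>example</div> of greediness <div>in</div>']

# Non-greedy: .*? stops at the first </div> it can reach
print(re.findall(r"<div>.*?</div>", text))
# -> ['<div>example</div>', '<div>in</div>']
```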
@@ -898,11 +950,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "**Question**: Give a regular expression that matches everything contained within and including the brackets - the day, month, year, hour, minutes, seconds, and timezone.\n",
+ "**Question**: Give a regular expression that matches everything contained within and including the brackets - the day, month, year, hour, minutes, seconds, and time zone.\n",
"\n",
"**Answer**: `\\[.*\\]`\n",
"\n",
- "- Notice how matching the literal `[` and `]` is necessary. Therefore, an escape character `\\` is required before both `[` and `]` - otherwise these metacharacters will match character classes. \n",
+ "- Notice how matching the literal `[` and `]` is necessary. Therefore, an escape character `\\` is required before both `[` and `]` — otherwise these metacharacters will match character classes. \n",
"- We need to match a particular format between `[` and `]`. For this example, `.*` will suffice.\n",
"\n",
"**Alternative Solution**: `\\[\\w+/\\w+/\\w+:\\w+:\\w+:\\w+\\s-\\w+\\]`\n",
@@ -917,18 +969,36 @@
"\n",
"#### Canonicalization with Regex\n",
"\n",
- "Earlier in this note, we examined the process of canonicalization using Python string manipulation and `pandas` Series methods. However, we mentioned this approach had a major flaw: our code was unnecessarily verbose. Equipped with our knowledge of regular expressions, let's fix this.\n",
+ "Earlier in this note, we examined the process of canonicalization using Python string manipulation and `pandas` `Series` methods. However, we mentioned this approach had a major flaw: our code was unnecessarily verbose. Equipped with our knowledge of regular expressions, let's fix this.\n",
+ "\n",
+ "To do so, we need to understand a few functions in the `re` module. The first of these is the substitute function: `re.sub(pattern, repl, text)`. It behaves similarly to Python's built-in `.replace` function, and returns text with all instances of `pattern` replaced by `repl`. \n",
"\n",
- "To do so, we need to understand a few functions in the `re` module. The first of these is the substitute function: `re.sub(pattern, rep1, text)`. It behaves similarily to Python's built-in `.replace` function, and returns text with all instances of `pattern` replaced by `rep1`. \n",
+ "The regular expression here removes text surrounded by `<>` (also known as HTML tags).\n",
"\n",
- "The regular expression here removes text surrounded by `<>` (also known as HTML tags)."
+ "In order, the pattern matches:\n",
+ "1. a single `<`\n",
+ "2. any characters that are not a `>` (e.g., `div`, `td valign...`, `/td`, `/div`)\n",
+ "3. a single `>`\n",
+ "\n",
+ "Any substring in `text` that fulfills all three conditions will be replaced by `''`."
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 13,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'Moo'"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"import re\n",
"\n",
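The three matching conditions above can be captured by a pattern like `<[^>]+>` (the input string here is an assumption; the `'Moo'` result matches the cell output shown in this diff):

```python
import re

# Strip anything of the form <...> (HTML tags), leaving only the text between them
text = "<div><td valign='top'>Moo</td></div>"
print(re.sub(r"<[^>]+>", "", text))  # -> Moo
```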
@@ -941,20 +1011,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Notice the `r` preceeding the regular expression pattern; this specifies the regular expression is a raw string. Raw strings do not recognize escape sequences (ie the Python newline metacharacter `\\n`). This makes them useful for regular expressions, which often contain literal `\\` characters.\n",
+ "Notice the `r` preceding the regular expression pattern; this specifies the regular expression is a raw string. Raw strings do not recognize escape sequences (i.e., the Python newline metacharacter `\\n`). This makes them useful for regular expressions, which often contain literal `\\` characters.\n",
"\n",
- "In other words, don't forget to tag your regex with a `r`.\n",
+ "In other words, don't forget to tag your RegEx with an `r`.\n",
"\n",
"#### Canonicalization with Pandas\n",
"\n",
- "We can also use regular expressions with Pandas Series methods. This gives us the benefit of operating on an entire column of data as opposed to a single value. The code is simple: `ser.str.replace(pattern, repl, regex=True`).\n",
+ "We can also use regular expressions with `pandas` `Series` methods. This gives us the benefit of operating on an entire column of data as opposed to a single value. The code is simple: `ser.str.replace(pattern, repl, regex=True)`.\n",
"\n",
- "Consider the following DataFrame `html_data` with a single column."
+ "Consider the following `DataFrame` `html_data` with a single column."
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
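The vectorized version described above can be sketched as follows (the frame contents and column name are assumptions standing in for `html_data`, whose rows are elided in this diff):

```python
import pandas as pd

html_data = pd.DataFrame({"HTML": ["<div>Moo</div>", "<td>Berkeley</td>"]})

# regex=True tells pandas to treat the pattern as a regular expression,
# which also silences the FutureWarning seen earlier in this diff
cleaned = html_data["HTML"].str.replace(r"<[^>]+>", "", regex=True)
print(cleaned.tolist())  # -> ['Moo', 'Berkeley']
```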
@@ -967,18 +1037,85 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 15,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
C:\Users\yashd\AppData\Local\Temp\ipykernel_35008\837021704.py:3: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
- county_series
+
/var/folders/7t/zbwy02ts2m7cn64fvwjqb8xw0000gp/T/ipykernel_43617/837021704.py:3: FutureWarning:
+
+The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
+
+/var/folders/7t/zbwy02ts2m7cn64fvwjqb8xw0000gp/T/ipykernel_43617/837021704.py:3: FutureWarning:
+
+The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
+
@@ -929,20 +934,8 @@
Convenient Regex
-
-
Greediness
-
In order to fully understand the last operation in the table, we have to discuss greediness. Regex is greedy – it will look for the longest possible match in a string. To motivate this with an example, consider the pattern <div>.*</div>. Given the sentence below, we would hope that the bolded portions would be matched:
-
“This is a <div>example</div> of greediness <div>in</div> regular expressions.” ”
-
In actuality, the way RegEx processes the text given that pattern is as follows:
-
-
“Look for the exact string <>”
-
then, “look for any character 0 or more times”
-
then, “look for the exact string </div>”
-
-
The result would be all the characters starting from the leftmost <div> and the rightmost </div> (inclusive). We can fix this making our the pattern non-greedy, <div>.*?</div>. You can read up more on the documentation here.
-
-
-
Examples
+
+
Examples
Let’s revisit our earlier problem of extracting date/time data from the given .txt files. Here is how the data looked.
log_lines[0]
@@ -1122,111 +1115,6 @@
Extraction with Pan
This function returns a list for every row containing the pattern matches in a given string.
-
As you may expect, there are similar pandas equivalents for other re functions as well. Series.str.extract takes in a pattern and returns a DataFrame of each capture group’s first match in the string. In contrast, Series.str.extractall returns a multi-indexed DataFrame of all matches for each capture group. You can see the difference in the outputs below:
Notice how the given pattern has 3 capture groups, each specified by the regular expression (\d\d). We then use re.findall to return these capture groups, each as tuples containing 3 matches.
These regular expression capture groups can be different. We can use the (\d{2}) shorthand to extract the same data.
Ultimately, the goal is not to memorize all of regular expressions. Rather, the aim is to:
-
-
Understand what regex is capable of.
-
Parse and create regex, with a reference table
-
Use vocabulary (metacharacter, escape character, groups, etc.) to describe regex metacharacters.
-
Differentiate between (), [], {}
-
Design your own character classes with , , […-…], ^, etc.
-
Use Python and pandas regex methods.
-
diff --git a/regex/regex.ipynb b/regex/regex.ipynb
index 11275588..b5ec0429 100644
--- a/regex/regex.ipynb
+++ b/regex/regex.ipynb
@@ -1,1633 +1,1634 @@
{
- "cells": [
- {
- "cell_type": "raw",
- "metadata": {},
- "source": [
- "---\n",
- "title: Regular Expressions\n",
- "format:\n",
- " html:\n",
- " toc: true\n",
- " toc-depth: 5\n",
- " toc-location: right\n",
- " code-fold: false\n",
- " theme:\n",
- " - cosmo\n",
- " - cerulean\n",
- " callout-icon: false\n",
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "::: {.callout-note collapse=\"true\"}\n",
- "## Learning Outcomes\n",
- "- Understand Python string manipulation, Pandas Series methods\n",
- "- Parse and create regex, with a reference table\n",
- "- Use vocabulary (closure, metacharater, groups, etc.) to describe regex metacharacters\n",
- ":::\n",
- "\n",
- "**This content is covered in lectures 6 and 7.**\n",
- "\n",
- "## Why Work with Text?\n",
- "\n",
- "Last lecture, we learned of the difference between quantitative and qualitative variable types. The latter includes string data - the primary focus of today's lecture. In this note, we'll discuss the necessary tools to manipulate text: Python string manipulation and regular expressions. \n",
- "\n",
- "There are two main reasons for working with text.\n",
- "\n",
- "1. Canonicalization: Convert data that has multiple formats into a standard form.\n",
- " - By manipulating text, we can join tables with mismatched string labels\n",
- "\n",
- "2. Extract information into a new feature.\n",
- " - For example, we can extract date and time features from text\n",
- "\n",
- "## Python String Methods\n",
- "\n",
- "First, we'll introduce a few methods useful for string manipulation. The following table includes a number of string operations supported by Python and `pandas`. The Python functions operate on a single string, while their equivalent in `pandas` are **vectorized** - they operate on a Series of string data.\n",
- "\n",
- "+-----------------------+-----------------+---------------------------+\n",
- "| Operation | Python | Pandas (Series) |\n",
- "+=======================+=================+===========================+\n",
- "| Transformation | - `s.lower(_)` | - `ser.str.lower(_)` |\n",
- "| | - `s.upper(_)` | - `ser.str.upper(_)` |\n",
- "+-----------------------+-----------------+---------------------------+\n",
- "| Replacement + Deletion| - `s.replace(_)`| - `ser.str.replace(_)` |\n",
- "| | | |\n",
- "+-----------------------+-----------------+---------------------------+\n",
- "| Split | - `s.split(_)` | - `ser.str.split(_)` |\n",
- "| | | |\n",
- "+-----------------------+-----------------+---------------------------+\n",
- "| Substring | - `s[1:4]` | - `ser.str[1:4]` |\n",
- "| | | |\n",
- "+-----------------------+-----------------+---------------------------+\n",
- "| Membership | - `'_' in s` | - `ser.str.contains(_)` |\n",
- "| | | |\n",
- "+-----------------------+-----------------+---------------------------+\n",
- "| Length | - `len(s)` | - `ser.str.len()` |\n",
- "| | | |\n",
- "+-----------------------+-----------------+---------------------------+\n",
- "\n",
- "We'll discuss the differences between Python string functions and `pandas` Series methods in the following section on canonicalization.\n",
- "\n",
- "### Canonicalization\n",
- "Assume we want to merge the given tables."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "#| code-fold: true\n",
- "import pandas as pd\n",
- "\n",
- "with open('data/county_and_state.csv') as f:\n",
- " county_and_state = pd.read_csv(f)\n",
- " \n",
- "with open('data/county_and_population.csv') as f:\n",
- " county_and_pop = pd.read_csv(f)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
County
\n",
- "
State
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
0
\n",
- "
De Witt County
\n",
- "
IL
\n",
- "
\n",
- "
\n",
- "
1
\n",
- "
Lac qui Parle County
\n",
- "
MN
\n",
- "
\n",
- "
\n",
- "
2
\n",
- "
Lewis and Clark County
\n",
- "
MT
\n",
- "
\n",
- "
\n",
- "
3
\n",
- "
St John the Baptist Parish
\n",
- "
LS
\n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " County State\n",
- "0 De Witt County IL\n",
- "1 Lac qui Parle County MN\n",
- "2 Lewis and Clark County MT\n",
- "3 St John the Baptist Parish LS"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
County
\n",
- "
Population
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
0
\n",
- "
DeWitt
\n",
- "
16798
\n",
- "
\n",
- "
\n",
- "
1
\n",
- "
Lac Qui Parle
\n",
- "
8067
\n",
- "
\n",
- "
\n",
- "
2
\n",
- "
Lewis & Clark
\n",
- "
55716
\n",
- "
\n",
- "
\n",
- "
3
\n",
- "
St. John the Baptist
\n",
- "
43044
\n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " County Population\n",
- "0 DeWitt 16798\n",
- "1 Lac Qui Parle 8067\n",
- "2 Lewis & Clark 55716\n",
- "3 St. John the Baptist 43044"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "display(county_and_state), display(county_and_pop);"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Last time, we used a **primary key** and **foreign key** to join two tables. While neither of these keys exist in our DataFrames, the `County` columns look similar enough. Can we convert these columns into one standard, canonical form to merge the two tables? \n",
- "\n",
- "#### Canonicalization with Python String Manipulation\n",
- "\n",
- "The following function uses Python string manipulation to convert a single county name into canonical form. It does so by eliminating whitespace, punctuation, and unnecessary text. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'stjohnthebaptist'"
- ]
- },
- "execution_count": 3,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "def canonicalize_county(county_name):\n",
- " return (\n",
- " county_name\n",
- " .lower()\n",
- " .replace(' ', '')\n",
- " .replace('&', 'and')\n",
- " .replace('.', '')\n",
- " .replace('county', '')\n",
- " .replace('parish', '')\n",
- " )\n",
- "\n",
- "canonicalize_county(\"St. John the Baptist\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We will use the `pandas` `map` function to apply the `canonicalize_county` function to every row in both DataFrames. In doing so, we'll create a new column in each called `clean_county_python` with the canonical form."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
- "county_and_pop['clean_county_python'] = county_and_pop['County'].map(canonicalize_county)\n",
- "county_and_state['clean_county_python'] = county_and_state['County'].map(canonicalize_county)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
County
\n",
- "
State
\n",
- "
clean_county_python
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
0
\n",
- "
De Witt County
\n",
- "
IL
\n",
- "
dewitt
\n",
- "
\n",
- "
\n",
- "
1
\n",
- "
Lac qui Parle County
\n",
- "
MN
\n",
- "
lacquiparle
\n",
- "
\n",
- "
\n",
- "
2
\n",
- "
Lewis and Clark County
\n",
- "
MT
\n",
- "
lewisandclark
\n",
- "
\n",
- "
\n",
- "
3
\n",
- "
St John the Baptist Parish
\n",
- "
LS
\n",
- "
stjohnthebaptist
\n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " County State clean_county_python\n",
- "0 De Witt County IL dewitt\n",
- "1 Lac qui Parle County MN lacquiparle\n",
- "2 Lewis and Clark County MT lewisandclark\n",
- "3 St John the Baptist Parish LS stjohnthebaptist"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
County
\n",
- "
Population
\n",
- "
clean_county_python
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
0
\n",
- "
DeWitt
\n",
- "
16798
\n",
- "
dewitt
\n",
- "
\n",
- "
\n",
- "
1
\n",
- "
Lac Qui Parle
\n",
- "
8067
\n",
- "
lacquiparle
\n",
- "
\n",
- "
\n",
- "
2
\n",
- "
Lewis & Clark
\n",
- "
55716
\n",
- "
lewisandclark
\n",
- "
\n",
- "
\n",
- "
3
\n",
- "
St. John the Baptist
\n",
- "
43044
\n",
- "
stjohnthebaptist
\n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " County Population clean_county_python\n",
- "0 DeWitt 16798 dewitt\n",
- "1 Lac Qui Parle 8067 lacquiparle\n",
- "2 Lewis & Clark 55716 lewisandclark\n",
- "3 St. John the Baptist 43044 stjohnthebaptist"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "display(county_and_state), display(county_and_pop);"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Canonicalization with Pandas Series Methods\n",
- "\n",
- "Alternatively, we can use `pandas` Series methods to create this standardized column. To do so, we must call the `.str` attribute of our Series object prior to calling any methods, like `.lower` and `.replace`. Notice how these method names match their equivalent built-in Python string functions.\n",
- "\n",
- "Chaining multiple Series methods in this manner eliminates the need to use the `map` function (as this code is vectorized)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "C:\\Users\\yashd\\AppData\\Local\\Temp\\ipykernel_22720\\837021704.py:3: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n",
- " county_series\n"
- ]
- }
- ],
- "source": [
- "def canonicalize_county_series(county_series):\n",
- " return (\n",
- " county_series\n",
- " .str.lower()\n",
- " .str.replace(' ', '')\n",
- " .str.replace('&', 'and')\n",
- " .str.replace('.', '')\n",
- " .str.replace('county', '')\n",
- " .str.replace('parish', '')\n",
- " )\n",
- "\n",
- "county_and_pop['clean_county_pandas'] = canonicalize_county_series(county_and_pop['County'])\n",
- "county_and_state['clean_county_pandas'] = canonicalize_county_series(county_and_state['County'])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " County Population clean_county_python clean_county_pandas\n",
- "0 DeWitt 16798 dewitt dewitt\n",
- "1 Lac Qui Parle 8067 lacquiparle lacquiparle\n",
- "2 Lewis & Clark 55716 lewisandclark lewisandclark\n",
- "3 St. John the Baptist 43044 stjohnthebaptist stjohnthebaptist"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "data": {
- "text/plain": [
- " County State clean_county_python clean_county_pandas\n",
- "0 De Witt County IL dewitt dewitt\n",
- "1 Lac qui Parle County MN lacquiparle lacquiparle\n",
- "2 Lewis and Clark County MT lewisandclark lewisandclark\n",
- "3 St John the Baptist Parish LS stjohnthebaptist stjohnthebaptist"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "display(county_and_pop), display(county_and_state);"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Extraction\n",
- "\n",
- "Extraction explores the idea of obtaining useful information from text data. This will be particularly important in model building, which we'll study in a few weeks.\n",
- "\n",
- "Say we want to read some data from a `.txt` file."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [],
- "source": [
- "with open('data/log.txt', 'r') as f:\n",
- " log_lines = f.readlines()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"\\n',\n",
- " '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] \"GET /stat141/Notes/dim.html HTTP/1.0\" 404 302 \"http://eeyore.ucdavis.edu/stat141/Notes/session.html\"\\n',\n",
- " '169.237.46.240 - \"\" [3/Feb/2006:10:18:37 -0800] \"GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1\"\\n']"
- ]
- },
- "execution_count": 9,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "log_lines"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Suppose we want to extract the day, month, year, hour, minutes, seconds, and timezone. Unfortunately, these items are not in a fixed position from the beginning of the string, so slicing by some fixed offset won't work.\n",
- "\n",
- "Instead, we can use some clever thinking. Notice how the relevant information is contained within a set of brackets, further separated by `/` and `:`. We can home in on this region of text, and split the data on these characters. Python's built-in `.split` function makes this easy."
- ]
- },
+ "cells": [
+ {
+ "cell_type": "raw",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "title: Regular Expressions\n",
+ "format:\n",
+ " html:\n",
+ " toc: true\n",
+ " toc-depth: 5\n",
+ " toc-location: right\n",
+ " code-fold: false\n",
+ " theme:\n",
+ " - cosmo\n",
+ " - cerulean\n",
+ " callout-icon: false\n",
+ "---"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "::: {.callout-note collapse=\"false\"}\n",
+ "## Learning Outcomes\n",
+ "- Understand Python string manipulation, `pandas` `Series` methods\n",
+ "- Parse and create regex, with a reference table\n",
+ "- Use vocabulary (closure, metacharacters, groups, etc.) to describe regex metacharacters\n",
+ ":::\n",
+ "\n",
+ "**This content is covered in lectures 6 and 7.**\n",
+ "\n",
+ "## Why Work with Text?\n",
+ "\n",
+ "Last lecture, we learned of the difference between quantitative and qualitative variable types. The latter includes string data — the primary focus of lecture 6. In this note, we'll discuss the necessary tools to manipulate text: Python string manipulation and regular expressions. \n",
+ "\n",
+ "There are two main reasons for working with text.\n",
+ "\n",
+ "1. Canonicalization: Convert data that has multiple formats into a standard form.\n",
+ " - By manipulating text, we can join tables with mismatched string labels.\n",
+ "\n",
+ "2. Extract information into a new feature.\n",
+ " - For example, we can extract date and time features from text.\n",
+ "\n",
+ "## Python String Methods\n",
+ "\n",
+ "First, we'll introduce a few methods useful for string manipulation. The following table includes a number of string operations supported by Python and `pandas`. The Python functions operate on a single string, while their equivalents in `pandas` are **vectorized** — they operate on a `Series` of string data.\n",
+ "\n",
+ "+-----------------------+-----------------+---------------------------+\n",
+ "| Operation | Python | `Pandas` (`Series`) |\n",
+ "+=======================+=================+===========================+\n",
+ "| Transformation | - `s.lower(_)` | - `ser.str.lower(_)` |\n",
+ "| | - `s.upper(_)` | - `ser.str.upper(_)` |\n",
+ "+-----------------------+-----------------+---------------------------+\n",
+ "| Replacement + Deletion| - `s.replace(_)`| - `ser.str.replace(_)` |\n",
+ "| | | |\n",
+ "+-----------------------+-----------------+---------------------------+\n",
+ "| Split | - `s.split(_)` | - `ser.str.split(_)` |\n",
+ "| | | |\n",
+ "+-----------------------+-----------------+---------------------------+\n",
+ "| Substring | - `s[1:4]` | - `ser.str[1:4]` |\n",
+ "| | | |\n",
+ "+-----------------------+-----------------+---------------------------+\n",
+ "| Membership | - `'_' in s` | - `ser.str.contains(_)` |\n",
+ "| | | |\n",
+ "+-----------------------+-----------------+---------------------------+\n",
+ "| Length | - `len(s)` | - `ser.str.len()` |\n",
+ "| | | |\n",
+ "+-----------------------+-----------------+---------------------------+\n",
+ "\n",
+ "We'll discuss the differences between Python string functions and `pandas` `Series` methods in the following section on canonicalization.\n",
+ "\n",
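To see the vectorized behavior concretely, here is a small sketch (the `Series` contents below are invented for illustration): the plain Python method transforms one string at a time, while the `.str` accessor applies the same operation to every element of a `Series`.

```python
import pandas as pd

s = "Hello"
ser = pd.Series(["Hello", "World"])

# A Python string method operates on a single string...
assert s.lower() == "hello"

# ...while the .str accessor vectorizes the operation over the whole Series
assert ser.str.lower().tolist() == ["hello", "world"]
assert ser.str.len().tolist() == [5, 5]
assert ser.str.contains("Wor").tolist() == [False, True]
```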
+ "### Canonicalization\n",
+ "Assume we want to merge the given tables."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#| code-fold: true\n",
+ "import pandas as pd\n",
+ "\n",
+ "with open('data/county_and_state.csv') as f:\n",
+ " county_and_state = pd.read_csv(f)\n",
+ " \n",
+ "with open('data/county_and_population.csv') as f:\n",
+ " county_and_pop = pd.read_csv(f)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "('26', 'Jan', '2014', '10', '47', '58', '-0800')"
- ]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
+ "data": {
- "source": [
- "first = log_lines[0] # Only considering the first row of data\n",
- "\n",
- "pertinent = first.split(\"[\")[1].split(']')[0]\n",
- "day, month, rest = pertinent.split('/')\n",
- "year, hour, minute, rest = rest.split(':')\n",
- "seconds, time_zone = rest.split(' ')\n",
- "day, month, year, hour, minute, seconds, time_zone"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "There are two problems with this code:\n",
- "\n",
- "1. Python's built-in functions limit us to extracting data one record at a time.\n",
- "    - This can be resolved using the `map` function or `pandas` `Series` methods.\n",
- "2. The code is quite verbose.\n",
- "    - This is a larger issue that is trickier to solve.\n",
- "\n",
- "In the next section, we'll introduce regular expressions - a tool that solves problem 2.\n",
- "\n",
- "## Regex Basics\n",
- "\n",
- "A **regular expression (\"regex\")** is a sequence of characters that specifies a search pattern. They are written to extract specific information from text. Regular expressions are essentially part of a smaller programming language embedded in Python, made available through the `re` module. As such, they have a stand-alone syntax and methods for various capabilities.\n",
- "\n",
- "Regular expressions are useful in many applications beyond data science. For example, Social Security Numbers (SSNs) are often validated with regular expressions."
+ "text/plain": [
+ " County State\n",
+ "0 De Witt County IL\n",
+ "1 Lac qui Parle County MN\n",
+ "2 Lewis and Clark County MT\n",
+ "3 St John the Baptist Parish LS"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
},
{
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'[0-9]{3}-[0-9]{2}-[0-9]{4}'"
- ]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
+ "data": {
- "source": [
- "r\"[0-9]{3}-[0-9]{2}-[0-9]{4}\" # Regular Expression Syntax\n",
- "\n",
- "# 3 of any digit, then a dash,\n",
- "# then 2 of any digit, then a dash,\n",
- "# then 4 of any digit"
+ "text/plain": [
+ " County Population\n",
+ "0 DeWitt 16798\n",
+ "1 Lac Qui Parle 8067\n",
+ "2 Lewis & Clark 55716\n",
+ "3 St. John the Baptist 43044"
]
- },
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "display(county_and_state), display(county_and_pop);"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Last time, we used a **primary key** and **foreign key** to join two tables. While neither of these keys exists in our `DataFrame`s, the `\"County\"` columns look similar enough. Can we convert these columns into one standard, canonical form to merge the two tables? \n",
+ "\n",
+ "#### Canonicalization with Python String Manipulation\n",
+ "\n",
+ "The following function uses Python string manipulation to convert a single county name into canonical form. It does so by eliminating whitespace, punctuation, and unnecessary text. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "There are a ton of resources to learn and experiment with regular expressions. A few are provided below:\n",
- "\n",
- "- [Official Regex Guide](https://docs.python.org/3/howto/regex.html)\n",
- "- [Data 100 Reference Sheet](https://ds100.org/sp22/resources/assets/hw/regex_reference.pdf) \n",
- "- [Regex101.com](https://regex101.com/)\n",
- " - Be sure to check `Python` under the category on the left.\n",
- "\n",
- "### Basic Regex Syntax\n",
- "\n",
- "There are four basic operations with regular expressions.\n",
- "\n",
- "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
- "| Operation | Order | Syntax Example | Matches | Doesn't Match | \n",
- "+=======================+=================+================+=============+===================+\n",
- "| `Or`: `|` | 4 | AA|BAAB | AA | every other string|\n",
- "| | | | BAAB | |\n",
- "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
- "| `Concatenation` | 3 | AABAAB | AABAAB | every other string|\n",
- "| | | | | |\n",
- "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
- "| `Closure`: `*` | 2 | AB*A | AA | AB |\n",
- "| (zero or more) | | | ABBBBBBA | ABABA |\n",
- "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
- "| `Group`: `()` | 1 | A(A|B)AAB | AAAAB | every other string|\n",
- "| (parenthesis) | | | ABAAB | |\n",
- "| | | | | |\n",
- "| | | | | |\n",
- "| | | (AB)*A | A | AA |\n",
- "| | | | ABABABABA | ABBA |\n",
- "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
- "\n",
- "Notice how these metacharacter operations are ordered. Rather than being literal characters, these **metacharacters** manipulate adjacent characters. `()` takes precedence, followed by `*`, and finally `|`. This allows us to differentiate between very different regex commands like `AB*` and `(AB)*`. The former reads \"`A` then zero or more copies of `B`\", while the latter specifies \"zero or more copies of `AB`\".\n",
- "\n",
- "#### Examples\n",
- "\n",
- "**Question 1**: Give a regular expression that matches `moon`, `moooon`, etc. Your expression should match any even number of `o`s except zero (i.e. don’t match `mn`).\n",
- "\n",
- "**Answer 1**: `moo(oo)*n`\n",
- "\n",
- "- Hardcoding `oo` before the capture group ensures that `mn` is not matched.\n",
- "- A capture group of `(oo)*` ensures the number of `o`'s is even.\n",
- "\n",
- "**Question 2**: Using only basic operations, formulate a regex that matches `muun`, `muuuun`, `moon`, `moooon`, etc. Your expression should match any even number of `u`s or `o`s except zero (i.e. don’t match `mn`).\n",
- "\n",
- "**Answer 2**: `m(uu(uu)*|oo(oo)*)n`\n",
- "\n",
- "- The leading `m` and trailing `n` ensure that only strings beginning with `m` and ending with `n` are matched.\n",
- "- Notice how the outer capture group surrounds the `|`. \n",
- " - Consider the regex `m(uu(uu)*)|(oo(oo)*)n`. This incorrectly matches `muu` and `oooon`. \n",
- " - Each OR clause is everything to the left and right of `|`. The incorrect solution matches only half of the string, and ignores either the beginning `m` or trailing `n`.\n",
- " - A set of parentheses must surround `|`. That way, each OR clause is everything to the left and right of `|` **within** the group. This ensures both the beginning `m` *and* trailing `n` are matched.\n",
- "\n",
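These answers, along with the `AB*` versus `(AB)*` distinction above, can be sanity-checked with Python's `re` module; `re.fullmatch` succeeds only when the entire string matches the pattern (a quick sketch, not part of the original examples):

```python
import re

# AB* means "A, then zero or more B"; (AB)* means "zero or more copies of AB"
assert re.fullmatch(r"AB*", "ABBB") is not None
assert re.fullmatch(r"(AB)*", "ABAB") is not None
assert re.fullmatch(r"(AB)*", "ABBB") is None

# Question 1: an even, nonzero number of o's
moon = r"moo(oo)*n"
assert re.fullmatch(moon, "moon") is not None
assert re.fullmatch(moon, "moooon") is not None
assert re.fullmatch(moon, "mn") is None
assert re.fullmatch(moon, "mooon") is None  # odd number of o's

# Question 2: an even, nonzero number of u's or o's
uun = r"m(uu(uu)*|oo(oo)*)n"
assert re.fullmatch(uun, "muun") is not None
assert re.fullmatch(uun, "moooon") is not None
assert re.fullmatch(uun, "muon") is None  # can't mix u's and o's
```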
- "## Regex Expanded\n",
- "\n",
- "Provided below are more complex regular expression functions. \n",
- "\n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| Operation | Syntax Example | Matches |Doesn't Match |\n",
- "+================================================+=================+================+==================+\n",
- "| `Any Character`: `.` | .U.U.U. | CUMULUS | SUCCUBUS |\n",
- "| (except newline) | | JUGULUM | TUMULTUOUS | \n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| `Character Class`: `[]` | [A-Za-z][a-z]* | word | camelCase |\n",
- "| (match one character in `[]`) | | Capitalized | 4illegal | \n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| `Repeated \"a\" Times`: `{a}` | j[aeiou]{3}hn | jaoehn | jhn |\n",
- "| | | jooohn | jaeiouhn |\n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| `Repeated \"from a to b\" Times`: `{a, b}` | j[0u]{1,2}hn | john | jhn | \n",
- "| | | juohn | jooohn |\n",
- "| | | | | \n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| `At Least One`: `+` | jo+hn | john | jhn | \n",
- "| | | joooooohn | jjohn |\n",
- "| | | | | \n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| `Zero or One`: `?` | joh?n | jon | any other string | \n",
- "| | | john | |\n",
- "| | | | | \n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "\n",
- "A character class matches a single character in its class. These characters can be hardcoded -- in the case of `[aeiou]` -- or shorthand can be specified to mean a range of characters. Examples include:\n",
- "\n",
- "1. `[A-Z]`: Any capitalized letter\n",
- "2. `[a-z]`: Any lowercase letter\n",
- "3. `[0-9]`: Any single digit\n",
- "4. `[A-Za-z]`: Any capitalized or lowercase letter\n",
- "5. `[A-Za-z0-9]`: Any capitalized or lowercase letter or single digit\n",
- "\n",
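As a quick check of the `[A-Za-z][a-z]*` row from the table above (a sketch using `re.fullmatch`, which requires the whole string to fit the pattern):

```python
import re

# [A-Za-z] matches exactly one letter of either case; [a-z]* then matches
# zero or more lowercase letters
pattern = r"[A-Za-z][a-z]*"

assert re.fullmatch(pattern, "word") is not None
assert re.fullmatch(pattern, "Capitalized") is not None
assert re.fullmatch(pattern, "camelCase") is None  # capital letter mid-word
assert re.fullmatch(pattern, "4illegal") is None   # starts with a digit
```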
- "#### Examples\n",
- "\n",
- "Let's analyze a few examples of complex regular expressions.\n",
- "\n",
- "+---------------------------------+---------------------------------+\n",
- "| Matches | Does Not Match |\n",
- "+=================================+=================================+\n",
- "| 1. `.*SPB.*` | | \n",
- "| | | \n",
- "+---------------------------------+---------------------------------+\n",
- "| RASPBERRY | SUBSPACE |\n",
- "| SPBOO | SUBSPECIES |\n",
- "+---------------------------------+---------------------------------+\n",
- "| 2. `[0-9]{3}-[0-9]{2}-[0-9]{4}` | |\n",
- "| | |\n",
- "+---------------------------------+---------------------------------+\n",
- "| 231-41-5121 | 231415121 |\n",
- "| 573-57-1821 | 57-3571821 |\n",
- "+---------------------------------+---------------------------------+\n",
- "| 3. `[a-z]+@([a-z]+\\.)+(edu|com)`| |\n",
- "| | |\n",
- "+---------------------------------+---------------------------------+\n",
- "| horse@pizza.com | frank_99@yahoo.com |\n",
- "| horse@pizza.food.com | hug@cs |\n",
- "+---------------------------------+---------------------------------+\n",
- "\n",
- "**Explanations**\n",
- "\n",
- "1. `.*SPB.*` only matches strings that contain the substring `SPB`.\n",
- " - The `.*` pattern matches any number of characters (zero or more). Newlines are not matched. \n",
- "2. This regular expression matches 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.\n",
- " - You'll recognize this as the familiar Social Security Number regular expression.\n",
- "3. Matches any email with a `com` or `edu` domain, where all characters of the email are letters.\n",
- " - At least one `.` must precede the domain name. Including a backslash `\\` before any metacharacter (in this case, the `.`) tells regex to match that character exactly.\n",
- "\n",
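Each of the three patterns in the table can be verified with `re.fullmatch` (a quick sketch):

```python
import re

# 1. .*SPB.* requires the substring SPB somewhere in the string
assert re.fullmatch(r".*SPB.*", "RASPBERRY") is not None
assert re.fullmatch(r".*SPB.*", "SUBSPACE") is None  # no "SPB" substring

# 2. The Social Security Number pattern: ddd-dd-dddd
ssn = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
assert re.fullmatch(ssn, "231-41-5121") is not None
assert re.fullmatch(ssn, "231415121") is None  # missing the dashes

# 3. Lowercase-letter emails ending in .edu or .com
email = r"[a-z]+@([a-z]+\.)+(edu|com)"
assert re.fullmatch(email, "horse@pizza.food.com") is not None
assert re.fullmatch(email, "frank_99@yahoo.com") is None  # _ and digits aren't lowercase letters
```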
- "## Convenient Regex\n",
- "\n",
- "Here are a few more convenient regular expressions. \n",
- "\n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| Operation | Syntax Example | Matches |Doesn't Match |\n",
- "+================================================+=================+================+==================+\n",
- "| `built in character class` | `\\w+` | Fawef_03 |this person |\n",
- "| | `\\d+` | 231123 |423 people |\n",
- "| | `\\s+` | `whitespace` | `non-whitespace` |\n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| `character class negation`: `[^]` | [^a-z]+. | PEPPERS3982 | porch |\n",
- "| (everything except the given characters) | | 17211!↑å | CLAmS | \n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| `escape character`: `\\` | cow\\\\.com | cow.com | cowscom |\n",
- "| (match the literal next character) | | | |\n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| `beginning of line`: `^` | ^ark | ark two | dark | \n",
- "| | | ark o ark | |\n",
- "| | | | | \n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| `end of line`: `$` | ark$ | dark | ark two | \n",
- "| | | ark o ark | |\n",
- "| | | | | \n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "| `lazy version of zero or more` : `*?` | 5.*?5 | 5005 | 5005005 | \n",
- "| | | 55 | |\n",
- "| | | | | \n",
- "+------------------------------------------------+-----------------+----------------+------------------+\n",
- "\n",
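A few of these shorthands, checked with the `re` module (note that `re.findall` returns every non-overlapping match as a list):

```python
import re

# \w matches letters, digits, and underscores; a space breaks the match
assert re.fullmatch(r"\w+", "Fawef_03") is not None
assert re.fullmatch(r"\w+", "this person") is None

# ^ and $ anchor a pattern to the start and end of the string
assert re.findall(r"^ark", "ark two") == ["ark"]
assert re.findall(r"^ark", "dark") == []
assert re.findall(r"ark$", "dark") == ["ark"]

# Lazy *? takes the shortest possible match; greedy * takes the longest
assert re.findall(r"5.*?5", "5005005") == ["5005"]
assert re.findall(r"5.*5", "5005005") == ["5005005"]
```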
- "### Greediness\n",
- "\n",
- "In order to fully understand the last operation in the table, we have to discuss greediness. Regex is greedy – it will look for the longest possible match in a string. To motivate this with an example, consider the pattern `<div>.*</div>`. Given the sentence below, we would hope that the bolded portions would be matched:\n",
- "\n",
- "\"This is a **\<div>example\<\/div>** of greediness \<div>in\<\/div> regular expressions.\"\n",
- "\n",
- "In actuality, the way regex processes the text given that pattern is as follows:\n",
- "\n",
- "1. \"Look for the exact string \<div>\", \n",
- "\n",
- "2. then, \"look for any character 0 or more times\", \n",
- "\n",
- "3. then, \"look for the exact string \<\/div>\".\n",
- "\n",
- "The result is all of the characters from the leftmost \<div> to the rightmost \<\/div> (inclusive). We can fix this by making the pattern non-greedy: `<div>.*?</div>`. You can read more in the documentation [here](https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy).\n",
- "\n",
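The effect is easy to see in code: the greedy pattern swallows everything between the first `<div>` and the last `</div>`, while the non-greedy version stops at the nearest closing tag (a quick sketch):

```python
import re

text = "This is a <div>example</div> of greediness <div>in</div> regular expressions."

# Greedy: .* runs as far right as possible, so the match spans from the
# first <div> to the *last* </div>
assert re.findall(r"<div>.*</div>", text) == [
    "<div>example</div> of greediness <div>in</div>"
]

# Non-greedy: .*? stops at the nearest </div>, recovering each tag pair
assert re.findall(r"<div>.*?</div>", text) == ["<div>example</div>", "<div>in</div>"]
```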
- "### Examples\n",
- "\n",
- "Let's revisit our earlier problem of extracting date/time data from the given `.txt` file. Here is how the data looked."
+ "data": {
+ "text/plain": [
+ "'stjohnthebaptist'"
]
- },
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def canonicalize_county(county_name):\n",
+ " return (\n",
+ " county_name\n",
+ " .lower()\n",
+ " .replace(' ', '')\n",
+ " .replace('&', 'and')\n",
+ " .replace('.', '')\n",
+ " .replace('county', '')\n",
+ " .replace('parish', '')\n",
+ " )\n",
+ "\n",
+ "canonicalize_county(\"St. John the Baptist\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We will use the `pandas` `map` function to apply the `canonicalize_county` function to every row in both `DataFrame`s. In doing so, we'll create a new column in each called `clean_county_python` with the canonical form."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "county_and_pop['clean_county_python'] = county_and_pop['County'].map(canonicalize_county)\n",
+ "county_and_state['clean_county_python'] = county_and_state['County'].map(canonicalize_county)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"\\n'"
- ]
- },
- "execution_count": 12,
- "metadata": {},
- "output_type": "execute_result"
- }
+ "data": {
- "source": [
- "log_lines[0]"
+ "text/plain": [
+ " County State clean_county_python\n",
+ "0 De Witt County IL dewitt\n",
+ "1 Lac qui Parle County MN lacquiparle\n",
+ "2 Lewis and Clark County MT lewisandclark\n",
+ "3 St John the Baptist Parish LS stjohnthebaptist"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
},
{
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Question**: Give a regular expression that matches everything contained within and including the brackets - the day, month, year, hour, minutes, seconds, and time zone.\n",
- "\n",
- "**Answer**: `\\[.*\\]`\n",
- "\n",
- "- Notice how matching the literal `[` and `]` is necessary. Therefore, an escape character `\\` is required before both `[` and `]` - otherwise, regex will interpret these metacharacters as delimiting a character class. \n",
- "- We need to match a particular format between `[` and `]`. For this example, `.*` will suffice.\n",
- "\n",
- "**Alternative Solution**: `\\[\\w+/\\w+/\\w+:\\w+:\\w+:\\w+\\s-\\w+\\]`\n",
- "\n",
- "- This solution is much safer. \n",
- " - Imagine the data between `[` and `]` was garbage - `.*` will still match that. \n",
- " - The alternate solution will only match data that follows the correct format.\n",
- "\n",
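Applying both answers to the first log line shows why the stricter pattern is safer (a quick sketch with `re.findall`):

```python
import re

# The first line of the log file from earlier in the note
line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n')

# Simple answer: everything between the brackets, inclusive
assert re.findall(r"\[.*\]", line) == ["[26/Jan/2014:10:47:58 -0800]"]

# Safer answer: the day/month/year:hour:minute:second zone format must be present
strict = r"\[\w+/\w+/\w+:\w+:\w+:\w+\s-\w+\]"
assert re.findall(strict, line) == ["[26/Jan/2014:10:47:58 -0800]"]

# The strict pattern rejects garbage between brackets; the simple one does not
assert re.findall(strict, "[garbage]") == []
assert re.findall(r"\[.*\]", "[garbage]") == ["[garbage]"]
```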
- "## Regex in Python and Pandas (Regex Groups)\n",
- "\n",
- "### Canonicalization\n",
- "\n",
- "#### Canonicalization with Regex\n",
- "\n",
- "Earlier in this note, we examined the process of canonicalization using Python string manipulation and `pandas` Series methods. However, we mentioned this approach had a major flaw: our code was unnecessarily verbose. Equipped with our knowledge of regular expressions, let's fix this.\n",
- "\n",
- "To do so, we need to understand a few functions in the `re` module. The first of these is the substitute function: `re.sub(pattern, repl, text)`. It behaves similarly to Python's built-in `.replace` method, and returns text with all instances of `pattern` replaced by `repl`. \n",
- "\n",
- "The regular expression here removes text surrounded by `<>` (also known as HTML tags).\n",
- "\n",
- "In order, the pattern matches ... \n",
- "1. a single `<`\n",
- "2. any character that is not a `>` : div, td valign..., /td, /div\n",
- "3. a single `>`\n",
- "\n",
- "Any substring in `text` that fulfills all three conditions will be replaced by `''`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'Moo'"
- ]
- },
- "execution_count": 13,
- "metadata": {},
- "output_type": "execute_result"
- }
+ "data": {
+ "text/html": [
+ "
\"\n",
- "pattern = r\"<[^>]+>\"\n",
- "re.sub(pattern, '', text) "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Notice the `r` preceding the regular expression pattern; this specifies that the regular expression is a raw string. Raw strings do not process escape sequences (e.g., the Python newline escape `\\n`). This makes them useful for regular expressions, which often contain literal `\\` characters.\n",
- "\n",
- "In other words, don't forget to tag your regex with an `r`.\n",
- "\n",
- "#### Canonicalization with Pandas\n",
- "\n",
- "We can also use regular expressions with `pandas` `Series` methods. This gives us the benefit of operating on an entire column of data as opposed to a single value. The code is simple: `ser.str.replace(pattern, repl, regex=True)`.\n",
- "\n",
- "Consider the following DataFrame `html_data` with a single column."
+ "text/plain": [
+ " County Population clean_county_python\n",
+ "0 DeWitt 16798 dewitt\n",
+ "1 Lac Qui Parle 8067 lacquiparle\n",
+ "2 Lewis & Clark 55716 lewisandclark\n",
+ "3 St. John the Baptist 43044 stjohnthebaptist"
]
- },
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "display(county_and_state), display(county_and_pop);"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Canonicalization with Pandas Series Methods\n",
+ "\n",
+ "Alternatively, we can use `pandas` `Series` methods to create this standardized column. To do so, we must call the `.str` attribute of our `Series` object prior to calling any methods, like `.lower` and `.replace`. Notice how these method names match their equivalent built-in Python string functions.\n",
+ "\n",
+ "Chaining multiple `Series` methods in this manner eliminates the need to use the `map` function (as this code is vectorized)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [],
- "source": [
- "#| code-fold: true\n",
- "data = {\"HTML\": [\"
"
],
- "source": [
- "pattern = r\"<[^>]+>\"\n",
- "html_data['HTML'].str.replace(pattern, '', regex=True)"
+ "text/plain": [
+ " County State clean_county_python clean_county_pandas\n",
+ "0 De Witt County IL dewitt dewitt\n",
+ "1 Lac qui Parle County MN lacquiparle lacquiparle\n",
+ "2 Lewis and Clark County MT lewisandclark lewisandclark\n",
+ "3 St John the Baptist Parish LS stjohnthebaptist stjohnthebaptist"
]
- },
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "display(county_and_pop), display(county_and_state);"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Extraction\n",
+ "\n",
+ "Extraction explores the idea of obtaining useful information from text data. This will be particularly important in model building, which we'll study in a few weeks.\n",
+ "\n",
+ "Say we want to read some data from a `.txt` file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open('data/log.txt', 'r') as f:\n",
+ " log_lines = f.readlines()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Extraction\n",
- "\n",
- "#### Extraction with Regex\n",
- "\n",
- "Just like with canonicalization, the `re` module provides the ability to extract relevant text from a string: `re.findall(pattern, text)`. This function returns a list of all matches to `pattern`. \n",
- "\n",
- "Using the familiar regular expression for Social Security Numbers:"
+ "data": {
+ "text/plain": [
+ "['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"\\n',\n",
+ " '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] \"GET /stat141/Notes/dim.html HTTP/1.0\" 404 302 \"http://eeyore.ucdavis.edu/stat141/Notes/session.html\"\\n',\n",
+ " '169.237.46.240 - \"\" [3/Feb/2006:10:18:37 -0800] \"GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1\"\\n']"
]
- },
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "log_lines"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Suppose we want to extract the day, month, year, hour, minutes, seconds, and time zone. Unfortunately, these items are not in a fixed position from the beginning of the string, so slicing by some fixed offset won't work.\n",
+ "\n",
+ "Instead, we can use some clever thinking. Notice how the relevant information is contained within a set of brackets, further separated by `/` and `:`. We can home in on this region of text and split the data on these characters. Python's built-in `.split` method makes this easy."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['123-45-6789', '321-45-6789']"
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text = \"My social security number is 123-45-6789 bro, or maybe it’s 321-45-6789.\"\n",
- "pattern = r\"[0-9]{3}-[0-9]{2}-[0-9]{4}\"\n",
- "re.findall(pattern, text) "
+ "data": {
+ "text/plain": [
+ "('26', 'Jan', '2014', '10', '47', '58', '-0800')"
]
- },
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "first = log_lines[0] # Only considering the first row of data\n",
+ "\n",
+ "pertinent = first.split(\"[\")[1].split(']')[0]\n",
+ "day, month, rest = pertinent.split('/')\n",
+ "year, hour, minute, rest = rest.split(':')\n",
+ "seconds, time_zone = rest.split(' ')\n",
+ "day, month, year, hour, minute, seconds, time_zone"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There are two problems with this code:\n",
+ "\n",
+ "1. Python's built-in functions limit us to extracting data one record at a time.\n",
+ " - This can be resolved using the `map` function or `pandas` `Series` methods.\n",
+ "2. The code is quite verbose.\n",
+ "    - This is a larger issue that is trickier to solve.\n",
+ "\n",
+ "In the next section, we'll introduce regular expressions - a tool that solves problem 2.\n",
+ "\n",
+ "## Regex Basics\n",
+ "\n",
+ "A **regular expression (\"regex\")** is a sequence of characters that specifies a search pattern. They are written to extract specific information from text. Regular expressions are essentially a small programming language embedded in Python, made available through the `re` module. As such, they have a stand-alone syntax and methods for various capabilities.\n",
+ "\n",
+ "Regular expressions are useful in many applications beyond data science. For example, Social Security Numbers (SSNs) are often validated with regular expressions."
+ ]
+ },
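As a quick sketch (not part of the note's original code), the SSN pattern above can be used to validate strings with `re.fullmatch`, which succeeds only when the entire string follows the pattern:

```python
import re

ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
# fullmatch returns a Match object only if the whole string fits the format
print(bool(re.fullmatch(ssn_pattern, "123-45-6789")))  # True
print(bool(re.fullmatch(ssn_pattern, "123-456-789")))  # False
```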
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Extraction with Pandas\n",
- "\n",
- "Pandas similarily provides extraction functionality on a Series of data: `ser.str.findall(pattern)`\n",
- "\n",
- "Consider the following DataFrame `ssn_data`."
+ "data": {
+ "text/plain": [
+ "'[0-9]{3}-[0-9]{2}-[0-9]{4}'"
]
- },
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "r\"[0-9]{3}-[0-9]{2}-[0-9]{4}\" # Regular Expression Syntax\n",
+ "\n",
+ "# 3 of any digit, then a dash,\n",
+ "# then 2 of any digit, then a dash,\n",
+ "# then 4 of any digit"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "There are a ton of resources to learn and experiment with regular expressions. A few are provided below:\n",
+ "\n",
+ "- [Official Regex Guide](https://docs.python.org/3/howto/regex.html)\n",
+ "- [Data 100 Reference Sheet](https://ds100.org/sp22/resources/assets/hw/regex_reference.pdf) \n",
+ "- [Regex101.com](https://regex101.com/)\n",
+ "    - Be sure to check `Python` under the flavor category on the left.\n",
+ "\n",
+ "### Basic Regex Syntax\n",
+ "\n",
+ "There are four basic operations with regular expressions.\n",
+ "\n",
+ "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
+ "| Operation | Order | Syntax Example | Matches | Doesn't Match | \n",
+ "+=======================+=================+================+=============+===================+\n",
+ "| `Or`: `|` | 4 | AA|BAAB | AA | every other string|\n",
+ "| | | | BAAB | |\n",
+ "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
+ "| `Concatenation` | 3 | AABAAB | AABAAB | every other string|\n",
+ "| | | | | |\n",
+ "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
+ "| `Closure`: `*` | 2 | AB*A | AA | AB |\n",
+ "| (zero or more) | | | ABBBBBBA | ABABA |\n",
+ "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
+ "| `Group`: `()` | 1 | A(A|B)AAB | AAAAB | every other string|\n",
+ "| (parentheses)         |                 |                |             |                   |\n",
+ "| | | | | |\n",
+ "| | | | | |\n",
+ "| | | (AB)*A | A | AA |\n",
+ "| | | | ABABABABA | ABBA |\n",
+ "+-----------------------+-----------------+----------------+-------------+-------------------+\n",
+ "\n",
+ "Notice how these metacharacter operations are ordered. Rather than being literal characters, these **metacharacters** manipulate adjacent characters. `()` takes precedence, followed by `*`, and finally `|`. This allows us to differentiate between very different regex commands like `AB*` and `(AB)*`. The former reads \"`A` then zero or more copies of `B`\", while the latter specifies \"zero or more copies of `AB`\".\n",
+ "\n",
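A small sanity check of this precedence rule (a sketch added here, not from the original note), contrasting `AB*` with `(AB)*` using `re.fullmatch`:

```python
import re

# AB* : "A", then zero or more copies of "B"
assert re.fullmatch(r"AB*", "ABBB")
assert re.fullmatch(r"AB*", "ABAB") is None

# (AB)* : zero or more copies of "AB"
assert re.fullmatch(r"(AB)*", "ABAB")
assert re.fullmatch(r"(AB)*", "")  # zero copies is also a match
```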
+ "#### Examples\n",
+ "\n",
+ "**Question 1**: Give a regular expression that matches `moon`, `moooon`, etc. Your expression should match any even number of `o`s except zero (i.e. don’t match `mn`).\n",
+ "\n",
+ "**Answer 1**: `moo(oo)*n`\n",
+ "\n",
+ "- Hardcoding `oo` before the capture group ensures that `mn` is not matched.\n",
+ "- A capture group of `(oo)*` ensures the number of `o`'s is even.\n",
+ "\n",
+ "**Question 2**: Using only basic operations, formulate a regex that matches `muun`, `muuuun`, `moon`, `moooon`, etc. Your expression should match any even number of `u`s or `o`s except zero (i.e. don’t match `mn`).\n",
+ "\n",
+ "**Answer 2**: `m(uu(uu)*|oo(oo)*)n`\n",
+ "\n",
+ "- The leading `m` and trailing `n` ensure that only strings beginning with `m` and ending with `n` are matched.\n",
+ "- Notice how the outer capture group surrounds the `|`. \n",
+ " - Consider the regex `m(uu(uu)*)|(oo(oo)*)n`. This incorrectly matches `muu` and `oooon`. \n",
+ " - Each OR clause is everything to the left and right of `|`. The incorrect solution matches only half of the string, and ignores either the beginning `m` or trailing `n`.\n",
+ "    - A set of parentheses must surround `|`. That way, each OR clause is everything to the left and right of `|` **within** the group. This ensures both the beginning `m` *and* trailing `n` are matched.\n",
+ "\n",
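We can verify Answer 1 directly (a quick check added here, not part of the original note):

```python
import re

even_o = r"moo(oo)*n"  # an even, nonzero number of o's
assert re.fullmatch(even_o, "moon")
assert re.fullmatch(even_o, "moooon")
assert re.fullmatch(even_o, "mn") is None     # zero o's rejected
assert re.fullmatch(even_o, "mooon") is None  # odd count rejected
```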
+ "## Regex Expanded\n",
+ "\n",
+ "Provided below are more complex regular expression operations. \n",
+ "\n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| Operation | Syntax Example | Matches |Doesn't Match |\n",
+ "+================================================+=================+================+==================+\n",
+ "| `Any Character`: `.` | .U.U.U. | CUMULUS | SUCCUBUS |\n",
+ "| (except newline) | | JUGULUM | TUMULTUOUS | \n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| `Character Class`: `[]` | [A-Za-z][a-z]* | word | camelCase |\n",
+ "| (match one character in `[]`) | | Capitalized | 4illegal | \n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| `Repeated \"a\" Times`: `{a}` | j[aeiou]{3}hn | jaoehn | jhn |\n",
+ "| | | jooohn | jaeiouhn |\n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| `Repeated \"from a to b\" Times`: `{a,b}`        | j[0u]{1,2}hn    | john           | jhn              | \n",
+ "| | | juohn | jooohn |\n",
+ "| | | | | \n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| `At Least One`: `+` | jo+hn | john | jhn | \n",
+ "| | | joooooohn | jjohn |\n",
+ "| | | | | \n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| `Zero or One`: `?` | joh?n | jon | any other string | \n",
+ "| | | john | |\n",
+ "| | | | | \n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "\n",
+ "A character class matches a single character in its class. These characters can be hardcoded -- in the case of `[aeiou]` -- or shorthand can be specified to mean a range of characters. Examples include:\n",
+ "\n",
+ "1. `[A-Z]`: Any capitalized letter\n",
+ "2. `[a-z]`: Any lowercase letter\n",
+ "3. `[0-9]`: Any single digit\n",
+ "4. `[A-Za-z]`: Any capitalized or lowercase letter\n",
+ "5. `[A-Za-z0-9]`: Any capitalized or lowercase letter or single digit\n",
+ "\n",
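To see a character class in action (a sketch added here, not from the original note), `re.findall` with the table's `[A-Za-z][a-z]*` pattern splits `camelCase` into its two words, since each match must start fresh once an uppercase letter appears:

```python
import re

# One letter, then zero or more lowercase letters
print(re.findall(r"[A-Za-z][a-z]*", "camelCase"))  # ['camel', 'Case']
```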
+ "#### Examples\n",
+ "\n",
+ "Let's analyze a few examples of complex regular expressions.\n",
+ "\n",
+ "+---------------------------------+---------------------------------+\n",
+ "| Matches | Does Not Match |\n",
+ "+=================================+=================================+\n",
+ "| 1. `.*SPB.*` | | \n",
+ "| | | \n",
+ "+---------------------------------+---------------------------------+\n",
+ "| RASPBERRY | SUBSPACE |\n",
+ "| SPBOO | SUBSPECIES |\n",
+ "+---------------------------------+---------------------------------+\n",
+ "| 2. `[0-9]{3}-[0-9]{2}-[0-9]{4}` | |\n",
+ "| | |\n",
+ "+---------------------------------+---------------------------------+\n",
+ "| 231-41-5121 | 231415121 |\n",
+ "| 573-57-1821 | 57-3571821 |\n",
+ "+---------------------------------+---------------------------------+\n",
+ "| 3. `[a-z]+@([a-z]+\\.)+(edu|com)`| |\n",
+ "| | |\n",
+ "+---------------------------------+---------------------------------+\n",
+ "| horse@pizza.com | frank_99@yahoo.com |\n",
+ "| horse@pizza.food.com | hug@cs |\n",
+ "+---------------------------------+---------------------------------+\n",
+ "\n",
+ "**Explanations**\n",
+ "\n",
+ "1. `.*SPB.*` only matches strings that contain the substring `SPB`.\n",
+ "    - The `.*` metacharacter matches zero or more of any character. Newlines do not count.\n",
+ "2. This regular expression matches 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.\n",
+ " - You'll recognize this as the familiar Social Security Number regular expression.\n",
+ "3. Matches any email with a `com` or `edu` domain, where all characters of the email are letters.\n",
+ " - At least one `.` must precede the domain name. Including a backslash `\\` before any metacharacter (in this case, the `.`) tells RegEx to match that character exactly.\n",
+ "\n",
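Checking the third pattern against the table's examples (a verification sketch, not part of the original note):

```python
import re

email = r"[a-z]+@([a-z]+\.)+(edu|com)"
assert re.fullmatch(email, "horse@pizza.com")
assert re.fullmatch(email, "horse@pizza.food.com")
assert re.fullmatch(email, "frank_99@yahoo.com") is None  # '_' and digits not in [a-z]
assert re.fullmatch(email, "hug@cs") is None              # no '.' before a domain
```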
+ "## Convenient Regex\n",
+ "\n",
+ "Here are a few more convenient regular expressions. \n",
+ "\n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| Operation | Syntax Example | Matches |Doesn't Match |\n",
+ "+================================================+=================+================+==================+\n",
+ "| `built in character class` | `\\w+` | Fawef_03 |this person |\n",
+ "| | `\\d+` | 231123 |423 people |\n",
+ "| | `\\s+` | `whitespace` | `non-whitespace` |\n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| `character class negation`: `[^]` | [^a-z]+. | PEPPERS3982 | porch |\n",
+ "| (everything except the given characters) | | 17211!↑å | CLAmS | \n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| `escape character`: `\\` | cow\\\\.com | cow.com | cowscom |\n",
+ "| (match the literal next character) | | | |\n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| `beginning of line`: `^` | ^ark | ark two | dark | \n",
+ "| | | ark o ark | |\n",
+ "| | | | | \n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| `end of line`: `$` | ark$ | dark | ark two | \n",
+ "| | | ark o ark | |\n",
+ "| | | | | \n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "| `lazy version of zero or more` : `*?` | 5.*?5 | 5005 | 5005005 | \n",
+ "| | | 55 | |\n",
+ "| | | | | \n",
+ "+------------------------------------------------+-----------------+----------------+------------------+\n",
+ "\n",
+ "### Greediness\n",
+ "\n",
+ "In order to fully understand the last operation in the table, we have to discuss greediness. RegEx is greedy – it will look for the longest possible match in a string. To motivate this with an example, consider the pattern `<div>.*</div>`. Given the sentence below, we would hope that the bolded portions would be matched:\n",
+ "\n",
+ "\"This is a **\\<div>example\\</div>** of greediness \\<div>in\\</div> regular expressions.\"\n",
+ "\n",
+ "In actuality, the way RegEx processes the text given that pattern is as follows:\n",
+ "\n",
+ "1. \"Look for the exact string `<div>`\",\n",
+ "\n",
+ "2. then, \"look for any character 0 or more times\",\n",
+ "\n",
+ "3. then, \"look for the exact string `</div>`\".\n",
+ "\n",
+ "The result would be all the characters from the leftmost `<div>` to the rightmost `</div>` (inclusive). We can fix this by making the pattern non-greedy: `<div>.*?</div>`. You can read more in the documentation [here](https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy).\n",
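The greedy and lazy behaviors are easy to compare directly (a demonstration sketch, not part of the original note):

```python
import re

s = "This is a <div>example</div> of greediness <div>in</div> regular expressions."
# Greedy: one long match spanning from the first <div> to the last </div>
print(re.findall(r"<div>.*</div>", s))
# Lazy: two short matches, stopping at the first </div> each time
print(re.findall(r"<div>.*?</div>", s))
```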
+ "\n",
+ "### Examples\n",
+ "\n",
+ "Let's revisit our earlier problem of extracting date/time data from the given `.txt` file. Here is how the data looked."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [],
- "source": [
- "#| code-fold: true\n",
- "data = {\"SSN\": [\"987-65-4321\", \"forty\", \\\n",
- " \"123-45-6789 bro or 321-45-6789\",\n",
- " \"999-99-9999\"]}\n",
- "ssn_data = pd.DataFrame(data)"
+ "data": {
+ "text/plain": [
+ "'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"\\n'"
]
- },
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "log_lines[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Give a regular expression that matches everything contained within and including the brackets - the day, month, year, hour, minutes, seconds, and time zone.\n",
+ "\n",
+ "**Answer**: `\\[.*\\]`\n",
+ "\n",
+ "- Notice how matching the literal `[` and `]` is necessary. Therefore, an escape character `\\` is required before both `[` and `]` — otherwise these metacharacters will match character classes. \n",
+ "- We need to match a particular format between `[` and `]`. For this example, `.*` will suffice.\n",
+ "\n",
+ "**Alternative Solution**: `\\[\\w+/\\w+/\\w+:\\w+:\\w+:\\w+\\s-\\w+\\]`\n",
+ "\n",
+ "- This solution is much safer. \n",
+ " - Imagine the data between `[` and `]` was garbage - `.*` will still match that. \n",
+ " - The alternate solution will only match data that follows the correct format.\n",
+ "\n",
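Applying the simpler answer to the log line above (a quick check added here, not part of the original note):

```python
import re

line = '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200'
# Escaped brackets match the literal [ and ]; .* grabs everything in between
print(re.findall(r"\[.*\]", line))  # ['[26/Jan/2014:10:47:58 -0800]']
```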
+ "## Regex in Python and Pandas (Regex Groups)\n",
+ "\n",
+ "### Canonicalization\n",
+ "\n",
+ "#### Canonicalization with Regex\n",
+ "\n",
+ "Earlier in this note, we examined the process of canonicalization using Python string manipulation and `pandas` `Series` methods. However, we mentioned this approach had a major flaw: our code was unnecessarily verbose. Equipped with our knowledge of regular expressions, let's fix this.\n",
+ "\n",
+ "To do so, we need to understand a few functions in the `re` module. The first of these is the substitute function: `re.sub(pattern, repl, text)`. It behaves similarly to Python's built-in `.replace` function, and returns `text` with all instances of `pattern` replaced by `repl`. \n",
+ "\n",
+ "The regular expression here removes text surrounded by `<>` (also known as HTML tags).\n",
+ "\n",
+ "In order, the pattern matches ... \n",
+ "1. a single `<`\n",
+ "2. one or more characters that are not `>` (e.g., `div`, `td valign...`, `/td`, `/div`)\n",
+ "3. a single `>`\n",
+ "\n",
+ "Any substring in `text` that fulfills all three conditions will be replaced by `''`."
+ ]
+ },
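A self-contained sketch of this tag-removal step (the HTML string here is hypothetical, standing in for the note's `text` variable):

```python
import re

text = "<div><td valign='top'>Moo</td></div>"
pattern = r"<[^>]+>"  # a '<', one or more non-'>' characters, then '>'
print(re.sub(pattern, "", text))  # Moo
```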
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\"\n",
+ "pattern = r\"<[^>]+>\"\n",
+ "re.sub(pattern, '', text) "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice the `r` preceding the regular expression pattern; this specifies the regular expression is a raw string. Raw strings do not recognize escape sequences (e.g., the Python newline escape sequence `\\n`). This makes them useful for regular expressions, which often contain literal `\\` characters.\n",
+ "\n",
+ "In other words, don't forget to tag your RegEx with an `r`.\n",
+ "\n",
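The difference between raw and regular strings is easy to see by counting characters (a quick illustration added here, not from the original note):

```python
# In a regular string, \n is a single newline character;
# in a raw string, it is a backslash followed by the letter n
print(len("\n"))   # 1
print(len(r"\n"))  # 2
```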
+ "#### Canonicalization with Pandas\n",
+ "\n",
+ "We can also use regular expressions with `pandas` `Series` methods. This gives us the benefit of operating on an entire column of data as opposed to a single value. The code is simple: `ser.str.replace(pattern, repl, regex=True)`.\n",
+ "\n",
+ "Consider the following `DataFrame` `html_data` with a single column."
+ ]
+ },
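A minimal sketch of the vectorized version (the two HTML strings below are hypothetical stand-ins for the `html_data` column):

```python
import pandas as pd

html_ser = pd.Series(["<div>Moo</div>", "<b>Cow</b>"])
# Same tag-removal pattern as before, now applied to every row at once
print(html_ser.str.replace(r"<[^>]+>", "", regex=True).tolist())  # ['Moo', 'Cow']
```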
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#| code-fold: true\n",
+ "data = {\"HTML\": [\"
"
],
- "source": [
- "ssn_data[\"SSN\"].str.extractall(pattern_cg)"
+ "text/plain": [
+ " SSN\n",
+ "0 987-65-4321\n",
+ "1 forty\n",
+ "2 123-45-6789 bro or 321-45-6789\n",
+ "3 999-99-9999"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Regular Expression Capture Groups\n",
- "\n",
- "Earlier we used parentheses `(` `)` to specify the highest order of operation in regular expressions. However, they have another meaning; paranthesis are often used to represent **capture groups**. Capture groups are essentially, a set of smaller regular expressions that match multiple substrings in text data. \n",
- "\n",
- "Let's take a look at an example.\n",
- "\n",
- "#### Example 1"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [],
- "source": [
- "text = \"Observations: 03:04:53 - Horse awakens. \\\n",
- " 03:05:14 - Horse goes back to sleep.\""
- ]
- },
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ssn_data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Say we want to capture all occurences of time data (hour, minute, and second) as *seperate entities*."
+ "data": {
+ "text/plain": [
+ "0 [987-65-4321]\n",
+ "1 []\n",
+ "2 [123-45-6789, 321-45-6789]\n",
+ "3 [999-99-9999]\n",
+ "Name: SSN, dtype: object"
]
- },
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ssn_data[\"SSN\"].str.findall(pattern)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For each row, this function returns a list of all matches of the pattern in that string.\n",
+ "\n",
+ "As you may expect, there are similar `pandas` equivalents for other `re` functions as well. `Series.str.extract` takes in a pattern and returns a `DataFrame` of each capture group’s first match in the string. In contrast, `Series.str.extractall` returns a multi-indexed `DataFrame` of all matches for each capture group. You can see the difference in the outputs below:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[('03', '04', '53'), ('03', '05', '14')]"
- ]
- },
- "execution_count": 24,
- "metadata": {},
- "output_type": "execute_result"
- }
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
0
\n",
+ "
1
\n",
+ "
2
\n",
+ "
\n",
+ " \n",
+ " \n",
+ "
\n",
+ "
0
\n",
+ "
987
\n",
+ "
65
\n",
+ "
4321
\n",
+ "
\n",
+ "
\n",
+ "
1
\n",
+ "
NaN
\n",
+ "
NaN
\n",
+ "
NaN
\n",
+ "
\n",
+ "
\n",
+ "
2
\n",
+ "
123
\n",
+ "
45
\n",
+ "
6789
\n",
+ "
\n",
+ "
\n",
+ "
3
\n",
+ "
999
\n",
+ "
99
\n",
+ "
9999
\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
],
- "source": [
- "pattern_1 = r\"(\\d\\d):(\\d\\d):(\\d\\d)\"\n",
- "re.findall(pattern_1, text)"
+ "text/plain": [
+ " 0 1 2\n",
+ "0 987 65 4321\n",
+ "1 NaN NaN NaN\n",
+ "2 123 45 6789\n",
+ "3 999 99 9999"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Notice how the given pattern has 3 capture groups, each specified by the regular expression `(\\d\\d)`. We then use `re.findall` to return these capture groups, each as tuples containing 3 matches.\n",
- "\n",
- "These regular expression capture groups can be different. We can use the `(\\d{2})` shorthand to extract the same data."
- ]
- },
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pattern_cg = r\"([0-9]{3})-([0-9]{2})-([0-9]{4})\"\n",
+ "ssn_data[\"SSN\"].str.extract(pattern_cg)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[('03', '04', '53'), ('03', '05', '14')]"
- ]
- },
- "execution_count": 25,
- "metadata": {},
- "output_type": "execute_result"
- }
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
0
\n",
+ "
1
\n",
+ "
2
\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
match
\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ " \n",
+ " \n",
+ "
\n",
+ "
0
\n",
+ "
0
\n",
+ "
987
\n",
+ "
65
\n",
+ "
4321
\n",
+ "
\n",
+ "
\n",
+ "
2
\n",
+ "
0
\n",
+ "
123
\n",
+ "
45
\n",
+ "
6789
\n",
+ "
\n",
+ "
\n",
+ "
1
\n",
+ "
321
\n",
+ "
45
\n",
+ "
6789
\n",
+ "
\n",
+ "
\n",
+ "
3
\n",
+ "
0
\n",
+ "
999
\n",
+ "
99
\n",
+ "
9999
\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
],
- "source": [
- "pattern_2 = r\"(\\d\\d):(\\d\\d):(\\d{2})\"\n",
- "re.findall(pattern_2, text)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Example 2\n",
- "\n",
- "With the notion of capture groups, convince yourself how the following regular expression works."
+ "text/plain": [
+ " 0 1 2\n",
+ " match \n",
+ "0 0 987 65 4321\n",
+ "2 0 123 45 6789\n",
+ " 1 321 45 6789\n",
+ "3 0 999 99 9999"
]
- },
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ssn_data[\"SSN\"].str.extractall(pattern_cg)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Regular Expression Capture Groups\n",
+ "\n",
+ "Earlier we used parentheses `(` `)` to specify the highest order of operation in regular expressions. However, they have another meaning; parentheses are often used to represent **capture groups**. Capture groups are, essentially, a set of smaller regular expressions that match multiple substrings in text data. \n",
+ "\n",
+ "Let's take a look at an example.\n",
+ "\n",
+ "#### Example 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "text = \"Observations: 03:04:53 - Horse awakens. \\\n",
+ " 03:05:14 - Horse goes back to sleep.\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Say we want to capture all occurrences of time data (hour, minute, and second) as *separate entities*."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"\\n'"
- ]
- },
- "execution_count": 26,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "first = log_lines[0]\n",
- "first"
+ "data": {
+ "text/plain": [
+ "[('03', '04', '53'), ('03', '05', '14')]"
]
- },
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pattern_1 = r\"(\\d\\d):(\\d\\d):(\\d\\d)\"\n",
+ "re.findall(pattern_1, text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice how the given pattern has 3 capture groups, each specified by the regular expression `(\\d\\d)`. We then use `re.findall`, which returns a list of tuples, one per match, with each tuple containing the 3 captured substrings.\n",
+ "\n",
+ "Capture groups can be written in equivalent ways. For example, we can use the `(\\d{2})` shorthand to extract the same data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "26 Jan 2014 10 47 58 -0800\n"
- ]
- }
- ],
- "source": [
- "pattern = r'\\[(\\d+)\\/(\\w+)\\/(\\d+):(\\d+):(\\d+):(\\d+) (.+)\\]'\n",
- "day, month, year, hour, minute, second, time_zone = re.findall(pattern, first)[0]\n",
- "print(day, month, year, hour, minute, second, time_zone)"
+ "data": {
+ "text/plain": [
+ "[('03', '04', '53'), ('03', '05', '14')]"
]
- },
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pattern_2 = r\"(\\d\\d):(\\d\\d):(\\d{2})\"\n",
+ "re.findall(pattern_2, text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Example 2\n",
+ "\n",
+ "With the notion of capture groups, convince yourself how the following regular expression works."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Limitations of Regular Expressions\n",
- "\n",
- "Today, we explored the capabilities of regular expressions in data wrangling with text data. However, there are a few things to be wary of.\n",
- "\n",
- "Writing regular expressions is like writing a program.\n",
- "\n",
- "- Need to know the syntax well.\n",
- "- Can be easier to write than to read.\n",
- "- Can be difficult to debug.\n",
- "\n",
- "Regular expressions are terrible at certain types of problems:\n",
- "\n",
- "- For parsing a hierarchical structure, such as JSON, use the `json.load()` parser, not regex!\n",
- "- Complex features (e.g. valid email address).\n",
- "- Counting (same number of instances of a and b). (impossible)\n",
- "- Complex properties (palindromes, balanced parentheses). (impossible)\n",
- "\n",
- "Ultimately, the goal is not to memorize all of regular expressions. Rather, the aim is to:\n",
- "\n",
- "- Understand what regex is capable of.\n",
- "- Parse and create regex, with a reference table\n",
- "- Use vocabulary (metacharacter, escape character, groups, etc.) to describe regex metacharacters.\n",
- "- Differentiate between (), [], {}\n",
- "- Design your own character classes with \\d, \\w, \\s, […-…], ^, etc.\n",
- "- Use Python and `pandas` regex methods.\n",
- "\n"
+ "data": {
+ "text/plain": [
+ "'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"\\n'"
]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
}
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.13"
+ ],
+ "source": [
+ "first = log_lines[0]\n",
+ "first"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "26 Jan 2014 10 47 58 -0800\n"
+ ]
}
+ ],
+ "source": [
+ "pattern = r'\\[(\\d+)\\/(\\w+)\\/(\\d+):(\\d+):(\\d+):(\\d+) (.+)\\]'\n",
+ "day, month, year, hour, minute, second, time_zone = re.findall(pattern, first)[0]\n",
+ "print(day, month, year, hour, minute, second, time_zone)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Limitations of Regular Expressions\n",
+ "\n",
+ "Today, we explored the capabilities of regular expressions in data wrangling with text data. However, there are a few things to be wary of.\n",
+ "\n",
+ "Writing regular expressions is like writing a program.\n",
+ "\n",
+ "- Need to know the syntax well.\n",
+ "- Can be easier to write than to read.\n",
+ "- Can be difficult to debug.\n",
+ "\n",
+ "Regular expressions are terrible at certain types of problems:\n",
+ "\n",
+ "- For parsing a hierarchical structure, such as JSON, use the `json.load()` parser, not RegEx!\n",
+ "- Complex features (e.g. valid email address).\n",
+ "- Counting (same number of instances of a and b). (impossible)\n",
+ "- Complex properties (palindromes, balanced parentheses). (impossible)\n",
+ "\n",
+ "Ultimately, the goal is not to memorize all regular expressions. Rather, the aim is to:\n",
+ "\n",
+ "- Understand what RegEx is capable of.\n",
+ "- Parse and create RegEx, with a reference table.\n",
+ "- Use vocabulary (metacharacter, escape character, groups, etc.) to describe regex metacharacters.\n",
+ "- Differentiate between `()`, `[]`, `{}`.\n",
+ "- Design your own character classes with \\d, \\w, \\s, […-…], ^, etc.\n",
+ "- Use Python and `pandas` RegEx methods.\n",
+ "\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
},
- "nbformat": 4,
- "nbformat_minor": 4
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.16"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
}
diff --git a/regex/regex.qmd b/regex/regex.qmd
index 034ecd02..cf7edee3 100644
--- a/regex/regex.qmd
+++ b/regex/regex.qmd
@@ -13,31 +13,33 @@ format:
jupyter: python3
---
-::: {.callout-note collapse="true"}
+::: {.callout-note collapse="false"}
## Learning Outcomes
-- Understand Python string manipulation, Pandas Series methods
+- Understand Python string manipulation, `pandas` `Series` methods
- Parse and create regex, with a reference table
-- Use vocabulary (closure, metacharater, groups, etc.) to describe regex metacharacters
+- Use vocabulary (closure, metacharacters, groups, etc.) to describe regex metacharacters
:::
+**This content is covered in lectures 6 and 7.**
+
## Why Work with Text?
-Last lecture, we learned of the difference between quantitative and qualitative variable types. The latter includes string data - the primary focus of today's lecture. In this note, we'll discuss the necessary tools to manipulate text: Python string manipulation and regular expressions.
+Last lecture, we learned of the difference between quantitative and qualitative variable types. The latter includes string data — the primary focus of lecture 6. In this note, we'll discuss the necessary tools to manipulate text: Python string manipulation and regular expressions.
There are two main reasons for working with text.
1. Canonicalization: Convert data that has multiple formats into a standard form.
- - By manipulating text, we can join tables with mismatched string labels
+ - By manipulating text, we can join tables with mismatched string labels.
2. Extract information into a new feature.
- - For example, we can extract date and time features from text
+ - For example, we can extract date and time features from text.
## Python String Methods
-First, we'll introduce a few methods useful for string manipulation. The following table includes a number of string operations supported by Python and `pandas`. The Python functions operate on a single string, while their equivalent in `pandas` are **vectorized** - they operate on a Series of string data.
+First, we'll introduce a few methods useful for string manipulation. The following table includes a number of string operations supported by Python and `pandas`. The Python functions operate on a single string, while their equivalent in `pandas` are **vectorized** — they operate on a `Series` of string data.
+-----------------------+-----------------+---------------------------+
-| Operation | Python | Pandas (Series) |
+| Operation | Python | `pandas` (`Series`) |
+=======================+=================+===========================+
| Transformation | - `s.lower(_)` | - `ser.str.lower(_)` |
| | - `s.upper(_)` | - `ser.str.upper(_)` |
@@ -58,7 +60,7 @@ First, we'll introduce a few methods useful for string manipulation. The followi
| | | |
+-----------------------+-----------------+---------------------------+
-We'll discuss the differences between Python string functions and `pandas` Series methods in the following section on canonicalization.
+We'll discuss the differences between Python string functions and `pandas` `Series` methods in the following section on canonicalization.
### Canonicalization
Assume we want to merge the given tables.
@@ -78,7 +80,7 @@ with open('data/county_and_population.csv') as f:
display(county_and_state), display(county_and_pop);
```
-Last time, we used a **primary key** and **foreign key** to join two tables. While neither of these keys exist in our DataFrames, the `County` columns look similar enough. Can we convert these columns into one standard, canonical form to merge the two tables?
+Last time, we used a **primary key** and **foreign key** to join two tables. While neither of these keys exist in our `DataFrame`s, the `"County"` columns look similar enough. Can we convert these columns into one standard, canonical form to merge the two tables?
#### Canonicalization with Python String Manipulation
@@ -99,7 +101,7 @@ def canonicalize_county(county_name):
canonicalize_county("St. John the Baptist")
```
-We will use the `pandas` `map` function to apply the `canonicalize_county` function to every row in both DataFrames. In doing so, we'll create a new column in each called `clean_county_python` with the canonical form.
+We will use the `pandas` `map` function to apply the `canonicalize_county` function to every row in both `DataFrame`s. In doing so, we'll create a new column in each called `clean_county_python` with the canonical form.
```{python}
county_and_pop['clean_county_python'] = county_and_pop['County'].map(canonicalize_county)
@@ -112,9 +114,9 @@ display(county_and_state), display(county_and_pop);
#### Canonicalization with Pandas Series Methods
-Alternatively, we can use `pandas` Series methods to create this standardized column. To do so, we must call the `.str` attribute of our Series object prior to calling any methods, like `.lower` and `.replace`. Notice how these method names match their equivalent built-in Python string functions.
+Alternatively, we can use `pandas` `Series` methods to create this standardized column. To do so, we must call the `.str` attribute of our `Series` object prior to calling any methods, like `.lower` and `.replace`. Notice how these method names match their equivalent built-in Python string functions.
-Chaining multiple Series methods in this manner eliminates the need to use the `map` function (as this code is vectorized).
+Chaining multiple `Series` methods in this manner eliminates the need to use the `map` function (as this code is vectorized).
```{python}
def canonicalize_county_series(county_series):
@@ -151,7 +153,7 @@ with open('data/log.txt', 'r') as f:
log_lines
```
-Suppose we want to extract the day, month, year, hour, minutes, seconds, and timezone. Unfortunately, these items are not in a fixed position from the beginning of the string, so slicing by some fixed offset won't work.
+Suppose we want to extract the day, month, year, hour, minutes, seconds, and time zone. Unfortunately, these items are not in a fixed position from the beginning of the string, so slicing by some fixed offset won't work.
 Instead, we can use some clever thinking. Notice how the relevant information is contained within a set of brackets, further separated by `/` and `:`. We can home in on this region of text, and split the data on these characters. Python's built-in `.split` function makes this easy.
@@ -167,9 +169,9 @@ day, month, year, hour, minute, seconds, time_zone
There are two problems with this code:
-1. Python's built-in functions limit us to extract data one record at a time
- - This can be resolved using a map function or Pandas Series methods.
-2. The code is quite verbose
+1. Python's built-in functions limit us to extract data one record at a time.
+ - This can be resolved using the `map` function or `pandas` `Series` methods.
+2. The code is quite verbose.
    - This is a larger issue that is trickier to solve.
In the next section, we'll introduce regular expressions - a tool that solves problem 2.
@@ -243,7 +245,7 @@ Notice how these metacharacter operations are ordered. Rather than being literal
- Notice how the outer capture group surrounds the `|`.
- Consider the regex `m(uu(uu)*)|(oo(oo)*)n`. This incorrectly matches `muu` and `oooon`.
- Each OR clause is everything to the left and right of `|`. The incorrect solution matches only half of the string, and ignores either the beginning `m` or trailing `n`.
- - A set of paranthesis must surround `|`. That way, each OR clause is everything to the left and right of `|` **within** the group. This ensures both the beginning `m` *and* trailing `n` are matched.
+ - A set of parentheses must surround `|`. That way, each OR clause is everything to the left and right of `|` **within** the group. This ensures both the beginning `m` *and* trailing `n` are matched.
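
To see this difference concretely, here is a quick sketch using Python's `re` module (the test strings are illustrative):

```python
import re

# Incorrect: the | splits the entire pattern, so "muu" matches the left
# alternative alone, ignoring the required trailing "n"
assert re.fullmatch(r"m(uu(uu)*)|(oo(oo)*)n", "muu")

# Correct: parentheses confine | to the vowel groups, so both the leading
# "m" and the trailing "n" must appear
assert re.fullmatch(r"m((uu(uu)*)|(oo(oo)*))n", "moon")
assert re.fullmatch(r"m((uu(uu)*)|(oo(oo)*))n", "muuuun")
assert not re.fullmatch(r"m((uu(uu)*)|(oo(oo)*))n", "muu")
```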
## Regex Expanded
@@ -312,10 +314,10 @@ Let's analyze a few examples of complex regular expressions.
1. `.*SPB.*` only matches strings that contain the substring `SPB`.
    - The `.*` metacharacter matches zero or more of any character. Newlines do not count.
-2. This regular expression matches 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit
- - You'll recognize this as the familiar Social Security Number regular expression
+2. This regular expression matches 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.
+ - You'll recognize this as the familiar Social Security Number regular expression.
3. Matches any email with a `com` or `edu` domain, where all characters of the email are letters.
- - At least one `.` must preceed the domain name. Including a backslash `\` before any metacharacter (in this case, the `.`) tells regex to match that character exactly.
+ - At least one `.` must precede the domain name. Including a backslash `\` before any metacharacter (in this case, the `.`) tells RegEx to match that character exactly.
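
These three descriptions can be checked with `re.fullmatch`; the email pattern below is an illustrative reconstruction consistent with the description, not necessarily the exact pattern from lecture:

```python
import re

# 1. Contains the substring "SPB"
assert re.fullmatch(r".*SPB.*", "RASPBERRY")

# 2. The Social Security Number pattern: 3 digits, dash, 2 digits, dash, 4 digits
assert re.fullmatch(r"[0-9]{3}-[0-9]{2}-[0-9]{4}", "231-41-5121")

# 3. An illustrative email pattern: letters only, with a com or edu domain,
# and at least one literal "." before the domain
assert re.fullmatch(r"[a-zA-Z]+@([a-zA-Z]+\.)+(com|edu)", "cs@berkeley.edu")
```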
## Convenient Regex
@@ -347,7 +349,24 @@ Here are a few more convenient regular expressions.
| | | | |
+------------------------------------------------+-----------------+----------------+------------------+
-#### Examples
+### Greediness
+
+In order to fully understand the last operation in the table, we have to discuss greediness. RegEx is greedy: it will look for the longest possible match in a string. To motivate this with an example, consider the pattern `<div>.*</div>`. Given the sentence below, we would hope that the bolded portions would be matched:
+
+"This is a **\<div>example\</div>** of greediness \<div>in\</div> regular expressions."
+
+In actuality, the way RegEx processes the text given that pattern is as follows:
+
+1. "Look for the exact string `<div>`",
+
+2. then, "look for any character 0 or more times",
+
+3. then, "look for the exact string `</div>`".
+
+The result would be all the characters from the leftmost `<div>` to the rightmost `</div>` (inclusive). We can fix this by making the pattern non-greedy: `<div>.*?</div>`. You can read more in the documentation [here](https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy).
+
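The greedy and non-greedy behaviors described above can be sketched with Python's `re` module:

```python
import re

text = "This is a <div>example</div> of greediness <div>in</div> regular expressions."

# Greedy: .* expands as far as possible, so the match runs from the
# first <div> all the way to the last </div>
greedy = re.findall(r"<div>.*</div>", text)

# Non-greedy: .*? stops at the first </div> after each <div>
non_greedy = re.findall(r"<div>.*?</div>", text)

print(greedy)      # ['<div>example</div> of greediness <div>in</div>']
print(non_greedy)  # ['<div>example</div>', '<div>in</div>']
```
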
+### Examples
 Let's revisit our earlier problem of extracting date/time data from the given `.txt` files. Here is how the data looked.
@@ -355,11 +374,11 @@ Let's revist our earlier problem of extracting date/time data from the given `.t
log_lines[0]
```
-**Question**: Give a regular expression that matches everything contained within and including the brackets - the day, month, year, hour, minutes, seconds, and timezone.
+**Question**: Give a regular expression that matches everything contained within and including the brackets - the day, month, year, hour, minutes, seconds, and time zone.
**Answer**: `\[.*\]`
-- Notice how matching the literal `[` and `]` is necessary. Therefore, an escape character `\` is required before both `[` and `]` - otherwise these metacharacters will match character classes.
+- Notice how matching the literal `[` and `]` is necessary. Therefore, an escape character `\` is required before both `[` and `]` — otherwise these metacharacters will match character classes.
- We need to match a particular format between `[` and `]`. For this example, `.*` will suffice.
**Alternative Solution**: `\[\w+/\w+/\w+:\w+:\w+:\w+\s-\w+\]`
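
As a quick sanity check, both patterns can be applied to a log line of the format shown earlier (the line below is a made-up example in that format):

```python
import re

# A made-up log line in the format discussed earlier in this note
line = '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/ HTTP/1.1" 301 328'

# The simple solution grabs everything between (and including) the brackets
print(re.findall(r"\[.*\]", line))  # ['[26/Jan/2014:10:47:58 -0800]']

# The alternative solution spells out the expected format inside the brackets
print(re.findall(r"\[\w+/\w+/\w+:\w+:\w+:\w+\s-\w+\]", line))  # same match
```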
@@ -374,9 +393,9 @@ log_lines[0]
#### Canonicalization with Regex
-Earlier in this note, we examined the process of canonicalization using Python string manipulation and `pandas` Series methods. However, we mentioned this approach had a major flaw: our code was unnecessarily verbose. Equipped with our knowledge of regular expressions, let's fix this.
+Earlier in this note, we examined the process of canonicalization using Python string manipulation and `pandas` `Series` methods. However, we mentioned this approach had a major flaw: our code was unnecessarily verbose. Equipped with our knowledge of regular expressions, let's fix this.
-To do so, we need to understand a few functions in the `re` module. The first of these is the substitute function: `re.sub(pattern, rep1, text)`. It behaves similarily to Python's built-in `.replace` function, and returns text with all instances of `pattern` replaced by `rep1`.
+To do so, we need to understand a few functions in the `re` module. The first of these is the substitute function: `re.sub(pattern, repl, text)`. It behaves similarly to Python's built-in `.replace` function, and returns text with all instances of `pattern` replaced by `repl`.
The regular expression here removes text surrounded by `<>` (also known as HTML tags).
@@ -395,15 +414,15 @@ pattern = r"<[^>]+>"
re.sub(pattern, '', text)
```
-Notice the `r` preceeding the regular expression pattern; this specifies the regular expression is a raw string. Raw strings do not recognize escape sequences (ie the Python newline metacharacter `\n`). This makes them useful for regular expressions, which often contain literal `\` characters.
+Notice the `r` preceding the regular expression pattern; this specifies the regular expression is a raw string. Raw strings do not recognize escape sequences (i.e., the Python newline metacharacter `\n`). This makes them useful for regular expressions, which often contain literal `\` characters.
-In other words, don't forget to tag your regex with a `r`.
+In other words, don't forget to tag your RegEx with an `r`.
#### Canonicalization with Pandas
-We can also use regular expressions with Pandas Series methods. This gives us the benefit of operating on an entire column of data as opposed to a single value. The code is simple: `ser.str.replace(pattern, repl, regex=True`).
+We can also use regular expressions with `pandas` `Series` methods. This gives us the benefit of operating on an entire column of data as opposed to a single value. The code is simple: `ser.str.replace(pattern, repl, regex=True)`.
-Consider the following DataFrame `html_data` with a single column.
+Consider the following `DataFrame` `html_data` with a single column.
```{python}
#| code-fold: true
@@ -438,9 +457,9 @@ re.findall(pattern, text)
#### Extraction with Pandas
-Pandas similarily provides extraction functionality on a Series of data: `ser.str.findall(pattern)`
+`pandas` similarly provides extraction functionality on a `Series` of data: `ser.str.findall(pattern)`.
-Consider the following DataFrame `ssn_data`.
+Consider the following `DataFrame` `ssn_data`.
```{python}
#| code-fold: true
@@ -460,9 +479,20 @@ ssn_data["SSN"].str.findall(pattern)
This function returns a list for every row containing the pattern matches in a given string.
+As you may expect, there are similar `pandas` equivalents for other `re` functions as well. `Series.str.extract` takes in a pattern and returns a `DataFrame` of each capture group’s first match in the string. In contrast, `Series.str.extractall` returns a multi-indexed `DataFrame` of all matches for each capture group. You can see the difference in the outputs below:
+
+```{python}
+pattern_cg = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"
+ssn_data["SSN"].str.extract(pattern_cg)
+```
+
+```{python}
+ssn_data["SSN"].str.extractall(pattern_cg)
+```
+
### Regular Expression Capture Groups
-Earlier we used parentheses `(` `)` to specify the highest order of operation in regular expressions. However, they have another meaning; paranthesis are often used to represent **capture groups**. Capture groups are essentially, a set of smaller regular expressions that match multiple substrings in text data.
+Earlier we used parentheses `(` `)` to specify the highest order of operation in regular expressions. However, they have another meaning; parentheses are often used to represent **capture groups**. Capture groups are essentially a set of smaller regular expressions that match multiple substrings in text data.
Let's take a look at an example.
@@ -516,8 +546,18 @@ Writing regular expressions is like writing a program.
Regular expressions are terrible at certain types of problems:
-- For parsing a hierarchical structure, such as JSON, use the `json.load()` parser, not regex!
+- For parsing a hierarchical structure, such as JSON, use the `json.load()` parser, not RegEx!
- Complex features (e.g. valid email address).
- Counting (same number of instances of a and b). (impossible)
- Complex properties (palindromes, balanced parentheses). (impossible)
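
For hierarchical data like JSON, a real parser recovers the nested structure directly. Here is a minimal sketch using the standard `json` module (the JSON snippet is made up for illustration):

```python
import json

# A small, made-up JSON snippet with nesting; a regex cannot reliably
# track arbitrary nesting depth, but a parser handles it directly
text = '{"county": {"name": "St. John the Baptist", "population": 43044}}'

data = json.loads(text)  # json.load() is the file-object equivalent
print(data["county"]["name"])  # St. John the Baptist
```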
+Ultimately, the goal is not to memorize all regular expressions. Rather, the aim is to:
+
+- Understand what RegEx is capable of.
+- Parse and create RegEx, with a reference table.
+- Use vocabulary (metacharacter, escape character, groups, etc.) to describe regex metacharacters.
+- Differentiate between `()`, `[]`, and `{}`.
+- Design your own character classes with `\d`, `\w`, `\s`, `[…-…]`, `^`, etc.
+- Use Python and `pandas` RegEx methods.
+
+