diff --git a/docs/search.json b/docs/search.json index c0e69fd8..332d64b9 100644 --- a/docs/search.json +++ b/docs/search.json @@ -278,5 +278,54 @@ "title": "5  Data Cleaning and EDA", "section": "8.2 EDA and Data Wrangling", "text": "8.2 EDA and Data Wrangling\nThere are several ways to approach EDA and Data Wrangling:\n\nExamine the data and metadata: what is the date, size, organization, and structure of the data?\nExamine each field/attribute/dimension individually.\nExamine pairs of related dimensions (e.g. breaking down grades by major).\nAlong the way, we can:\n\nVisualize or summarize the data.\nValidate assumptions about data and its collection process. Pay particular attention to when the data was collected.\nIdentify and address anomalies.\nApply data transformations and corrections (we’ll cover this in the upcoming lecture).\nRecord everything you do! Developing in Jupyter Notebook promotes reproducibility of your own work!" + }, + { + "objectID": "regex/regex.html#why-work-with-text", + "href": "regex/regex.html#why-work-with-text", + "title": "6  Regular Expressions", + "section": "6.1 Why Work with Text?", + "text": "6.1 Why Work with Text?\nLast lecture, we learned about the difference between quantitative and qualitative variable types. The latter includes string data — the primary focus of lecture 6. In this note, we’ll discuss the necessary tools to manipulate text: python string manipulation and regular expressions.\nThere are two main reasons for working with text.\n\nCanonicalization: Convert data that has multiple formats into a standard form.\n\nBy manipulating text, we can join tables with mismatched string labels.\n\nExtract information into a new feature.\n\nFor example, we can extract date and time features from text." + }, + { + "objectID": "regex/regex.html#python-string-methods", + "href": "regex/regex.html#python-string-methods", + "title": "6  Regular Expressions", + "section": "6.2 Python String Methods", + "text": "6.2 Python String Methods\nFirst, we’ll introduce a few methods useful for string manipulation. The following table includes a number of string operations supported by python and pandas. The python functions operate on a single string, while their equivalents in pandas are vectorized — they operate on a Series of string data.\n\n\n\n\n\n\n\n\nOperation\nPython\nPandas (Series)\n\n\n\n\nTransformation\n\ns.lower(_)\ns.upper(_)\n\n\nser.str.lower(_)\nser.str.upper(_)\n\n\n\nReplacement + Deletion\n\ns.replace(_)\n\n\nser.str.replace(_)\n\n\n\nSplit\n\ns.split(_)\n\n\nser.str.split(_)\n\n\n\nSubstring\n\ns[1:4]\n\n\nser.str[1:4]\n\n\n\nMembership\n\n'_' in s\n\n\nser.str.contains(_)\n\n\n\nLength\n\nlen(s)\n\n\nser.str.len()\n\n\n\n\nWe’ll discuss the differences between python string functions and pandas Series methods in the following section on canonicalization.\n\n6.2.1 Canonicalization\nAssume we want to merge the given tables.\n\n\nCode\nimport pandas as pd\n\nwith open('data/county_and_state.csv') as f:\n county_and_state = pd.read_csv(f)\n \nwith open('data/county_and_population.csv') as f:\n county_and_pop = pd.read_csv(f)\n\n\n\ndisplay(county_and_state), display(county_and_pop);\n\n\n\n\n\n\n\n\nCounty\nState\n\n\n\n\n0\nDe Witt County\nIL\n\n\n1\nLac qui Parle County\nMN\n\n\n2\nLewis and Clark County\nMT\n\n\n3\nSt John the Baptist Parish\nLS\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCounty\nPopulation\n\n\n\n\n0\nDeWitt\n16798\n\n\n1\nLac Qui Parle\n8067\n\n\n2\nLewis & Clark\n55716\n\n\n3\nSt. John the Baptist\n43044\n\n\n\n\n\n\n\nLast time, we used a primary key and foreign key to join two tables. While neither of these keys exists in our DataFrames, the \"County\" columns look similar enough. Can we convert these columns into one standard, canonical form to merge the two tables?
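\nBefore canonicalizing, it is worth seeing the problem concretely. As a quick sketch (using the two DataFrames above), merging directly on the raw \"County\" labels finds no exact matches, so the result is empty:\n\ncounty_and_state.merge(county_and_pop, on=\"County\")\n\n# Empty DataFrame — e.g. 'De Witt County' and 'DeWitt' never agree exactly\n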
\n6.2.1.1 Canonicalization with python String Manipulation\nThe following function uses python string manipulation to convert a single county name into canonical form. It does so by eliminating whitespace, punctuation, and unnecessary text.\n\ndef canonicalize_county(county_name):\n return (\n county_name\n .lower()\n .replace(' ', '')\n .replace('&', 'and')\n .replace('.', '')\n .replace('county', '')\n .replace('parish', '')\n )\n\ncanonicalize_county(\"St. John the Baptist\")\n\n'stjohnthebaptist'\n\n\nWe will use the pandas map function to apply the canonicalize_county function to every row in both DataFrames. In doing so, we’ll create a new column in each called clean_county_python with the canonical form.\n\ncounty_and_pop['clean_county_python'] = county_and_pop['County'].map(canonicalize_county)\ncounty_and_state['clean_county_python'] = county_and_state['County'].map(canonicalize_county)\ndisplay(county_and_state), display(county_and_pop);\n\n\n\n\n\n\n\n\nCounty\nState\nclean_county_python\n\n\n\n\n0\nDe Witt County\nIL\ndewitt\n\n\n1\nLac qui Parle County\nMN\nlacquiparle\n\n\n2\nLewis and Clark County\nMT\nlewisandclark\n\n\n3\nSt John the Baptist Parish\nLS\nstjohnthebaptist\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCounty\nPopulation\nclean_county_python\n\n\n\n\n0\nDeWitt\n16798\ndewitt\n\n\n1\nLac Qui Parle\n8067\nlacquiparle\n\n\n2\nLewis & Clark\n55716\nlewisandclark\n\n\n3\nSt. John the Baptist\n43044\nstjohnthebaptist\n\n\n\n\n\n\n\n\n\n6.2.1.2 Canonicalization with Pandas Series Methods\nAlternatively, we can use pandas Series methods to create this standardized column. To do so, we must call the .str attribute of our Series object prior to calling any methods, like .lower and .replace. Notice how these method names match their equivalent built-in python string functions.\nChaining multiple Series methods in this manner eliminates the need to use the map function (as this code is vectorized).\n\ndef canonicalize_county_series(county_series):\n return (\n county_series\n .str.lower()\n .str.replace(' ', '')\n .str.replace('&', 'and')\n .str.replace('.', '')\n .str.replace('county', '')\n .str.replace('parish', '')\n )\n\ncounty_and_pop['clean_county_pandas'] = canonicalize_county_series(county_and_pop['County'])\ncounty_and_state['clean_county_pandas'] = canonicalize_county_series(county_and_state['County'])\ndisplay(county_and_pop), display(county_and_state);\n\n/var/folders/sy/b85yc0p951zdr__z5hvdmbjm0000gn/T/ipykernel_80860/2523629438.py:7: FutureWarning:\n\nThe default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n\n/var/folders/sy/b85yc0p951zdr__z5hvdmbjm0000gn/T/ipykernel_80860/2523629438.py:7: FutureWarning:\n\nThe default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n\n\n\n\n\n\n\n\n\n\nCounty\nPopulation\nclean_county_python\nclean_county_pandas\n\n\n\n\n0\nDeWitt\n16798\ndewitt\ndewitt\n\n\n1\nLac Qui Parle\n8067\nlacquiparle\nlacquiparle\n\n\n2\nLewis & Clark\n55716\nlewisandclark\nlewisandclark\n\n\n3\nSt. John the Baptist\n43044\nstjohnthebaptist\nstjohnthebaptist\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCounty\nState\nclean_county_python\nclean_county_pandas\n\n\n\n\n0\nDe Witt County\nIL\ndewitt\ndewitt\n\n\n1\nLac qui Parle County\nMN\nlacquiparle\nlacquiparle\n\n\n2\nLewis and Clark County\nMT\nlewisandclark\nlewisandclark\n\n\n3\nSt John the Baptist Parish\nLS\nstjohnthebaptist\nstjohnthebaptist\n\n\n\n\n\n\n\n\n\n
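\nWith a canonical key in both tables, the merge we set out to do now works. A minimal sketch (the name merged is ours):\n\nmerged = county_and_state.merge(county_and_pop, on=\"clean_county_pandas\")\n\n# The shared 'County' and 'clean_county_python' columns get _x/_y suffixes;\n# each of the four counties now matches exactly once.\nmerged[[\"County_x\", \"State\", \"Population\"]]\n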
\n6.2.2 Extraction\nExtraction explores the idea of obtaining useful information from text data. This will be particularly important in model building, which we’ll study in a few weeks.\nSay we want to read some data from a .txt file.\n\nwith open('data/log.txt', 'r') as f:\n log_lines = f.readlines()\n\nlog_lines\n\n['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"\\n',\n '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] \"GET /stat141/Notes/dim.html HTTP/1.0\" 404 302 \"http://eeyore.ucdavis.edu/stat141/Notes/session.html\"\\n',\n '169.237.46.240 - \"\" [3/Feb/2006:10:18:37 -0800] \"GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1\"\\n']\n\n\nSuppose we want to extract the day, month, year, hour, minutes, seconds, and time zone. Unfortunately, these items are not in a fixed position from the beginning of the string, so slicing by some fixed offset won’t work.\nInstead, we can use some clever thinking. Notice how the relevant information is contained within a set of brackets, further separated by / and :. We can home in on this region of text, and split the data on these characters. Python’s built-in .split function makes this easy.\n\nfirst = log_lines[0] # Only considering the first row of data\n\npertinent = first.split(\"[\")[1].split(']')[0]\nday, month, rest = pertinent.split('/')\nyear, hour, minute, rest = rest.split(':')\nseconds, time_zone = rest.split(' ')\nday, month, year, hour, minute, seconds, time_zone\n\n('26', 'Jan', '2014', '10', '47', '58', '-0800')\n\n\nThere are two problems with this code:\n\nPython’s built-in functions limit us to extracting data one record at a time.\n\nThis can be resolved using the map function or pandas Series methods.\n\nThe code is quite verbose.\n\nThis is a larger issue that is trickier to solve.\n\n\nIn the next section, we’ll introduce regular expressions - a tool that solves problem 2." + }, + { + "objectID": "regex/regex.html#regex-basics", + "href": "regex/regex.html#regex-basics", + "title": "6  Regular Expressions", + "section": "6.3 Regex Basics", + "text": "6.3 Regex Basics\nA regular expression (“RegEx”) is a sequence of characters that specifies a search pattern. Regular expressions are written to extract specific information from text. They are essentially part of a smaller programming language embedded in python, made available through the re module. As such, they have a stand-alone syntax and methods for various capabilities.\nRegular expressions are useful in many applications beyond data science. For example, Social Security Numbers (SSNs) are often validated with regular expressions.\n\nr\"[0-9]{3}-[0-9]{2}-[0-9]{4}\" # Regular Expression Syntax\n\n# 3 of any digit, then a dash,\n# then 2 of any digit, then a dash,\n# then 4 of any digit\n\n'[0-9]{3}-[0-9]{2}-[0-9]{4}'\n\n\n\n\nThere are a ton of resources to learn and experiment with regular expressions. A few are provided below:\n\nOfficial Regex Guide\nData 100 Reference Sheet\nRegex101.com\n\nBe sure to check Python under the flavor category on the left.
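\nAs a minimal sketch of such validation (the helper name is ours, not part of the lecture), re.fullmatch succeeds only when the entire string fits the pattern:\n\nimport re\n\ndef is_valid_ssn(text):\n    # fullmatch returns a Match object only if the *whole* string fits\n    return re.fullmatch(r\"[0-9]{3}-[0-9]{2}-[0-9]{4}\", text) is not None\n\nis_valid_ssn(\"123-45-6789\") # True\nis_valid_ssn(\"123-456-789\") # False\n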
\n\n\n6.3.1 Basic Regex Syntax\nThere are four basic operations with regular expressions.\n\n\n\n\n\n\n\n\n\n\nOperation\nOrder\nSyntax Example\nMatches\nDoesn’t Match\n\n\n\n\nOr: |\n4\nAA|BAAB\nAA BAAB\nevery other string\n\n\nConcatenation\n3\nAABAAB\nAABAAB\nevery other string\n\n\nClosure: * (zero or more)\n2\nAB*A\nAA ABBBBBBA\nAB ABABA\n\n\nGroup: () (parentheses)\n1\nA(A|B)AAB (AB)*A\nAAAAB ABAAB A ABABABABA\nevery other string AA ABBA\n\n\n\nNotice how these metacharacter operations are ordered. Rather than being literal characters, these metacharacters manipulate adjacent characters. () takes precedence, followed by *, and finally |. This allows us to differentiate between very different regex commands like AB* and (AB)*. The former reads “A then zero or more copies of B”, while the latter specifies “zero or more copies of AB”.\n\n6.3.1.1 Examples\nQuestion 1: Give a regular expression that matches moon, moooon, etc. Your expression should match any even number of os except zero (i.e. don’t match mn).\nAnswer 1: moo(oo)*n\n\nHardcoding oo before the capture group ensures that mn is not matched.\nA capture group of (oo)* ensures the number of o’s is even.\n\nQuestion 2: Using only basic operations, formulate a regex that matches muun, muuuun, moon, moooon, etc. Your expression should match any even number of us or os except zero (i.e. don’t match mn).\nAnswer 2: m(uu(uu)*|oo(oo)*)n\n\nThe leading m and trailing n ensure that only strings beginning with m and ending with n are matched.\nNotice how the outer capture group surrounds the |.\n\nConsider the regex m(uu(uu)*)|(oo(oo)*)n. This incorrectly matches muu and oooon.\n\nEach OR clause is everything to the left and right of |. The incorrect solution matches only half of the string, and ignores either the beginning m or trailing n.\nA set of parentheses must surround |. That way, each OR clause is everything to the left and right of | within the group. This ensures both the beginning m and trailing n are matched." + }, + { + "objectID": "regex/regex.html#regex-expanded", + "href": "regex/regex.html#regex-expanded", + "title": "6  Regular Expressions", + "section": "6.4 Regex Expanded", + "text": "6.4 Regex Expanded\nProvided below are more complex regular expression operations.\n\n\n\n\n\n\n\n\n\nOperation\nSyntax Example\nMatches\nDoesn’t Match\n\n\n\n\nAny Character: . (except newline)\n.U.U.U.\nCUMULUS JUGULUM\nSUCCUBUS TUMULTUOUS\n\n\nCharacter Class: [] (match one character in [])\n[A-Za-z][a-z]*\nword Capitalized\ncamelCase 4illegal\n\n\nRepeated \"a\" Times: {a}\nj[aeiou]{3}hn\njaoehn jooohn\njhn jaeiouhn\n\n\nRepeated \"from a to b\" Times: {a, b}\nj[ou]{1,2}hn\njohn juohn\njhn jooohn\n\n\nAt Least One: +\njo+hn\njohn joooooohn\njhn jjohn\n\n\nZero or One: ?\njoh?n\njon john\nany other string\n\n\n\nA character class matches a single character in its class. These characters can be hardcoded – in the case of [aeiou] – or shorthand can be specified to mean a range of characters. Examples include:\n\n[A-Z]: Any capitalized letter\n[a-z]: Any lowercase letter\n[0-9]: Any single digit\n[A-Za-z]: Any capitalized or lowercase letter\n[A-Za-z0-9]: Any capitalized or lowercase letter or single digit
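\nAs a small sketch (the word list is ours), the [A-Za-z][a-z]* row of the table above can be verified with re.fullmatch, which only succeeds when the whole string fits the pattern:\n\nimport re\n\nfor word in [\"word\", \"Capitalized\", \"camelCase\", \"4illegal\"]:\n    # one letter, then zero or more lowercase letters — nothing else allowed\n    print(word, bool(re.fullmatch(r\"[A-Za-z][a-z]*\", word)))\n\n# word True\n# Capitalized True\n# camelCase False\n# 4illegal False\n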
\n\n6.4.0.1 Examples\nLet’s analyze a few examples of complex regular expressions.\n\n\n\n\n\n\n\nMatches\nDoes Not Match\n\n\n\n\n\n.*SPB.*\n\n\n\n\nRASPBERRY SPBOO\nSUBSPACE SUBSPECIES\n\n\n\n[0-9]{3}-[0-9]{2}-[0-9]{4}\n\n\n\n\n231-41-5121 573-57-1821\n231415121 57-3571821\n\n\n\n[a-z]+@([a-z]+\\.)+(edu|com)\n\n\n\n\nhorse@pizza.com horse@pizza.food.com\nfrank_99@yahoo.com hug@cs\n\n\n\nExplanations\n\n.*SPB.* only matches strings that contain the substring SPB.\n\nThe .* metacharacter matches zero or more of any character, except newlines.\n\n\nThis regular expression matches 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.\n\nYou’ll recognize this as the familiar Social Security Number regular expression.\n\nMatches any email with a com or edu domain, where all characters of the email are letters.\n\nAt least one . must precede the domain name. Including a backslash \\ before any metacharacter (in this case, the .) tells RegEx to match that character exactly." + }, + { + "objectID": "regex/regex.html#convenient-regex", + "href": "regex/regex.html#convenient-regex", + "title": "6  Regular Expressions", + "section": "6.5 Convenient Regex", + "text": "6.5 Convenient Regex\nHere are a few more convenient regular expressions.\n\n\n\n\n\n\n\n\n\nOperation\nSyntax Example\nMatches\nDoesn’t Match\n\n\n\n\nbuilt-in character class\n\\w+ \\d+ \\s+ \nFawef_03 231123 whitespace\nthis person 423 people non-whitespace\n\n\ncharacter class negation: [^] (everything except the given characters)\n[^a-z]+.\nPEPPERS3982 17211!↑å\nporch CLAmS\n\n\nescape character: \\ (match the literal next character)\ncow\\.com\ncow.com\ncowscom\n\n\nbeginning of line: ^\n^ark\nark two ark o ark\ndark\n\n\nend of line: $\nark$\ndark ark o ark\nark two\n\n\nlazy version of zero or more: *?\n5.*?5\n5005 55\n5005005\n\n\n\n\n6.5.1 Greediness\nIn order to fully understand the last operation in the table, we have to discuss greediness. RegEx is greedy – it will look for the longest possible match in a string. To motivate this with an example, consider the pattern <div>.*</div>. Given the sentence below, we would hope that the bolded portions would be matched:\n“This is a <div>example</div> of greediness <div>in</div> regular expressions.”\nIn actuality, the way RegEx processes the text given that pattern is as follows:\n\n“Look for the exact string <div>”\nthen, “look for any character 0 or more times”\nthen, “look for the exact string </div>”\n\nThe result would be all the characters starting from the leftmost <div> through the rightmost </div> (inclusive). We can fix this by making the pattern non-greedy: <div>.*?</div>. You can read more in the documentation here.\n\n\n6.5.2 Examples\nLet’s revisit our earlier problem of extracting date/time data from the given .txt file. Here is how the data looked.\n\nlog_lines[0]\n\n'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"\\n'\n\n\nQuestion: Give a regular expression that matches everything contained within and including the brackets - the day, month, year, hour, minutes, seconds, and time zone.\nAnswer: \\[.*\\]\n\nNotice how matching the literal [ and ] is necessary. Therefore, an escape character \\ is required before both [ and ] — otherwise RegEx will interpret them as delimiting a character class.\nWe need to match a particular format between [ and ]. For this example, .* will suffice.\n\nAlternative Solution: \\[\\w+/\\w+/\\w+:\\w+:\\w+:\\w+\\s-\\w+\\]\n\nThis solution is much safer.\n\nImagine the data between [ and ] was garbage - .* will still match that.\nThe alternate solution will only match data that follows the correct format.
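\nTo see that safety claim concretely, here is a quick sketch comparing both answers on the first log line and on a made-up garbage string:\n\nimport re\n\nfirst = '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"'\n\nre.findall(r\"\\[.*\\]\", first) # ['[26/Jan/2014:10:47:58 -0800]']\nre.findall(r\"\\[\\w+/\\w+/\\w+:\\w+:\\w+:\\w+\\s-\\w+\\]\", first) # ['[26/Jan/2014:10:47:58 -0800]']\n\n# On garbage brackets, only the permissive pattern matches:\nre.findall(r\"\\[.*\\]\", \"[garbage]\") # ['[garbage]']\nre.findall(r\"\\[\\w+/\\w+/\\w+:\\w+:\\w+:\\w+\\s-\\w+\\]\", \"[garbage]\") # []\n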
" + }, + { + "objectID": "regex/regex.html#regex-in-python-and-pandas-regex-groups", + "href": "regex/regex.html#regex-in-python-and-pandas-regex-groups", + "title": "6  Regular Expressions", + "section": "6.6 Regex in Python and Pandas (RegEx Groups)", + "text": "6.6 Regex in Python and Pandas (RegEx Groups)\n\n6.6.1 Canonicalization\n\n6.6.1.1 Canonicalization with RegEx\nEarlier in this note, we examined the process of canonicalization using python string manipulation and pandas Series methods. However, we mentioned this approach had a major flaw: our code was unnecessarily verbose. Equipped with our knowledge of regular expressions, let’s fix this.\nTo do so, we need to understand a few functions in the re module. The first of these is the substitute function: re.sub(pattern, repl, text). It behaves similarly to python’s built-in .replace function, and returns text with all instances of pattern replaced by repl.\nThe regular expression here removes text surrounded by <> (also known as HTML tags).\nIn order, the pattern matches:\n\n1. a single <\n2. one or more characters that are not > (here: div, td valign…, /td, /div)\n3. a single >\n\nAny substring in text that fulfills all three conditions will be replaced by ''.\n\nimport re\n\ntext = \"<div><td valign='top'>Moo</td></div>\"\npattern = r\"<[^>]+>\"\nre.sub(pattern, '', text) \n\n'Moo'\n\n\nNotice the r preceding the regular expression pattern; this specifies the regular expression is a raw string. Raw strings do not recognize escape sequences (e.g., the python newline character \\n). This makes them useful for regular expressions, which often contain literal \\ characters.\nIn other words, don’t forget to tag your RegEx with an r.\n\n\n6.6.1.2 Canonicalization with pandas\nWe can also use regular expressions with pandas Series methods. This gives us the benefit of operating on an entire column of data as opposed to a single value. The code is simple: ser.str.replace(pattern, repl, regex=True).\nConsider the following DataFrame html_data with a single column.\n\n\nCode\ndata = {\"HTML\": [\"<div><td valign='top'>Moo</td></div>\", \\\n \"<a href='http://ds100.org'>Link</a>\", \\\n \"<b>Bold text</b>\"]}\nhtml_data = pd.DataFrame(data)\n\n\n\nhtml_data\n\n\n\n\n\n\n\n\nHTML\n\n\n\n\n0\n<div><td valign='top'>Moo</td></div>\n\n\n1\n<a href='http://ds100.org'>Link</a>\n\n\n2\n<b>Bold text</b>\n\n\n\n\n\n\n\n\npattern = r\"<[^>]+>\"\nhtml_data['HTML'].str.replace(pattern, '', regex=True)\n\n0 Moo\n1 Link\n2 Bold text\nName: HTML, dtype: object\n
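\nBefore moving on, a minimal sketch of the raw-string point above (plain python, nothing regex-specific): an escape sequence is processed in a normal string but preserved in a raw one, which is exactly what a regex pattern wants.\n\nlen(\"\\t\") # 1 — python converts the escape into a single tab character\nlen(r\"\\t\") # 2 — the raw string keeps the backslash and the t for RegEx to see\n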
\n\n\n\n6.6.2 Extraction\n\n6.6.2.1 Extraction with RegEx\nJust like with canonicalization, the re module provides the capability to extract relevant text from a string: re.findall(pattern, text). This function returns a list of all matches to pattern.\nUsing the familiar regular expression for Social Security Numbers:\n\ntext = \"My social security number is 123-45-6789 bro, or maybe it’s 321-45-6789.\"\npattern = r\"[0-9]{3}-[0-9]{2}-[0-9]{4}\"\nre.findall(pattern, text) \n\n['123-45-6789', '321-45-6789']\n\n\n\n\n6.6.2.2 Extraction with pandas\npandas similarly provides extraction functionality on a Series of data: ser.str.findall(pattern).\nConsider the following DataFrame ssn_data.\n\n\nCode\ndata = {\"SSN\": [\"987-65-4321\", \"forty\", \\\n \"123-45-6789 bro or 321-45-6789\",\n \"999-99-9999\"]}\nssn_data = pd.DataFrame(data)\n\n\n\nssn_data\n\n\n\n\n\n\n\n\nSSN\n\n\n\n\n0\n987-65-4321\n\n\n1\nforty\n\n\n2\n123-45-6789 bro or 321-45-6789\n\n\n3\n999-99-9999\n\n\n\n\n\n\n\n\nssn_data[\"SSN\"].str.findall(pattern)\n\n0 [987-65-4321]\n1 []\n2 [123-45-6789, 321-45-6789]\n3 [999-99-9999]\nName: SSN, dtype: object\n\n\nFor every row, this function returns a list of all pattern matches found in that row’s string.\nAs you may expect, there are similar pandas equivalents for other re functions as well. Series.str.extract takes in a pattern and returns a DataFrame of each capture group’s first match in the string. In contrast, Series.str.extractall returns a multi-indexed DataFrame of all matches for each capture group. You can see the difference in the outputs below:\n\npattern_cg = r\"([0-9]{3})-([0-9]{2})-([0-9]{4})\"\nssn_data[\"SSN\"].str.extract(pattern_cg)\n\n\n\n\n\n\n\n\n0\n1\n2\n\n\n\n\n0\n987\n65\n4321\n\n\n1\nNaN\nNaN\nNaN\n\n\n2\n123\n45\n6789\n\n\n3\n999\n99\n9999\n\n\n\n\n\n\n\n\nssn_data[\"SSN\"].str.extractall(pattern_cg)\n\n\n\n\n\n\n\n\n\n0\n1\n2\n\n\n\nmatch\n\n\n\n\n\n\n\n0\n0\n987\n65\n4321\n\n\n2\n0\n123\n45\n6789\n\n\n1\n321\n45\n6789\n\n\n3\n0\n999\n99\n9999\n\n\n\n\n\n\n\n\n\n\n6.6.3 Regular Expression Capture Groups\nEarlier we used parentheses ( ) to specify the highest order of operation in regular expressions. However, they have another meaning; parentheses are often used to represent capture groups. Capture groups are essentially a set of smaller regular expressions that match multiple substrings in text data.\nLet’s take a look at an example.\n\n6.6.3.1 Example 1\n\ntext = \"Observations: 03:04:53 - Horse awakens. \\\n 03:05:14 - Horse goes back to sleep.\"\n\nSay we want to capture all occurrences of time data (hour, minute, and second) as separate entities.\n\npattern_1 = r\"(\\d\\d):(\\d\\d):(\\d\\d)\"\nre.findall(pattern_1, text)\n\n[('03', '04', '53'), ('03', '05', '14')]\n\n\nNotice how the given pattern has 3 capture groups, each specified by the regular expression (\\d\\d). We then use re.findall, which returns these capture groups as a list of tuples, each containing the 3 matches.\nThe capture groups within a single pattern don’t all need to use the same syntax. For example, we can use the (\\d{2}) shorthand to extract the same data.\n\npattern_2 = r\"(\\d\\d):(\\d\\d):(\\d{2})\"\nre.findall(pattern_2, text)\n\n[('03', '04', '53'), ('03', '05', '14')]\n
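\nOne detail worth seeing side by side (a small sketch reusing the text from Example 1): re.findall only returns tuples when the pattern contains capture groups; without them, it returns the full matches.\n\nimport re\n\ntext = \"Observations: 03:04:53 - Horse awakens. 03:05:14 - Horse goes back to sleep.\"\n\nre.findall(r\"\\d\\d:\\d\\d:\\d\\d\", text) # ['03:04:53', '03:05:14']\nre.findall(r\"(\\d\\d):(\\d\\d):(\\d\\d)\", text) # [('03', '04', '53'), ('03', '05', '14')]\n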
\n\n\n\n6.6.3.2 Example 2\nWith the notion of capture groups, convince yourself how the following regular expression works.\n\nfirst = log_lines[0]\nfirst\n\n'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /stat141/Winter04/ HTTP/1.1\" 200 2585 \"http://anson.ucdavis.edu/courses/\"\\n'\n\n\n\npattern = r'\\[(\\d+)\\/(\\w+)\\/(\\d+):(\\d+):(\\d+):(\\d+) (.+)\\]'\nday, month, year, hour, minute, second, time_zone = re.findall(pattern, first)[0]\nprint(day, month, year, hour, minute, second, time_zone)\n\n26 Jan 2014 10 47 58 -0800" + }, + { + "objectID": "regex/regex.html#limitations-of-regular-expressions", + "href": "regex/regex.html#limitations-of-regular-expressions", + "title": "6  Regular Expressions", + "section": "6.7 Limitations of Regular Expressions", + "text": "6.7 Limitations of Regular Expressions\nToday, we explored the capabilities of regular expressions in data wrangling with text data. However, there are a few things to be wary of.\nWriting regular expressions is like writing a program.\n\nNeed to know the syntax well.\nCan be easier to write than to read.\nCan be difficult to debug.\n\nRegular expressions are terrible at certain types of problems:\n\nFor parsing a hierarchical structure, such as JSON, use the json.load() parser, not RegEx!\nComplex features (e.g. valid email address).\nCounting (same number of instances of a and b). (impossible)\nComplex properties (palindromes, balanced parentheses). (impossible)\n\nUltimately, the goal is not to memorize all regular expressions. Rather, the aim is to:\n\nUnderstand what RegEx is capable of.\nParse and create RegEx with the help of a reference table.\nUse vocabulary (metacharacter, escape character, groups, etc.) to describe regex metacharacters.\nDifferentiate between (), [], and {}.\nDesign your own character classes with \\d, \\w, \\s, […-…], ^, etc.\nUse python and pandas RegEx methods." } ] \ No newline at end of file