# R `vector`s versus Numpy `array`s and Pandas' `Series`
This section describes the data types that let us store collections of elements that all **share the same type**. Data is very commonly stored in this fashion, so this section is quite important. Once we have one of these collection objects in a program, we will want to know how to extract and modify different elements in the collection, as well as how to use the entire collection in an efficient calculation.
## Overview of R
In the previous section, I mentioned that R does not have scalar types--it just has [**vectors**](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Vector-objects). So, whether you want to store one number (or `logical`, or `character`, or ...), or many numbers, you will need a `vector`.\index{vectors in R}
::: {.rmd-caution data-latex=""}
For many, the word "vector" evokes an impression that these objects are designed to be used for performing matrix arithmetic (e.g. inner products, transposes, etc.). You can perform these operations on `vector`s, but in my opinion, this preconception can be misleading, and I recommend avoiding it. Most of the things you can do with `vector`s in R have little to do with linear algebra!
:::
How do we create one of these? There are many ways. One common way is to read in elements from an external data set. Another way is to generate `vector`s from code.
```{R, collapse = TRUE}
1:10 # consecutive integers
seq(1,10,2) # arbitrary sequences
rep(2,5) # repeating numbers
# combine elements without relying on a pattern
c("5/2/2021", "5/3/2021", "5/4/2021")
# generate Gaussian random variables
rnorm(5)
```
`c()` is short for "combine". `seq()` and `rep()` are short for "sequence" and "replicate", respectively. `rnorm()` samples normal (or Gaussian) random variables. There is plenty more to learn about these functions, so I encourage you to take a look at their documentation.
::: {.rmd-details data-latex=""}
I should mention that functions such as `rnorm()` don't create truly random numbers, just *pseudorandom* ones. Pseudorandom numbers are nearly indistinguishable from truly random ones, but the way the computer generates them is actually deterministic.
First, a *seed*, or starting number, is chosen. Then, the *pseudorandom number generator (PRNG)* maps that number to another number. The sequence of all the numbers produced appears to be random, but is actually deterministic.
Sometimes you will be interested in setting the seed on your own because it is a cheap way of sharing and communicating data with others. If two people use the same starting seed, and the same PRNG, then they should simulate the same data. This can be important if you want to help other people reproduce the results of code you share. Most of the time, though, I don't set the seed, and I don't think about the distinction between random and pseudorandom numbers.
:::
## Overview of Python
If you want to store many elements of the same type (and size) in Python, you will probably need a Numpy `array`. Numpy is a highly-regarded third party library [@harris2020array] for Python. Its `array` objects store elements of the same type, just as R's `vector`s do.\index{Numpy arrays in Python}
There are five ways to create numpy arrays ([source](https://numpy.org/doc/stable/user/basics.creation.html)). Here are some examples that complement the examples from above.
```{python, collapse = TRUE}
import numpy as np
np.array([1,2,3])
np.arange(1,12,2)
np.random.normal(size=3)
```
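The note about pseudorandom numbers above applies to Numpy as well. If you want reproducible draws, you can seed the generator first; here is a minimal sketch (assuming a recent version of Numpy, and an arbitrary seed of `42`):
```{python, collapse = TRUE}
# seeding makes the "random" draws reproducible
rng = np.random.default_rng(42)
rng.normal(size=3)
np.random.default_rng(42).normal(size=3) # same seed, same draws
```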
Another option for storing a homogeneous collection of elements in Python is a [`Series` object](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas-series) from the Pandas\index{Pandas} library\index{Pandas series}. The benefit of these is that they play nicely with Pandas' data frames (more information about Pandas' data frames can be found in \@ref(data-frames-in-python)), and that they have more flexibility with accessing elements by name (see [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html#Series-as-generalized-NumPy-array) for more information).
```{python, collapse = TRUE}
import pandas as pd
first = pd.Series([2, 4, 6])
second = pd.Series([2, 4, 6], index = ['a','b','c'])
print(first[0])
print(second['c'])
```
## Vectorization in R
An operation in R is **vectorized**\index{vectorization} if it applies to all of the elements of a `vector` at once. An operator that is not vectorized can only be applied to individual elements. In that case, the programmer would need to write more code to instruct the function to be applied to all of the elements of a vector. You should prefer writing vectorized code because it is usually easier to read. Moreover, many of these vectorized functions are written in compiled code, so they can often be much faster.
Arithmetic (e.g. `+`, `-`, `*`, `/`, `^`, `%%`, `%/%`, etc.) and logical (e.g. `!`, `|`, `&`, `>`, `>=`, `<`, `<=`, `==`, etc.) operators are commonly applied to one or two vectors. Arithmetic is usually performed *element-by-element*. Numeric vectors are converted to logical vectors if they need to be. Be careful of operator precedence if you seek to minimize your use of parentheses.
Note that there is an extraordinary number of named functions (e.g. `sum()`, `length()`, `cumsum()`, etc.) that operate on entire `vector`s, as well. Here are some examples.
```{r, collapse = TRUE}
(1:3) * (1:3)
(1:3) == rev(1:3)
sin( (2*pi/3)*(1:4))
```
In the last example, there is **recycling**\index{recycling!|see{broadcasting}} happening. `(2*pi/3)` is taking three length-one vectors and producing another length-one vector. The resulting length-one vector is multiplied by a length four vector `1:4`. The single element in the length one vector gets *recycled* so that its value is multiplied by every element of `1:4`.
This makes sense most of the time, but sometimes recycling can be tricky. Notice that the following code does not produce an error--just a warning: `longer object length is not a multiple of shorter object length`. Try running it on your machine to confirm this.
```{r, collapse = TRUE, eval = FALSE}
(1:3) * (1:4)
```
## Vectorization in Python
Python's Numpy library makes extensive use of vectorization as well. Vectorization in Numpy is accomplished with [**universal functions**](https://numpy.org/doc/stable/reference/ufuncs.html), or "ufuncs" for short\index{vectorization!universal functions}. Some ufuncs can be invoked using the same syntax as in R (e.g. `+`). You can also refer to a function by its name (e.g. `np.add()` instead of `+`). Mixing and matching is allowed, too.
Ufuncs are called *unary* if they take in one array, and *binary* if they take in two. At the moment, there are [fewer than $100$ available](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs), all performing either mathematical operations, boolean-emitting comparisons, or bit-twiddling operations. For an exhaustive list of Numpy's universal functions, [click here.](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs) Here are some code examples.
```{python, collapse=TRUE}
np.arange(1,4)*np.arange(1,4)
np.zeros(5) > np.arange(-3,2)
np.exp( -.5 * np.linspace(-3, 3, 10)**2) / np.sqrt( 2 * np.pi)
```
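If it helps to see the operator form and the named-ufunc form side by side, here is a small sketch; these are just two spellings of the same elementwise operations:
```{python, collapse=TRUE}
a = np.arange(1, 4)
b = np.arange(4, 7)
np.add(a, b)        # same as a + b
np.multiply(a, b) == a * b
np.greater(a, 2)    # same as a > 2
```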
Instead of calling it "recycling", Numpy calls reusing elements of a shorter array in a binary operation [**broadcasting**](https://numpy.org/devdocs/user/theory.broadcasting.html)\index{broadcasting|see{recycling}}. It's the same idea as in R, but in general, Python is stricter and disallows more scenarios. For example, try running the following code on your machine. You should receive an error: `ValueError: operands could not be broadcast together with shapes (2,) (3,) `.
```{python, collapse=TRUE, error = TRUE, eval = FALSE}
np.arange(1,3)*np.arange(1,4)
```
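For contrast, here are a couple of shapes that Numpy will happily broadcast; a scalar (or length-one array) is stretched to match, much like recycling a length-one vector in R:
```{python, collapse=TRUE}
np.arange(1, 4) * 10                       # scalar stretched across the array
np.arange(1, 4) * np.array([10])           # a length-one array works the same way
np.arange(3).reshape(3, 1) + np.arange(3)  # shapes (3,1) and (3,) broadcast to (3,3)
```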
If you are working with string arrays, Numpy has a [`np.char` module with many useful functions](https://numpy.org/doc/stable/reference/routines.char.html#module-numpy.char).
```{python, collapse=TRUE, error = TRUE}
a = np.array(['a','b','c'])
np.char.upper(a)
```
Then there are the `Series` objects from Pandas. Ufuncs continue to work in the same way on `Series` objects, and they [respect common index values](https://jakevdp.github.io/PythonDataScienceHandbook/03.03-operations-in-pandas.html).
```{python, collapse = TRUE}
s1 = pd.Series(np.repeat(100,3))
s2 = pd.Series(np.repeat(10,3))
s1 + s2
```
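The example above uses identical default indexes, so the alignment is invisible. With mismatched indexes, only the shared labels are combined and everything else becomes `NaN`; a quick sketch:
```{python, collapse = TRUE}
s3 = pd.Series([1, 2, 3], index = ['a','b','c'])
s4 = pd.Series([10, 20, 30], index = ['b','c','d'])
s3 + s4 # only 'b' and 'c' line up; 'a' and 'd' become NaN
```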
If you feel more comfortable, and you want to coerce these `Series` objects to Numpy arrays before working with them, you can do that. For example, the following works.
```{python, collapse = TRUE}
s = pd.Series(np.linspace(-1,1,5))
np.exp(s.to_numpy())
```
In addition, `Series` objects possess many extra [attributes and methods](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas-series).
```{python, collapse = TRUE}
ints = pd.Series(np.arange(10))
ints.abs()
ints.mean()
ints.floordiv(2)
```
`Series` objects that have [text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#working-with-text-data) are a little bit different. For one, you have to access the `.str` attribute of the `Series` before calling any [vectorized methods](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html). Here are some examples.
```{python}
s = pd.Series(['a','b','c','33'])
s.dtype
s.str.isdigit()
s.str.replace('a', 'z')
```
String operations can be a big game changer, and we discuss text processing strategies in more detail in section \@ref(an-introduction-to-regular-expressions).
## Indexing Vectors in R
It is very common to want to extract or modify a subset of elements in a vector.\index{indexing!indexing vectors in R} There are a few ways to do this. All of the ways I discuss will involve the square bracket operator (i.e. `[]`). Feel free to retrieve the documentation by typing `?'['`.
```{R, collapse = TRUE}
allElements <- 1:6
allElements[seq(2,6,2)] # extract evens
allElements[-seq(2,6,2)] <- 99 # replace all odds with 99
allElements[allElements > 2] # get nums bigger than 2
```
To access the first element, we use the index `1`. To access the second, we use `2`, and so on. Also, the `-` sign tells R to remove elements. Both of these functionalities are *very different* from Python, as we will see shortly.
We can use names to access elements, too, but only if the elements are named.
```{R, collapse = TRUE}
sillyVec <- c("favorite"=1, "least favorite" = 2)
sillyVec['favorite']
```
## Indexing Numpy arrays
[Indexing Numpy arrays](https://numpy.org/doc/stable/user/basics.indexing.html) is very similar to indexing vectors in R\index{indexing!indexing Numpy arrays}. You use the square brackets, and you can do it with logical arrays or index arrays. There are some important differences, though.
For one, indexing is 0-based in Python. The `0`th element is the first element of an array. Another key difference is that the `-` isn't used to remove elements like it is in R, but rather to count backwards. Third, using one or two `:` inside square brackets is more flexible in Python. This is syntactic sugar for using the `slice()` function, which is similar to R's `seq()` function.
```{python, collapse=TRUE}
one_through_ten = np.arange(1, 11)
one_through_ten[np.array([2,3])]
one_through_ten[1:10:2] # evens
one_through_ten[::-1] # reversed
one_through_ten[-2] = 99 # second to last
one_through_ten
one_through_ten[one_through_ten > 3] # bigger than three
```
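Because the colon syntax is sugar for `slice()`, the two forms are interchangeable; this sketch restates two of the examples above with explicit `slice()` calls:
```{python, collapse=TRUE}
one_through_ten[slice(1, 10, 2)]        # same as one_through_ten[1:10:2]
one_through_ten[slice(None, None, -1)]  # same as one_through_ten[::-1]
```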
## Indexing Pandas' Series
There is little new that you *need* to learn to go from Numpy arrays to Pandas' Series objects.\index{indexing!indexing Pandas series} They still have the `[]` operator, and [many methods are shared across these two types](https://pandas.pydata.org/docs/reference/api/pandas.Series.html). The following is almost equivalent to the code above, and the only apparent difference is that the results are printed a little differently.
```{python, collapse=TRUE}
import pandas as pd
one_through_ten = pd.Series(np.arange(1, 11))
one_through_ten[np.array([2,3])]
one_through_ten[1:10:2] # evens
one_through_ten[::-1] # reversed
one_through_ten[-2] = 99 # second to last
one_through_ten
one_through_ten[one_through_ten > 3] # bigger than three
one_through_ten.sum()
```
However, [Pandas' Series have `.loc` and `.iloc` methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing). We won't talk much about these two methods now, but they will become very important when we start to discuss Pandas' data frames in section \@ref(data-frames-in-python).
```{python, collapse=TRUE}
one_through_ten.iloc[2]
one_through_ten.loc[2]
```
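The two lines above agree only because this `Series` has the default index `0, 1, 2, ...`. With a non-default index, `.iloc` (position-based) and `.loc` (label-based) can disagree; a minimal sketch:
```{python, collapse=TRUE}
scrambled = pd.Series([10, 20, 30], index = [2, 1, 0])
scrambled.iloc[0] # first element by position
scrambled.loc[0]  # element whose label is 0
```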
## Some Gotchas
### Shallow versus Deep Copies
In R, assignment usually produces a **deep copy.** In the code below, we create `b` from `a`. If we modify `b`, these changes don't affect `a`. This takes up more memory, but our program is easier to follow as we don't have to keep track of connections between objects.
```{r, collapse = TRUE}
# in R
a <- c(1,2,3)
b <- a
b[1] <- 999
a # still the same!
```
With Numpy arrays in Python, ["shallow copies" can be created by simple assignment, or by explicitly constructing a **view**](https://numpy.org/devdocs/user/quickstart.html#copies-and-views). In the code below, `a`, `b`, `c`, and `d` all share the same data. If you modify one, you change all the others.\index{copy!view in Python} This can make the program more confusing, but on the other hand, it can also improve computational efficiency.
```{python, collapse = TRUE}
# in python
a = np.array([1,2,3])
b = a # b is an alias
c = a.view() # c is a view
d = a[:]
b[0] = 999
a # two names for the same object in memory
b
c
d
```
It's the same story with Pandas' Series objects. You're usually making a "shallow" copy.\index{copy!shallow copy}
```{python, collapse = TRUE}
# in python
import pandas as pd
s1 = pd.Series(np.array([100.0,200.0,300.0]))
s2 = s1
s3 = s1.view()
s4 = s1[:]
s1[0] = 999
s1
s2
s3
s4
```
If you want a "deep copy" in Python, you usually want a function or method called `copy()`. Use `np.copy` or [`np.ndarray.copy`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.copy.html#numpy-ndarray-copy) when you have a Numpy array.\index{copy!deep copy}
```{python, collapse = TRUE}
# in python
a = np.array([1,2,3])
b = np.copy(a)
c = a.copy()
b[0] = 999
a
b
c
```
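If you are ever unsure whether two Numpy arrays share the same underlying data, `np.shares_memory()` can tell you; continuing the example above:
```{python, collapse = TRUE}
np.shares_memory(a, a[:])     # a view shares data with the original
np.shares_memory(a, a.copy()) # a deep copy does not
```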
Use [`pandas.Series.copy`](https://pandas.pydata.org/docs/reference/api/pandas.Series.copy.html#pandas-series-copy) with Pandas' Series objects. Make sure not to set the `deep` argument to `False`. Otherwise you'll get a shallow copy.
```{python, collapse = TRUE}
# in python
s1 = pd.Series(np.array([1,2,3]))
s2 = s1.copy()
s3 = s1.copy(deep=False)
s1[0] = 999
s1
s2
s3
```
### How R and Python Handle Missing Values
R has `NULL`, `NaN`, and `NA`. Python has `None` and `np.nan`. If your eyes are glazing over already and you're thinking "they all look the same"--they are not.\index{missing data}
R's `NULL` and Python's `None` are similar. Both represent "nothingness." This is *not* the same as `0`, or an empty string, or `FALSE`/`False`. This is commonly used to detect if a user fails to pass in an argument to a function, or if a function fails to "return" (more information on functions can be found in section \@ref(functions)) anything meaningful.
In R, for example, if a function fails to return anything, then it actually returns a `NULL`. [A `NULL` object has its own type.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#NULL-object)
```{r, collapse = TRUE, warning=TRUE}
NULL == FALSE
NULL == NULL
# create a function that doesn't return anything
# more information on this later
doNothingFunc <- function(a){}
thing <- doNothingFunc() # call our new function
is.null(thing)
typeof(NULL)
```
In Python, we have the following.
```{python, collapse = TRUE}
None == False
None == None
# create a function that doesn't return anything
# more information on this later
def do_nothing_func():
    pass
thing = do_nothing_func()
if thing is None:
    print("thing is None!")
type(None)
```
"NaN" stands for "not a number." `NaN` is an object of type `double` in R, and `np.nan` is of type `float` in Python. It can come in handy when you (deliberately or accidentally) perform undefined calculations such as $0/0$ or $\infty / -\infty$.
```{r, collapse = TRUE}
# in R
0/0
Inf/Inf
is.na(0/0)
```
```{python, collapse = TRUE, error = TRUE}
# in Python
# 0/0
# the above yields a ZeroDivisionError
import numpy as np
np.inf/np.inf
np.isnan(np.nan)
```
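One related quirk worth knowing: `NaN` never compares equal to anything, not even itself, so test for it with `np.isnan()` (or `is.na()` in R) rather than with `==`.
```{python, collapse = TRUE}
np.nan == np.nan                  # False!
np.isnan(np.array([1.0, np.nan])) # the right way to check
```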
"NA" is short for "not available." Missing data is a fact of life in data science. Observations are often missing in data sets, introduced after joining/merging data sets together (more on this in section \@ref(merging-or-joining-data-sets)), or arise from calculations involving underflow and overflow. There are many techniques designed to estimate quantities in the presence of missing data. When you code them up, you'll need to make sure you deal with `NA`s properly.
```{r, collapse = TRUE}
# in R
babyData <- c(0,-1,9,NA,21)
NA == TRUE
is.na(babyData)
typeof(NA)
```
Unfortunately, Python's support of an `NA`-like object is more limited. There is no `NA` object in base Python. And often `NaN`s will appear in place of an `NA`. There are a few useful tools, though. The Numpy library offers ["masked arrays"](https://numpy.org/devdocs/reference/maskedarray.html), for instance.
Also, as of version `1.0.0`, the [pandas library](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) has an experimental `pd.NA` object. However, they [warn](https://pandas.pydata.org/pandas-docs/dev/user_guide/missing_data.html#missing-data-na) that "the behaviour of `pd.NA` can still change without warning."
```{python, collapse = TRUE}
import numpy as np
import numpy.ma as ma
baby_data = ma.array([0,-1,9,-9999, 21]) # -9999 "stands for" missing
baby_data[3] = ma.masked
np.average(baby_data)
```
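If you would like to experiment with `pd.NA` itself, one route is through Pandas' nullable dtypes. Here is a small sketch, assuming pandas 1.0 or later and keeping in mind the warning above that this behavior may change:
```{python, collapse = TRUE}
s = pd.Series([0, -1, 9, pd.NA, 21], dtype = "Int64") # nullable integer dtype
s.isna()
s.mean() # missing values are skipped by default
```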
::: {.rmd-caution data-latex=""}
Be careful of using extreme values to stand in for what should be an `NA`. Be aware that some data providers will follow this strategy. I recommend that you avoid it yourself. Failing to represent a missing value correctly can lead to wildly incorrect calculations!
:::
## An Introduction to Regular Expressions
We have already talked a little about how to work with text data in this book. Regarding Python, section \@ref(vectorization-in-python) mentioned that Pandas `Series` objects have a [`.str` accessor attribute](https://pandas.pydata.org/pandas-docs/version/1.3/user_guide/text.html#string-methods) that has plenty of special methods that will work on string data. The same tools can be used whether or not these `Series` objects are contained in a Pandas `DataFrame`.
Regarding R, `character` `vector`s were first mentioned in section \@ref(overview-of-r). There are many functions that operate on these, too, regardless of whether they are held in a `data.frame`. The functions might be a little harder to find because they aren't methods, so pressing `<Tab>` and using your GUI's autocomplete feature doesn't reveal them as easily.
Suppose you're interested in replacing lowercase letters with uppercase ones, removing certain characters from text, or counting the number of times a certain expression appears. Up until now, as long as you could find a function or method that performs the task, you were doing just fine. If you need to do something with text data, there's probably a function for it.
Notice what all of these tasks have in common--they all require the ability to find patterns. When your patterns are easy to describe (e.g. find all lowercase "a"s), then all is well. What can make matters more complicated, however, is when the patterns are more difficult to describe (e.g. find all valid email addresses). That is why this section is primarily concerned with **regular expressions,** a tool that helps you describe patterns in text [@rfords] [@pythonregexprs]\index{regular expressions}.
### Literal Characters versus Metacharacters
Every character in a regular expression is interpreted in one of two ways. Either it is interpreted as a
1. literal character, or as a
2. metacharacter.
If it is a literal character, then the character is the *literal* pattern. For example, in the regular expression "e", the character "e" has a literal interpretation. If you seek to capitalize all instances of "e" in the following phrase, you can do it pretty easily. As long as you know which function performs find-and-replace, you're good. The pattern is trivial to specify.
On the other hand, if I asked you to remove `$`s from price or salary data, you might have a little more difficulty. This is because `$` is a *metacharacter* in regular expressions, and so it has a special meaning.^[The dollar sign is useful if you only want to find certain patterns that finish a line. It takes the characters preceding it, and says, only look for that pattern if it comes at the end of a string.] In the examples below, if `$` is interpreted as a regular expression, the pattern will not be found at all, despite the prevalence of *literal* dollar signs.
There are a few functions in R that perform find-and-replace; in this case, I use `gsub()`. In Pandas, I can use [`.str.replace()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html) to do this. Here are some examples that find patterns described by literal characters.
```{r, collapse = TRUE}
# in R
gsub(pattern = "e", replacement = "E",
x = "I don't need a regex for this!")
```
```{python, collapse = TRUE}
# in Python
import pandas as pd
s = pd.Series(["I don't need a regex for this!"])
s.str.replace(pat="e",repl="E")
```
On the other hand, here are a few examples that remove dollar signs. We generally have two options to recognize symbols that happen to be metacharacters.\index{escape character}
1. We can *escape* the dollar sign. That means you need to put a backslash (i.e. `\`) before the dollar sign. The backslash is a metacharacter that looks at the character coming after it, and either removes the special meaning from a metacharacter or adds special meaning to a literal character.
2. Alternatively, we can tell the function to ignore regular expressions. `gsub()` can take `fixed=TRUE`, and `.str.replace()` can take `regex=False`.
```{python, collapse = TRUE}
# in Python
pd.Series(["$100, $200"]).str.replace(pat="$",repl="",regex=False)
pd.Series(["$100, $200"]).str.replace(pat="\$",repl="")
```
```{r, collapse = TRUE}
# in R
gsub(pattern = "$", replacement = "", x = c("$100, $200"), fixed=TRUE)
stringr::str_remove_all(c("$100, $200"), pattern = "\\$")
```
### The Trouble With Backslashes: Escape Sequences
You might have noticed something confusing above--sometimes in Python and in R, we need *two* backslashes instead of *one*. This is because backslashes have a second purpose that complicates using them to escape metacharacters: they also help us write untypeable characters via **escape sequences**\index{escape sequence}. We need to be able to do this even if we aren't using regular expressions.
For instance, consider the way we type the "newline" character. Even though it is understood by the computer as one character, it takes us two keystrokes to write it with our keyboard. `\` is one character, and `n` is another, but together they are one!
```{r, collapse = TRUE}
nchar('\n') #in R
```
```{python, collapse = TRUE}
len('\n') #in Python
```
`str`s in Python and `character` `vector`s in R will look for these combinations by default. When we specify regular expressions with strings, the backslashes are used for this purpose first; their regular expression meaning is a second priority.
The reason we used `\\$` in the R example above is that the first backslash escapes the second, so the string itself contains `\$`, which the regex engine then reads as a literal dollar sign. `\$` on its own is not a recognized escape sequence, and Python and R handle that differently. Python will not recognize it, and it won't complain that it didn't. On the other hand, R will throw an error that it can't recognize it.
```{python, collapse = TRUE}
len('\$') # in Python, not special
```
```{r, error = TRUE, collapse = TRUE}
nchar('\$') # R gets confused
```
There is another way to deal with this issue--**raw strings!** Raw strings make life easier because they do not interpret backslashes as the beginning of escape sequences. You can make them in R and Python by putting an "r" in front of the quotation marks. However, it is slightly more complicated in R because you need a delimiter pair inside the quotation marks--for more information type `?Quotes` in your R console.
```{python, collapse = TRUE}
len(r'\n') # in Python
```
```{r, error = TRUE, collapse = TRUE}
nchar(r'{\$}') # in R
```
### More Examples of Using Regular Expressions
Regular expressions that match many different types of characters are often very useful--these are called **character classes.** For example, `.` represents any character except a newline, `\d` represents any digit, and `\s` represents any whitespace character. You can sometimes capitalize the letters in the regular expression to get the opposite pattern.
```{r, collapse=TRUE}
# anonymize phone numbers in R
gsub(pattern = r"{\d}", replacement = "X", x = "(202)555-0191")
```
```{python, collapse=TRUE}
# remove everything that isn't a number in Python
pd.Series(["$100"]).str.replace(pat="\D",repl="")
```
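Here is one more character-class sketch, this time using `\s` (any whitespace character), which was mentioned above but not demonstrated:
```{python, collapse=TRUE}
# replace every whitespace character with an underscore in Python
pd.Series(["a b\tc"]).str.replace(pat = r"\s", repl = "_", regex = True)
```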
Many character classes are written with opening and closing square brackets. For instance, `[1-5]` matches any digit between $1$ and $5$ (inclusive), `[aeiouy]` matches any lowercase vowel, and `[\^\-]` matches either `^` or `-` (we had to escape these two metacharacters because we are only interested in the literal pattern).
```{r, collapse=TRUE}
# remove vowels in R
gsub(pattern = "[aeiouy]", replacement = "",
x = "Can you still read this?")
```
Concatenating two patterns, one after another, forms a more specific pattern to be matched.
```{python, collapse=TRUE}
# convert date formats in Python
s = pd.Series(["2021-10-23","2021:10:23"])
s.str.replace(pat="[:\-]",repl="/")
```
If you would like one pattern or another to appear, you can use the **alternation operator** `|`.
```{python, collapse=TRUE}
# one or the other in Python
pd.Series(["this","that"]).str.contains(pat="this|that")
```
In addition to concatenation, alternation, and grouping, there are more general ways to *quantify* how many times the desired pattern will appear. `?` means zero or one time, `*` means zero or more, `+` will mean one or more, and there are a variety of ways to be even more specific with curly braces (e.g. `{3,17}` means anywhere from three to seventeen times).
```{r, collapse=TRUE}
# detect double o's in R
grepl(pattern = "o{2}", x = c("look","book","box", "hellooo"))
```
```{python, collapse=TRUE}
# detects phone number formats in Python
s = pd.Series(["(202)555-0191","202-555-0191"])
s.str.contains(pat=r"\(\d{3}\)\d{3}-\d{4}")
```
Notice that in the double "o" example, the word with three "o"s matched as well. Ruling that out requires the ability to *look ahead* of the match, to the next character, and evaluate it. You can look ahead or behind, and make assertions about what patterns are required or disallowed.
| Lookaround Regex | Meaning |
|-----------|-------------------|
| `(?=pattern)` | Positive lookahead for `pattern` |
| `(?!pattern)` | Negative lookahead for `pattern` |
| `(?<=pattern)` | Positive lookbehind for `pattern` |
| `(?<!pattern)` | Negative lookbehind for `pattern` |
After `oo` we specify `(?!o)` to disallow a third, trailing `o`.
```{python, collapse=TRUE}
# exactly two "o"s in Python?
s = pd.Series(["look","book","box", "hellooo"])
s.str.contains(pat="oo(?!o)")
```
However, this does not successfully remove `"hellooo"` because it will match on the *last* two "o"s of the word. To prevent this, we can prepend a `(?<!o)`, which disallows a leading "o", as well. In R, we also have to specify `perl=TRUE` to use Perl-compatible regular expressions.
```{r, collapse=TRUE}
# exactly two "o"s in R
grep(pattern = "(?<!o)oo(?!o)",
x = c("look","book","box", "hellooo"), perl = TRUE)
```
We also mention *anchoring.* If you only want to find a pattern at the beginning of text, use `^`. If you only want to find a pattern at the end of text, use `$`. Below we use [`.str.extract()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html), whose documentation makes reference to *capture groups.*\index{capture groups} Capture groups are just regular expressions grouped inside parentheses (e.g. `(this)`).
```{python, collapse=TRUE}
# extract emails with Pandas
s = pd.Series(["my email is [email protected]", "[email protected] is hidden"])
s.str.extract(r".( [a-z]+@[a-z]+\.com$)")
```
## Exercises
### R Questions
1.
Let's flip some coins! Generate a thousand flips of a fair coin. Use `rbinom`, and let heads be coded as `1` and tails coded as `0`.
a) Assign the thousand raw coin flips to a variable `flips`. Make sure the elements are integers, and make sure you flip a "fair" coin ($p=.5$).
b) Create a length `1000` `logical` `vector` called `isHeads`. Whenever you get a heads, make sure the corresponding element is `TRUE` and `FALSE` otherwise.
c) Create a variable called `numHeads` by tallying up the number of heads.
d) Calculate the percent of time that the number changes in `flips`. Assign your number to `acceptanceRate`. Try to write only one line of code to do this.
2.
Compute the elements of the tenth order Taylor approximation to $\exp(3)$ and store them in `taylorElems`. Do not sum them. Use only one expression, and do not use any loop. The approximation is,
$$
1 + 3 + 3^2/2! + 3^3/3! + \cdots + 3^{10}/10!
$$
You want to store each of those eleven numbers separately in a `numeric` `vector`.
3.
Do the following.
a) Create a vector called `nums` that contains all consecutive integers from $-100$ to $100$.
b) Create a `logical` `vector` that has the same length as the above, and contains `TRUE` whenever an element of the above is even, and `FALSE` otherwise. Call it `myLogicals`.
c) Assign the total number of `TRUE`s to `totalEven`.
d) Create a `vector` called `evens` of all even numbers from the above `vector`.
e) Create a `vector` called `reverse` that holds the reversed elements of `nums`.
4.
Let's say we wanted to calculate the following sum $\sum_{i=1}^N x_i$. If $N$ is large, or most of the $x_i$s are large, then we might bump up against the largest allowable number. This is the problem of *overflow*. The biggest integer and biggest floating point can be recovered by typing `.Machine$integer.max` and `.Machine$double.xmax`, respectively.
a) Assign `sumThese` to `exp(rep(1000,10))`. Are they finite? Can you sum them? If everything is all good, assign `TRUE` to `allGood`.
b) Theoretically, is the logarithm of the sum less than `.Machine$double.xmax`? Assign `TRUE` or `FALSE` to `noOverflowIssue`.
c) Assign the *naive* log-sum of these to `naiveLogSum`. Is the naive log sum finite?
d) Compute `betterSum`, one that doesn't overflow, using the *log-sum-exp* trick:
$$
\log\left( \sum_{i=1}^{10} x_i \right) = \log\left( \sum_{i=1}^{10} \exp[ \log(x_i) - m] \right) + m
$$
$m$ is usually chosen to be $\max_i \log x_i$. This is the same formula as above, which is nice. You can use the same code to combat both overflow and underflow.
e) If you're writing code, and you have a bunch of very large numbers, is it better to store those numbers, or store the logarithm of those numbers? Assign your answer to `whichBetter`. Use either the phrase `"logs"` or `"nologs"`.
5.
Say you have a `vector` of prices of some financial asset:
```{R, collapse = TRUE}
prices <- c(100.10, 95.98, 100.01, 99.87)
```
a) Use the natural logarithm and convert this vector into a vector of *log returns*. Call the variable `logReturns`. If $p_t$ is the price at time $t$, the log return ending at time $t$ is
\begin{equation}
r_t = \log \left( \frac{p_t}{p_{t-1}} \right) = \log p_t - \log p_{t-1}
\end{equation}
b) Do the same for *arithmetic returns*. These are regular percent changes if you scale by $100$. Call the variable `arithReturns`. The mathematical formula you need is
\begin{equation}
a_t = \left( \frac{p_t - p_{t-1} }{p_{t-1}} \right) \times 100
\end{equation}
6.
Consider the **mixture density** $f(y) = \int f(y \mid x) f(x) dx$ where
\begin{equation}
Y \mid X = x \sim \text{Normal}(0, x^2)
\end{equation}
and
\begin{equation}
X \sim \text{half-Cauchy}(0, 1).
\end{equation}
This distribution is a special case of a prior distribution that is used in Bayesian statistics [@horseshoe]. Note that the second parameter of the Normal distribution is its variance, not its standard deviation.
Suppose further that you are interested in calculating the probability that one of these random variables ends up being too far from the median:
\begin{equation}
\mathbb{P}[|Y| > 1] = \int_{y : |y| > 1} f(y)dy = \int_{y : |y| > 1} \int_{-\infty}^\infty f(y \mid x) f(x) dx dy.
\end{equation}
The following steps will demonstrate how you can use the **Monte-Carlo** [@monte-carlo-stat-methods] method to approximate this probability.
a) Simulate $X_1, \ldots, X_{5000}$ from a $\text{half-Cauchy}(0, 1)$ and call these samples `xSamps`. Hint: you can simulate from a $t$ distribution with one degree of freedom to sample from a Cauchy. Once you have regular Cauchy samples, take the absolute value of each one.
b) Simulate $Y_1 \mid X_1, \ldots, Y_{5000} \mid X_{5000}$ and call the samples `ySamps`.
c) Calculate the approximate probability using `ySamps` and call it `approxProbDev1`.
d) Why is simply "ignoring" `xSamps` (i.e. not using it in the averaging part of the computation), the samples you condition on, "equivalent" to "integrating out $x$"? Store a string response as a length $1$ character vector called `integratingOutResp`.
e) Calculate another **Rao-Blackwellized** Monte Carlo estimate of $\mathbb{P}[|Y| > 1]$ from `xSamps`. Call it `approxProbDev2`. Hint: $\mathbb{P}[|Y| > 1] = \mathbb{E}[\mathbb{P}(|Y| > 1 \mid X) ]$. Calculate $\mathbb{P}(|Y| > 1 \mid X=x)$ with pencil and paper, notice it is a function in $x$, apply that function to each of `xSamps`, and average all of it together.
f) Are you able to calculate an exact solution to $\mathbb{P}[|Y| > 1]$?
7.
Store the ordered uppercase letters of the alphabet in a length $26$ `character` `vector` called `myUpcaseLetters`. Do not hardcode this. Use a function, along with the variable `letters`.
a) Create a new vector called `withReplacements` that's the same as the previous `vector`, but replace all vowels with `"---"`. Again, do not hardcode this. Find a function that searches for patterns and performs a replacement whenever that pattern is found.
b) Create a length $26$ logical vector that is `TRUE` whenever an element of `letters` is a consonant, and `FALSE` everywhere else. Call it `consonant`.
### Python Questions
1.
Let's flip some coins (again)! Generate a thousand flips of a fair coin. Use `np.random.binomial`, and let heads be coded as `1` and tails coded as `0`.
a) Assign the thousand raw coin flips to a variable `flips`. Make sure the elements are integers, and make sure you flip a "fair" coin ($p=.5$).
b) Create a length `1000` `list` of `bool`s called `is_heads`. Whenever you get a heads, make sure the corresponding element is `True` and `False` otherwise.
c) Create a variable called `num_heads` by tallying up the number of heads.
d) Calculate the percent of time that the number changes in `flips`. Assign your number to `acceptance_rate`. Try to write only one line of code to do this.
2.
Create a Numpy `array` containing the numbers $1/2, 1/4, 1/8, \ldots, 1/1024$. Make sure to call it `my_array`.
3.
Do the following:
a) Create a `np.array` called `nums` that contains one hundred equally-spaced numbers starting from $-100$ and going to $100$ (inclusive).
b) Create a `bool` `np.array` that has the same length as the above, and contains `True` whenever an element of the above is less than ten units away from $0$, and `False` otherwise. Call it `my_logicals`.
c) Assign the total number of `True`s to `total_close`.
d) Create a `np.array` called `evens` of all even numbers from the above `np.array` (even numbers are necessarily integers).
e) Create a `np.array` called `reverse` that holds the reversed elements of `nums`.
4.
Let's say we wanted to calculate the following sum $\sum_{i=1}^N x_i$. We run into problems when this sum is close to $0$, too. This is the problem of *underflow*. The smallest positive floating point can be recovered by typing `np.nextafter(np.float64(0),np.float64(1))`.
a) Assign `sum_these` to the length ten array $(e^{-1000}, \ldots, e^{-1000})$. Use ` np.exp(np.repeat(-1000,10))`. Are the elements nonzero? Can you sum them? Is the sum correct? If everything is all good, assign `True` to `all_good`.
b) Theoretically, for which range of positive numbers is the logarithm of the number farther from $0$ than the number itself? Assign the lower bound to `lower_bound`, and the upper bound to `upper_bound`. Hint: `lower_bound` is $0$ because we're only looking at positive numbers, and because the logarithm tends to $-\infty$ as its argument approaches $0$.
c) Assign the *naive* log-sum of `sum_these` to `naive_log_sum`. Is the naive log sum finite on your computer? Should it be?
d) Compute `better_sum`, one that doesn't underflow, using the *log-sum-exp* trick. This one should be bounded away from $-\infty$.
$$
\log\left( \sum_{i=1}^{10} x_i \right) = \log\left( \sum_{i=1}^{10} \exp[ \log(x_i) - m] \right) + m
$$
$m$ is usually chosen to be $\max_i \log x_i$
e) If you're writing code, and you have a bunch of very small positive numbers (e.g. probabilities, densities, etc.), is it better to store those small numbers, or store the logarithm of those numbers? Assign your answer to `which_better`. Use either the phrase `"logs"` or `"nologs"`.
5.
Use `pd.read_csv` to correctly read in `"2013-10_Citi_Bike_trip_data_20K.csv"` as a data frame called `my_df`. Make sure to read `autograding_tips.html`.
a) Extract the `"starttime"` column into a separate `Series` called `s_times`.
b) Extract date strings of those elements into a `Series` called `date_strings`.
c) Extract time strings of those elements into a `Series` called `time_strings`.
6.
We will make use of the **Monte Carlo** method below. It is a technique to approximate expectations and probabilities. If $n$ is a large number, and $X_1, \ldots, X_n$ is a random sample drawn from the distribution of interest, then
\begin{equation}
\mathbb{P}(X > 6) \approx \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i > 6).
\end{equation}
If you haven't seen an **indicator function** before, it is defined as
\begin{equation}
\mathbf{1}(X_i > 6)
=
\begin{cases}
1 & X_i > 6 \\
0 & X_i \le 6
\end{cases}.
\end{equation}
If you wanted to visualize it, $\mathbf{1}(x > 6)$ looks like this.
```{r, echo=FALSE, out.width="50%", collapse = TRUE, fig.cap = "An Indicator Function", fig.align='center'}
leX <- seq(6-5,6+5, .01)
plot(leX, leX > 6, type = "l", xlab = "x", ylab = "1(x>6)")
```
So, the sum in this expression is just a count of the number of elements that are greater than $6$.
a) Evaluate exactly the probability that a normal random variable with mean $5$ and standard deviation $6$ is greater than $6$. Assign it to the variable `exact_exceedance_prob` in Python.
b) Simulate $1e3$ times from a standard normal distribution (mean 0 and variance 1). Call the samples `stand_norm_samps`.
c) Calculate a Monte Carlo estimate of $\mathbb{P}(X > 6)$ from these samples. Call it `approx_exceedance_prob1`.
d) Simulate $1e3$ times from a normal distribution with mean $5$ and standard deviation $6$. Call the samples `norm_samps`. Don't use the old samples in any way.
e) Calculate a Monte Carlo estimate of $\mathbb{P}(X > 6)$ from these new `norm_samps`. Call it `approx_exceedance_prob2`.
7.
Alternatively, we can approximate expectations using the same technique as above. If $\mathbb{E}[g(X)]$ exists, $n$ is a large number, and $W_1, \ldots, W_n$ is a random sample drawn from the distribution of interest, then
\begin{equation}
\mathbb{E}[g(W)] \approx \frac{1}{n}\sum_{i=1}^n g(W_i).
\end{equation}
Here's a new distribution. It is a **mixture distribution**, specifically a **finite mixture of normal distributions**: $f(y) = f(y \mid X=1)P(X=1) + f(y \mid X=0)P(X=0)$ where
\begin{equation}
Y \mid X=0 \sim \text{Normal}(0, 2) \\
Y \mid X=1 \sim \text{Normal}(10, 2)
\end{equation}
and
\begin{equation}
X \sim \text{Bernoulli}(.5).
\end{equation}
Both $f(y \mid X=0)$ and $f(y \mid X=1)$ are bell-curved, and $f(y)$ looks like this
```{r, echo=FALSE, out.width="50%", fig.align = 'center', fig.cap = "The Marginal Density of Y"}
leX <- seq(5-20,5+20, .1)
plot(leX, .5*dnorm(leX, 0, sqrt(2)) + .5*dnorm(leX, 10, sqrt(2)), type = "l", xlab = "y", ylab = "f(y)")
```
a) Evaluate exactly $\mathbb{E}[Y] = \mathbb{E}[ \mathbb{E}(Y \mid X) ]$. Assign it to the variable `exact_mean` in Python.
b) Simulate $1e3$ times from the Bernoulli distribution. Call the samples `bernoulli_flips`
c) Simulate $Y_1 \mid X_1, \ldots, Y_{1000} \mid X_{1000}$ and call the samples `cond_norm_samps`.
d) Calculate a Monte Carlo estimate of $\mathbb{E}[Y]$ from `cond_norm_samps`. Call it `approx_ave_1`. Why is simply "ignoring" `bernoulli_flips`, the samples you condition on, "equivalent" to "integrating them out?"
e) Calculate a **Rao-Blackwellized** Monte Carlo estimate of $\mathbb{E}[Y]$ from `bernoulli_flips`. Call it `approx_ave_2`. Hint: $\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}(Y \mid X) ]$. Calculate $\mathbb{E}(Y \mid X_i)$ exactly, and evaluate that function on each $X_i$ sample, and then average them together. Rao-Blackwellization is a variance-reduction technique that allows you to come up with lower-variance estimates given a fixed computational budget.