# Working with Unix
> It is not the knowing that is difficult, but the doing. - Chinese proverb
## Self-Help
Each of the commands that we've discussed so far is thoroughly documented, and
you can view their documentation using the `man` command, where the first
argument to `man` is the command you're curious about. Let's take a look at the
documentation for `ls`:
```{r, engine='bash', eval=FALSE}
man ls
```
```
LS(1) BSD General Commands Manual LS(1)
NAME
ls -- list directory contents
SYNOPSIS
ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]
DESCRIPTION
For each operand that names a file of a type other than directory, ls
displays its name as well as any requested, associated information. For
:
```
The controls for navigating `man` pages are the same as they are for `less`.
I often use `man` pages for quickly searching for an option that I've forgotten.
Let's say that I forgot how to get `ls` to print a long list. After typing
`man ls` to open the page, type `/` in order to start a search. Then type the
word or phrase that you're searching for, in this case type in `long list` and
then press `Enter`. The page jumps to this entry:
```
-l (The lowercase letter ``ell''.) List in long format. (See below.)
If the output is to a terminal, a total sum for all the file sizes is
output on a line before the long listing.
```
Press the `n` key in order to search for the next occurrence of the word, and if
you want to go to the previous occurrence type `Shift` + `n`. This method of
searching also works with `less`. When you're finished looking at a `man` page
type `q` to get back to the prompt.
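
If you already know roughly what you're looking for, you can also search a `man` page without opening the pager at all. The sketch below is one possible way to do that: `search_manual` is a hypothetical helper name (not a standard command), and it assumes that both `man` and `col` are installed on your system.

```bash
# search_manual is a hypothetical helper, not a standard command.
# Usage: search_manual COMMAND PHRASE
search_manual() {
  # col -b strips the backspace-based bold/underline formatting that some
  # versions of man emit. grep -n -i prints matching lines with their line
  # numbers and ignores case. The -- keeps a phrase such as "-l" from being
  # read as a flag to grep.
  man "$1" 2>/dev/null | col -b | grep -n -i -- "$2"
}

# Only try it when man is actually available on this machine.
if command -v man >/dev/null 2>&1; then
  search_manual ls "long format" || echo "no match found"
fi
```
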
The `man` command works wonderfully when you know which command you want to look
up, but what if you've forgotten the name of the command you're looking for? You
can use `apropos` to search all of the available commands and their
descriptions. For example, let's pretend that I forgot the name of my favorite
command line text editor. I could type `apropos editor` into the command line,
which will print a list of results:
```{r, engine='bash', eval=FALSE}
apropos editor
```
```
## ed(1), red(1) - text editor
## nano(1) - Nano's ANOther editor, an enhanced free Pico clone
## sed(1) - stream editor
## vim(1) - Vi IMproved, a programmers text editor
```
The second result is `nano` which was just on the tip of my tongue! Both `man`
and `apropos` are useful when a search is only a few keystrokes away, but if
you're looking for detailed examples and explanations you're better off using
a search engine if you have access to a web browser.
### Summary
- Use `man` to look up the documentation for a command.
- If you can't think of the name of a command use `apropos` to search for a word
associated with that command.
- If you have access to a web browser, using a search engine might be better
than `man` or `apropos`.
### Exercises
1. Use `man` to look up the flag for human-readable output from `ls`.
2. Get help with `man` by typing `man man` into the console.
3. Wouldn't it be nice if there was a calendar command? Use `apropos` to look
for such a command, then use `man` to read about how that command works.
## Get Wild
Let's go into my `Photos` folder in my home directory and take a look around:
```{r, engine='bash', eval=FALSE}
pwd
```
```
## /Users/sean
```
```{r, engine='bash', eval=FALSE}
ls
```
```
## Code
## Documents
## Photos
## Desktop
## Music
## todo-2017-01-24.txt
```
```{r, engine='bash', eval=FALSE}
cd Photos
ls
```
```
## 2016-06-20-datasci01.png
## 2016-06-20-datasci02.png
## 2016-06-20-datasci03.png
## 2016-06-21-lab01.jpg
## 2016-06-21-lab02.jpg
## 2017-01-02-hiking01.jpg
## 2017-01-02-hiking02.jpg
## 2017-02-10-hiking01.jpg
## 2017-02-10-hiking02.jpg
```
I've just been dumping pictures and figures into this folder without organizing
them at all! Thankfully (in the words of Dr. Jenny Bryan) [I have an unwavering
commitment to the ISO 8601 date
standard](https://twitter.com/JennyBryan/status/816143967695687684) so at least
I know when these photos were taken. Instead of using `mv` to move around each
individual photo I can select groups of photos using the `*` wildcard. A
**wildcard** is a character that represents other characters, much like how a
joker in a deck of cards can represent other cards in the deck. Wildcards are
a subset of metacharacters, a topic which we will discuss in detail later on in
this chapter. The `*` ("star") wildcard represents *zero or more of any
character*, and it can be used to match names of files and folders in the
command line. For example if I wanted to list all of the files in my Photos
directory which have a name that starts with "2017" I could do the following:
```{r, engine='bash', eval=FALSE}
ls 2017*
```
```
## 2017-01-02-hiking01.jpg
## 2017-01-02-hiking02.jpg
## 2017-02-10-hiking01.jpg
## 2017-02-10-hiking02.jpg
```
Only the files starting with "2017" are listed! The command `ls 2017*` literally
means: list the files that start with "2017" followed by zero or more of any
character. As you can imagine using wildcards is a powerful tool for working
with groups of files that are similarly named.
Let's walk through a few other examples of using the star wildcard. We could
list only the photos starting with "2016":
```{r, engine='bash', eval=FALSE}
ls 2016*
```
```
## 2016-06-20-datasci01.png
## 2016-06-20-datasci02.png
## 2016-06-20-datasci03.png
## 2016-06-21-lab01.jpg
## 2016-06-21-lab02.jpg
```
We could list only the files with names ending in `.jpg`:
```{r, engine='bash', eval=FALSE}
ls *.jpg
```
```
## 2016-06-21-lab01.jpg
## 2016-06-21-lab02.jpg
## 2017-01-02-hiking01.jpg
## 2017-01-02-hiking02.jpg
## 2017-02-10-hiking01.jpg
## 2017-02-10-hiking02.jpg
```
In the case above the file name can start with a sequence of zero or more of
any character, but the file name must end in `.jpg`.
Or we could also list only the first photos from each set of photos:
```{r, engine='bash', eval=FALSE}
ls *01.*
```
```
## 2016-06-20-datasci01.png
## 2016-06-21-lab01.jpg
## 2017-01-02-hiking01.jpg
## 2017-02-10-hiking01.jpg
```
All of the files above have names that are composed of a sequence of characters,
followed by the adjacent characters `01.`, followed by another sequence of
characters.
Notice that if I had entered `ls *01*` into the console every file would have
been listed since `01` is a part of all of the file names in my Photos
directory.
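
To experiment with the star wildcard without touching real photos, you can recreate the file names above as empty files in a scratch directory. This is a minimal sketch: the scratch directory comes from `mktemp -d`, and the variable names are only for illustration.

```bash
# Work in a throwaway directory so nothing real is affected.
cd "$(mktemp -d)"

# Recreate the nine file names from the Photos directory as empty files.
touch 2016-06-20-datasci01.png 2016-06-20-datasci02.png 2016-06-20-datasci03.png
touch 2016-06-21-lab01.jpg 2016-06-21-lab02.jpg
touch 2017-01-02-hiking01.jpg 2017-01-02-hiking02.jpg
touch 2017-02-10-hiking01.jpg 2017-02-10-hiking02.jpg

# The shell expands each pattern into a list of matching names before the
# command runs, so echo shows exactly what ls would receive.
echo 2017*

# Count how many names each pattern matches.
from_2017=$(ls 2017* | wc -l | tr -d ' ')
jpgs=$(ls *.jpg | wc -l | tr -d ' ')
firsts=$(ls *01.* | wc -l | tr -d ' ')
everything=$(ls *01* | wc -l | tr -d ' ')
echo "$from_2017 $jpgs $firsts $everything"
```

The last line prints `4 6 4 9`: the final pattern matches all nine files because "01" appears inside both "2016" and "2017".
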
Let's organize these photos by year. First let's create one directory for
each year of photos:
```{r, engine='bash', eval=FALSE}
mkdir 2016
mkdir 2017
```
Now we can move the photos using wildcards:
```{r, engine='bash', eval=FALSE}
mv 2017-* 2017/
ls
```
```
## 2016
## 2016-06-20-datasci01.png
## 2016-06-20-datasci02.png
## 2016-06-20-datasci03.png
## 2016-06-21-lab01.jpg
## 2016-06-21-lab02.jpg
## 2017
```
Notice that I've moved all files that start with "2017-" into the 2017 folder!
Now let's do the same thing for files with names starting with "2016-":
```{r, engine='bash', eval=FALSE}
mv 2016-* 2016/
ls
```
```
## 2016
## 2017
```
Finally my photos are somewhat organized! Let's list the files in each directory
just to make sure all was moved as planned:
```{r, engine='bash', eval=FALSE}
ls 2016/
```
```
## 2016-06-20-datasci01.png
## 2016-06-20-datasci02.png
## 2016-06-20-datasci03.png
## 2016-06-21-lab01.jpg
## 2016-06-21-lab02.jpg
```
```{r, engine='bash', eval=FALSE}
ls 2017/
```
```
## 2017-01-02-hiking01.jpg
## 2017-01-02-hiking02.jpg
## 2017-02-10-hiking01.jpg
## 2017-02-10-hiking02.jpg
```
Looks good! There are a few more wildcards beyond the star wildcard which we'll
discuss in the next section where searching file names gets a little more
advanced.
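
The same two steps, `mkdir` followed by `mv` with a wildcard, can be repeated for any number of years with a loop. Loops haven't been covered at this point, so treat this as a preview sketch rather than the method used above: it runs in a scratch directory with a few stand-in file names.

```bash
# Work in a throwaway directory with four stand-in photos.
cd "$(mktemp -d)"
touch 2016-06-20-datasci01.png 2016-06-21-lab01.jpg
touch 2017-01-02-hiking01.jpg 2017-02-10-hiking01.jpg

for year in 2016 2017; do
  mkdir -p "$year"        # -p: no error if the directory already exists
  mv "$year"-* "$year"/   # the dash keeps the pattern from matching the directory itself
done

# List the contents of both new directories.
ls 2016 2017
```
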
### Summary
- Wildcards can represent many kinds and numbers of characters.
- The star wildcard (`*`) represents zero or more of any character.
- You can use wildcards on the command line in order to work with multiple files
and folders.
### Exercises
1. Before I organized the photos by year, what command would have listed all of
the photos of type `.png`?
2. Before I organized the photos by year, what command would have deleted all of
my hiking photos?
3. What series of commands would you use in order to put my figures for a data
science course and the pictures I took in the lab into their own folders?
## Search
### Regular Expressions
The ability to search through files and folders can greatly improve your
productivity using Unix. First we'll cover searching through text files.
I recently downloaded a list of the names of the states in the US which you
can find [here](http://seankross.com/notes/states.txt). Let's take a look at
this file:
```{r, engine='bash', eval=FALSE}
cd ~/Documents
ls
```
```
## canada.txt
## states.txt
```
```{r, engine='bash', eval=FALSE}
wc states.txt
```
```
## 50 60 472 states.txt
```
It makes sense that there are 50 lines, but it's interesting that there are 60
total words. Let's take a peek at the beginning of the file:
```{r, engine='bash', eval=FALSE}
head states.txt
```
```
## Alabama
## Alaska
## Arizona
## Arkansas
## California
## Colorado
## Connecticut
## Delaware
## Florida
## Georgia
```
This file looks basically how you would expect it to look! You may recall from
Chapter 3 that the kind of shell that we're using is the bash shell. Bash
treats different kinds of data differently, and we'll dive deeper into data
types in Chapter 5. For now all you need to know is that text data are
called **strings**. A string could be a word, a sentence, a book, or a file or
folder name. One of the most effective ways to search through strings is to use
**regular expressions**. Regular expressions are strings that define patterns
in other strings. You can use regular expressions to search for a sub-string
contained within a larger string, or to replace one part of a string with
another string.
One of the most popular tools for searching through text files is `grep`. The
simplest use of `grep` requires two arguments: a regular expression and a text
file to search. Let's see a simple example of `grep` in action and then I'll
explain how it works:
```{r, engine='bash', eval=FALSE}
grep "x" states.txt
```
```
## New Mexico
## Texas
```
In the command above, the first argument to `grep` is the regular expression
`"x"`. The `"x"` regular expression represents one instance of the letter "x".
Every line of the `states.txt` file that contains at least one instance of the
letter "x" is printed to the console. As you can see New Mexico and Texas are
the only two state names that contain the letter "x". Let's try searching for
the letter "q" in all of the state names using `grep`:
```{r, engine='bash', eval=FALSE}
grep "q" states.txt
```
Nothing is printed to the console because the letter "q" isn't in any of the
state names. We can search for more than individual characters though. For
example the following command will search for the state names that contain the
word "New":
```{r, engine='bash', eval=FALSE}
grep "New" states.txt
```
```
## New Hampshire
## New Jersey
## New Mexico
## New York
```
In the previous case the regular expression we used was simply `"New"`, which
represents an occurrence of the string "New". Regular expressions are not
limited to just being individual characters or words; they can also represent
parts of words. For example I could search all of the state names that contain
the string "nia" with the following command:
```{r, engine='bash', eval=FALSE}
grep "nia" states.txt
```
```
## California
## Pennsylvania
## Virginia
## West Virginia
```
All of the state names above happen to end with the string "nia".
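
Here is a small self-contained sketch of the same idea. It builds a stand-in file, `sample.txt`, holding only six state names (an assumption made so that the example runs without downloading anything), then searches it with plain strings. The `-c` flag, which hasn't been used above, makes `grep` print the number of matching lines instead of the lines themselves.

```bash
# Build a six-line stand-in for states.txt in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' "California" "New Mexico" "New York" "Texas" "Virginia" "West Virginia" > sample.txt

grep "x" sample.txt      # lines containing the letter x
grep "nia" sample.txt    # lines containing the string nia

# Count matching lines rather than printing them.
with_new=$(grep -c "New" sample.txt)
with_space=$(grep -c " " sample.txt)
echo "$with_new lines contain New, $with_space lines contain a space"
```

Counting the lines that contain a space is also one way to see why `wc` reported more words than lines for `states.txt`: every two-word state name adds one extra word.
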
### Metacharacters
Regular expressions aren't just limited to searching with characters and
strings; the real power of regular expressions comes from using
**metacharacters**. Remember that metacharacters are characters that can be used
to represent other characters. To take full advantage of all of the metacharacters
we should use `grep`'s cousin `egrep`, which just extends `grep`'s capabilities.
If you're using Ubuntu you should use `grep -P` instead of `egrep` for results
that are consistent with this chapter.
The first metacharacter we should discuss is the `"."` (period) metacharacter,
which represents *any* character. If for example I wanted to search `states.txt`
for the character "i", followed by any character, followed by the character "g"
I could do so with the following command:
```{r, engine='bash', eval=FALSE}
egrep "i.g" states.txt
```
```
## Virginia
## Washington
## West Virginia
## Wyoming
```
The regular expression "i.g" matches the sub-string "irg" in V*irg*inia, and
West V*irg*inia, and it matches the sub-string "ing" in Wash*ing*ton and
Wyom*ing*. The period metacharacter is a stand-in for the "r" in "irg" and the
"n" in "ing" in the example above. The period metacharacter is extremely liberal:
for example, the command `egrep "." states.txt` would return every line of
states.txt since the regular expression `"."` would match one occurrence of any
character on every line (there's at least one character on every line).
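
Adding the `-o` flag (one that this section hasn't used) makes `grep` print only the part of each line that matched, which is a handy way to see what the period is standing in for. A minimal sketch, using a short stand-in file rather than the full `states.txt`:

```bash
# Four stand-in lines in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' "Virginia" "Washington" "Wyoming" "Texas" > sample.txt

# grep -E behaves like egrep; -o prints only the matched text, one match per line.
grep -E -o "i.g" sample.txt

# Collect the matches on a single line for easier comparison.
matches=$(grep -E -o "i.g" sample.txt | tr '\n' ' ')
echo "$matches"
```
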
Besides characters that can represent other
characters, there are also metacharacters called **quantifiers** which allow you
to specify the number of times a particular regular expression should appear in
a string. One of the most basic quantifiers is `"+"` (plus) which represents one
or more occurrences of the preceding expression. For example the regular
expression "s+as" means: one or more "s" followed by "as". Let's see if any of
the state names match this expression:
```{r, engine='bash', eval=FALSE}
egrep "s+as" states.txt
```
```
## Arkansas
## Kansas
```
Both Arkan*sas* and Kan*sas* match the regular expression `"s+as"`. Besides the
plus metacharacter there's also the `"*"` (star) metacharacter which represents
zero or more occurrences of the preceding expression. Let's see what happens if
we change `"s+as"` to `"s*as"`:
```{r, engine='bash', eval=FALSE}
egrep "s*as" states.txt
```
```
## Alaska
## Arkansas
## Kansas
## Massachusetts
## Nebraska
## Texas
## Washington
```
As you can see the star metacharacter is much more liberal with respect to
matching since many more state names are matched by `"s*as"`. There are more
specific quantifiers you can use beyond "zero or more" or "one or more"
occurrences of an expression. You can use curly brackets (`{ }`) to specify an
exact number of occurrences of an expression. For example the regular expression
`"s{2}"` specifies exactly two occurrences of the character "s". Let's try using
this regular expression:
```{r, engine='bash', eval=FALSE}
egrep "s{2}" states.txt
```
```
## Massachusetts
## Mississippi
## Missouri
## Tennessee
```
Take note that the regular expression `"s{2}"` is equivalent to the regular
expression `"ss"`. We could also search for state names that have between two
and three adjacent occurrences of the letter "s" with the regular expression
`"s{2,3}"`:
```{r, engine='bash', eval=FALSE}
egrep "s{2,3}" states.txt
```
```
## Massachusetts
## Mississippi
## Missouri
## Tennessee
```
Of course the results are the same because there aren't any states that have "s"
repeated three times.
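
To see exactly which characters each quantifier consumes, you can feed single words to `grep` with `-o`, which prints only the matched text. This is a sketch for experimenting: `show_match` is a hypothetical helper, not a standard command.

```bash
# show_match PATTERN WORD: print the part of WORD matched by PATTERN,
# or a note if nothing matches (grep exits with status 1 in that case).
show_match() {
  echo "$2" | grep -E -o -- "$1" || echo "(no match)"
}

show_match "s+as" "Kansas"       # one or more s, then as: prints sas
show_match "s+as" "Texas"        # no s directly before as: prints (no match)
show_match "s*as" "Texas"        # zero s characters is allowed: prints as
show_match "s{2}" "Missouri"     # exactly two s in a row: prints ss
show_match "s{2,3}" "Missouri"   # two or three s in a row: prints ss
```
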
You can use a **capturing group** in order to search for multiple occurrences of
a string. You can create capturing groups within regular expressions by using
parentheses (`"( )"`). For example if I wanted to search states.txt for the
string "iss" occurring twice in a state name I could use a capturing group and
a quantifier like so:
```{r, engine='bash', eval=FALSE}
egrep "(iss){2}" states.txt
```
```
## Mississippi
```
We could combine more quantifiers and capturing groups to dream up even more
complicated regular expressions. For example, the following regular expression
describes three occurrences of an "i" followed by two of any character:
```{r, engine='bash', eval=FALSE}
egrep "(i.{2}){3}" states.txt
```
```
## Mississippi
```
The complex regular expression above still only matches "Mississippi".
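
The parentheses matter because a quantifier applies only to the single item directly before it. The sketch below compares the grouped pattern with an ungrouped one, using `grep -E -o` to print just the matched text.

```bash
# With a group, {2} repeats the whole string iss.
echo "Mississippi" | grep -E -o "(iss){2}"

# Without a group, {2} repeats only the final s, so the pattern means "isss",
# which does not occur in Mississippi. grep then exits with status 1.
echo "Mississippi" | grep -E -o "iss{2}" || echo "(no match)"

# Keep both results in variables: the matched text, and a count of matching lines.
grouped=$(echo "Mississippi" | grep -E -o "(iss){2}")
ungrouped=$(echo "Mississippi" | grep -E -c "iss{2}" || true)
echo "$grouped $ungrouped"
```
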
### Character Sets
For the next couple of examples we're going to need some text data beyond the
names of the states. Let's just create a short text file from the console:
```{r, engine='bash', eval=FALSE}
touch small.txt
echo "abcdefghijklmnopqrstuvwxyz" >> small.txt
echo "ABCDEFGHIJKLMNOPQRSTUVWXYZ" >> small.txt
echo "0123456789" >> small.txt
echo "aa bb cc" >> small.txt
echo "rhythms" >> small.txt
echo "xyz" >> small.txt
echo "abc" >> small.txt
echo "tragedy + time = humor" >> small.txt
echo "http://www.jhsph.edu/" >> small.txt
echo "#%&-=***=-&%#" >> small.txt
```
In addition to quantifiers there are also regular expressions for describing
sets of characters. The `\w` metacharacter corresponds to all "word" characters,
the `\d` metacharacter corresponds to all "number" characters, and the `\s`
metacharacter corresponds to all "space" characters. Let's take a look at using
each of these metacharacters on small.txt:
```{r, engine='bash', eval=FALSE}
egrep "\w" small.txt
```
```
## abcdefghijklmnopqrstuvwxyz
## ABCDEFGHIJKLMNOPQRSTUVWXYZ
## 0123456789
## aa bb cc
## rhythms
## xyz
## abc
## tragedy + time = humor
## http://www.jhsph.edu/
```
```{r, engine='bash', eval=FALSE}
egrep "\d" small.txt
```
```
## 0123456789
```
```{r, engine='bash', eval=FALSE}
egrep "\s" small.txt
```
```
## aa bb cc
## tragedy + time = humor
```
As you can see in the example above, the `\w` metacharacter matches all letters,
numbers, and even the underscore character (`_`). We can see the complement of
this grep by adding the `-v` flag to the command:
```{r, engine='bash', eval=FALSE}
egrep -v "\w" small.txt
```
```
## #%&-=***=-&%#
```
The `-v` flag (which stands for in**v**ert match) makes `grep` return all of the
lines not matched by the regular expression. Note that the character sets for
regular expressions also have their inverse sets: `\W` for non-words, `\D` for
non-digits, and `\S` for non-spaces. Let's take a look at using `\W`:
```{r, engine='bash', eval=FALSE}
egrep "\W" small.txt
```
```
## aa bb cc
## tragedy + time = humor
## http://www.jhsph.edu/
## #%&-=***=-&%#
```
The returned strings all contain non-word characters. Note the difference between
the results of using the invert flag `-v` versus using an inverse set regular
expression.
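
Support for `\w`, `\d`, and `\s` varies between versions of `grep`. If one of them doesn't behave as shown on your system, the POSIX bracket classes are a portable alternative. The sketch below assumes a cut-down stand-in for `small.txt` called `tiny.txt`, and uses `[[:alnum:]_]`, `[[:digit:]]`, and `[[:space:]]` in place of `\w`, `\d`, and `\s`. It also makes the difference between `-v` and a negated set concrete.

```bash
# A four-line stand-in for small.txt in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' "abc" "0123456789" "aa bb cc" "#%&-=***=-&%#" > tiny.txt

grep -E "[[:digit:]]" tiny.txt   # like \d: lines containing a digit
grep -E "[[:space:]]" tiny.txt   # like \s: lines containing whitespace

# -v inverts the whole match: lines with NO word character at all.
no_word_chars=$(grep -E -v -c "[[:alnum:]_]" tiny.txt)

# A negated set (like \W) asks for lines with AT LEAST ONE non-word character.
some_non_word=$(grep -E -c "[^[:alnum:]_]" tiny.txt)

echo "$no_word_chars $some_non_word"
```

Only the line of symbols has no word characters, while both that line and `aa bb cc` contain at least one non-word character, so the two counts differ.
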
In addition to general character sets we can also create specific character
sets using square brackets (`[ ]`) and then including the characters we wish to
match in the square brackets. For example the regular expression for the set
of vowels is `[aeiou]`. You can also create a regular expression for the
complement of a set by including a caret (`^`) in the beginning of a set. For
example the regular expression `[^aeiou]` matches all characters that are not
vowels. Let's test both on small.txt:
```{r, engine='bash', eval=FALSE}
egrep "[aeiou]" small.txt
```
```
## abcdefghijklmnopqrstuvwxyz
## aa bb cc
## abc
## tragedy + time = humor
## http://www.jhsph.edu/
```
Notice that the word "rhythms" does not appear in the result (it's the longest
word without any vowels that I could think of).
```{r, engine='bash', eval=FALSE}
egrep "[^aeiou]" small.txt
```
```
## abcdefghijklmnopqrstuvwxyz
## ABCDEFGHIJKLMNOPQRSTUVWXYZ
## 0123456789
## aa bb cc
## rhythms
## xyz
## abc
## tragedy + time = humor
## http://www.jhsph.edu/
## #%&-=***=-&%#
```
Every line in the file is printed, because every line contains at least one
non-vowel! If you want to specify a range of characters you can use a hyphen
(`-`) inside of the square brackets. For example the regular expression `[e-q]`
matches all of the lowercase letters between "e" and "q" in the alphabet
inclusively. Case matters when you're specifying character sets, so if you
wanted to only match uppercase characters you'd need to use `[E-Q]`. To ignore
the case of your match you could combine the character sets with the `[e-qE-Q]`
regex (short for regular expression), or you could use the `-i` flag with `grep`
to **i**gnore the case. Note that the `-i` flag will work for any provided regular
expression, not just character sets. Let's take a look at some examples using
the regular expressions that we just described:
```{r, engine='bash', eval=FALSE}
egrep "[e-q]" small.txt
```
```
## abcdefghijklmnopqrstuvwxyz
## rhythms
## tragedy + time = humor
## http://www.jhsph.edu/
```
```{r, engine='bash', eval=FALSE}
egrep "[E-Q]" small.txt
```
```
## ABCDEFGHIJKLMNOPQRSTUVWXYZ
```
```{r, engine='bash', eval=FALSE}
egrep "[e-qE-Q]" small.txt
```
```
## abcdefghijklmnopqrstuvwxyz
## ABCDEFGHIJKLMNOPQRSTUVWXYZ
## rhythms
## tragedy + time = humor
## http://www.jhsph.edu/
```
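
A quick way to convince yourself that the combined set and the `-i` flag are interchangeable here is to count the matching lines both ways. This sketch uses a three-line stand-in file. Setting `LC_ALL=C` is an extra precaution that isn't part of the examples above: in some locales a letter range can sort uppercase and lowercase letters together, which changes what `[e-q]` matches.

```bash
# Three stand-in lines in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' "rhythms" "XYZ" "HELLO" > tiny.txt

# Count matching lines for each version of the pattern.
lower_only=$(LC_ALL=C grep -E -c "[e-q]" tiny.txt)       # rhythms
combined=$(LC_ALL=C grep -E -c "[e-qE-Q]" tiny.txt)      # rhythms, HELLO
ignore_case=$(LC_ALL=C grep -E -i -c "[e-q]" tiny.txt)   # rhythms, HELLO

echo "$lower_only $combined $ignore_case"
```
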
### Escaping, Anchors, Odds, and Ends
One issue you may have thought about during our little exploration of regular
expressions is how to search for certain punctuation marks in text considering
that those same symbols are used as metacharacters! For example, how would you
find a plus sign (`+`) in a line of text since the plus sign is **also** a
metacharacter? The answer is simply using a backslash (`\`) before the plus sign
in a regex, in order to "escape" the metacharacter functionality. Here are a few
examples:
```{r, engine='bash', eval=FALSE}
egrep "\+" small.txt
```
```
## tragedy + time = humor
```
```{r, engine='bash', eval=FALSE}
egrep "\." small.txt
```
```
## http://www.jhsph.edu/
```
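
Two other ways to search for a literal symbol are worth knowing. Inside square brackets most metacharacters lose their special meaning, and the `-F` flag tells `grep` to treat the entire pattern as a fixed string rather than a regular expression. A minimal sketch with a three-line stand-in file:

```bash
# Three stand-in lines in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' "tragedy + time = humor" "http://www.jhsph.edu/" "abc" > tiny.txt

grep -E "\+" tiny.txt    # backslash escape
grep -E "[+]" tiny.txt   # a plus inside a character set is literal
grep -F "+" tiny.txt     # -F: the whole pattern is a literal string

# An unescaped period matches any character, so every line matches.
# The escaped period matches only the line that contains a real period.
any_char=$(grep -E -c "." tiny.txt)
real_dot=$(grep -E -c "\." tiny.txt)
echo "$any_char $real_dot"
```
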
There are three more metacharacters that we should discuss, and two of them come
as a pair: the caret (`^`), which represents the start of a line, and the dollar
sign (`$`), which represents the end of a line. These "anchor characters" only
match the beginning and ends of lines when coupled with other regular
expressions. For example, going back to looking at states.txt, I could search
for all of the state names that begin with "M" with the following command:
```{r, engine='bash', eval=FALSE}
egrep "^M" states.txt
```
```
## Maine
## Maryland
## Massachusetts
## Michigan
## Minnesota
## Mississippi
## Missouri
## Montana
```
Or we could search for all of the states that end in "s":
```{r, engine='bash', eval=FALSE}
egrep "s$" states.txt
```
```
## Arkansas
## Illinois
## Kansas
## Massachusetts
## Texas
```
There's a mnemonic that I love for remembering which metacharacter to use for
each anchor: "First you get the **power**, then you get the **money**." The
caret character is used for exponentiation in many programming languages, so
"power" (`^`) is used for the beginning of a line and "money" (`$`) is used for
the end of a line.
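
Using both anchors in one pattern pins the regular expression to the entire line. As a sketch, the pattern `^.{4}$` below matches only lines that are exactly four characters long. The stand-in file holds a handful of state names.

```bash
# Five stand-in state names in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' "Iowa" "Maine" "Ohio" "Texas" "Utah" > sample.txt

# Start of line, exactly four of any character, end of line.
grep -E "^.{4}$" sample.txt

# The same matches joined onto one line, plus a count of names ending in a.
four_letter=$(grep -E "^.{4}$" sample.txt | tr '\n' ' ')
ends_in_a=$(grep -E -c "a$" sample.txt)
echo "$four_letter$ends_in_a"
```
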
Finally, let's talk about the "or" metacharacter (`|`), which is also called the
"pipe" character. This metacharacter allows you to match either the regex on
the right or on the left side of the pipe. Let's take a look at a small example:
```{r, engine='bash', eval=FALSE}
egrep "North|South" states.txt
```
```
## North Carolina
## North Dakota
## South Carolina
## South Dakota
```
In the example above we're searching for lines of text that contain the words
"North" or "South". You can also use multiple pipe characters to, for example,
search for lines that contain the words for all of the cardinal directions:
```{r, engine='bash', eval=FALSE}
egrep "North|South|East|West" states.txt
```
```
## North Carolina
## North Dakota
## South Carolina
## South Dakota
## West Virginia
```
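
Alternation can be combined with parentheses and anchors. In the sketch below the group limits the `|` to the two words inside the parentheses, so the pattern reads: a line that starts with either "North" or "South", followed by a space. The stand-in file includes "Northern Mariana Islands", a territory name added only to make the difference visible.

```bash
# Four stand-in lines in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' "North Dakota" "Northern Mariana Islands" "South Carolina" "West Virginia" > sample.txt

grep -E "^(North|South) " sample.txt

grouped=$(grep -E -c "^(North|South) " sample.txt)

# Without the parentheses the anchor and the space each attach to one side:
# the pattern means "^North" OR "South ".
ungrouped=$(grep -E -c "^North|South " sample.txt)

echo "$grouped $ungrouped"
```
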
Just two more notes on `grep`: you can display the line number that a match
occurs on using the `-n` flag:
```{r, engine='bash', eval=FALSE}
egrep -n "t$" states.txt
```
```
## 7:Connecticut
## 45:Vermont
```
And you can also `grep` multiple files at once by providing multiple file
arguments:
```{r, engine='bash', eval=FALSE}
egrep "New" states.txt canada.txt
```
```
## states.txt:New Hampshire
## states.txt:New Jersey
## states.txt:New Mexico
## states.txt:New York
## canada.txt:Newfoundland and Labrador
## canada.txt:New Brunswick
```
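
Two related flags are `-c`, which prints a count of matching lines for each file, and `-l`, which prints only the names of the files that contain a match. A minimal sketch with tiny stand-in files (their contents are abbreviated, and `other.txt` is made up for the example):

```bash
# Three small stand-in files in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' "Nevada" "New York" "Ohio" > states.txt
printf '%s\n' "Manitoba" "New Brunswick" "Ontario" > canada.txt
printf '%s\n' "Mexico City" > other.txt

grep -n "New" states.txt canada.txt             # file name, line number, matching line
grep -c "New" states.txt canada.txt             # one count per file
grep -l "New" states.txt canada.txt other.txt   # only the files with a match

# Join the multi-line results onto one line each.
numbered=$(grep -n "New" states.txt canada.txt | tr '\n' ' ')
files=$(grep -l "New" states.txt canada.txt other.txt | tr '\n' ' ')
```
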
You now have the power to do some pretty complicated string searching using
regular expressions! Imagine you wanted to search for all of the state names
that both begin and end with a vowel. Now you can:
```{r, engine='bash', eval=FALSE}
egrep "^[AEIOU]{1}.+[aeiou]{1}$" states.txt
```
```
## Alabama
## Alaska
## Arizona
## Idaho
## Indiana
## Iowa
## Ohio
## Oklahoma
```
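
It can help to build a pattern like this one piece at a time and check the result at each step. In the sketch below, note that `{1}` is optional, since a set with no quantifier already matches exactly one character. The stand-in file has five names.

```bash
# Five stand-in state names in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' "Alabama" "Arkansas" "Idaho" "Maine" "Ohio" > sample.txt

grep -E -c "^[AEIOU]" sample.txt          # how many start with a vowel
grep -E -c "[aeiou]$" sample.txt          # how many end with a vowel
grep -E "^[AEIOU].+[aeiou]$" sample.txt   # both, with anything in between

# The pattern with and without {1} selects the same lines.
with_count=$(grep -E "^[AEIOU]{1}.+[aeiou]{1}$" sample.txt | tr '\n' ' ')
without_count=$(grep -E "^[AEIOU].+[aeiou]$" sample.txt | tr '\n' ' ')
```
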
I know there are many metacharacters to keep track of here so below I've included
a table with several of the metacharacters we've discussed in this chapter:
| Metacharacter | Meaning                              |
|--------------:|:-------------------------------------|
| .             | Any Character                        |
| \\w           | A Word Character                     |
| \\W           | Not a Word Character                 |
| \\d           | A Digit                              |
| \\D           | Not a Digit                          |
| \\s           | Whitespace                           |
| \\S           | Not Whitespace                       |
| [def]         | A Set of Characters                  |
| [^def]        | Negation of Set                      |
| [e-q]         | A Range of Characters                |
| ^             | Beginning of String                  |
| $             | End of String                        |
| \\n           | Newline                              |
| +             | One or More of Previous              |
| *             | Zero or More of Previous             |
| ?             | Zero or One of Previous              |
| \|            | Either the Previous or the Following |
| {6}           | Exactly 6 of Previous                |
| {4,6}         | Between 4 and 6 of Previous          |
| {4,}          | 4 or More of Previous                |
If you want to experiment with writing regular expressions before you use them
I highly recommend playing around with http://regexr.com/.
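
The `?` quantifier from the table hasn't appeared in an example yet. As a minimal sketch, the pattern `colou?r` makes the "u" optional, so it matches both spellings of the word. The word list here is made up for illustration.

```bash
# Four made-up words in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' "color" "colour" "colouur" "collar" > words.txt

# The u may appear zero times or one time, but not twice.
grep -E "^colou?r$" words.txt

optional_u=$(grep -E "^colou?r$" words.txt | tr '\n' ' ')

# {4,} from the table: lines containing four or more digits in a row.
long_number=$(printf '%s\n' "12" "12345" | grep -E "[0-9]{4,}")
echo "$optional_u$long_number"
```
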
### `find`
If you want to find the location of a file or the location of a group of files
you can use the `find` command. This command has a specific structure where
the first argument is the directory where you want to begin the search, and all
directories contained within that directory will also be searched. The first
argument is then followed by a flag that describes the method you want to use to
search. In this case we'll only be searching for a file by its name, so we'll
use the `-name` flag. The `-name` flag itself then takes an argument, the name
of the file that you're looking for. Let's go back to the home directory and
look for some files from there:
```{r, engine='bash', eval=FALSE}
cd
pwd
```
```
## /Users/sean
```
Let's start by looking for a file called states.txt:
```{r, engine='bash', eval=FALSE}
find . -name "states.txt"
```
```
## ./Documents/states.txt
```
Right where we expected it to be! Now let's try searching for all `.jpg` files:
```{r, engine='bash', eval=FALSE}
find . -name "*.jpg"
```
```
## ./Photos/2016/2016-06-21-lab01.jpg
## ./Photos/2016/2016-06-21-lab02.jpg
## ./Photos/2017/2017-01-02-hiking01.jpg
## ./Photos/2017/2017-01-02-hiking02.jpg
## ./Photos/2017/2017-02-10-hiking01.jpg
## ./Photos/2017/2017-02-10-hiking02.jpg
```
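
The quotation marks around `"*.jpg"` matter: they stop the shell from expanding the star itself, so that `find` receives the pattern and applies it in every directory it visits. The sketch below sets up a small made-up directory tree to demonstrate this, and adds `-type f`, a test that restricts the results to regular files.

```bash
# Build a small made-up directory tree in a throwaway directory.
cd "$(mktemp -d)"
mkdir -p Photos/2016 Photos/2017 Documents
touch Photos/2016/lab01.jpg Photos/2017/hiking01.jpg Photos/2017/hiking02.jpg
touch Photos/2016/datasci01.png Documents/states.txt

find . -name "*.jpg"                     # every .jpg below the current directory
find Photos/2017 -type f -name "*.jpg"   # start the search somewhere else

# Count the .jpg files and record where states.txt lives.
jpg_count=$(find . -type f -name "*.jpg" | wc -l | tr -d ' ')
states_path=$(find . -name "states.txt")
echo "$jpg_count $states_path"
```
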
Good file hunting out there!
### Summary
- `grep` and `egrep` can be used along with regular expressions to search for
patterns of text in a file.
- Metacharacters are used in regular expressions to describe patterns of
characters.
- `find` can be used to search for the names of files in a directory.
### Exercises
1. Search `states.txt` and `canada.txt` for lines that contain the word "New".
2. Make five text files, one for each of the five vowels, where each file lists
the names of the states that don't contain that vowel.
3. Download the GitHub repository for this book and find out how many `.html`
files it contains.
## Configure
### History
Near the start of this book we discussed how you can browse the commands
that you recently entered into the prompt using the `Up` and `Down` arrow keys.
Bash keeps track of all of your recent commands, and you can browse your command
history in two different ways. The commands that we've used since opening our
terminal can be accessed via the `history` command. Let's try it out:
```{r, engine='bash', eval=FALSE}
history
```
```
## ...
## 48 egrep "^M" states.txt
## 49 egrep "s$" states.txt
## 50 egrep "North|South" states.txt
## 51 egrep "North|South|East|West" states.txt
## 52 egrep -n "t$" states.txt
## 53 egrep "New" states.txt canada.txt
## 54 egrep "^[AEIOU]{1}.+[aeiou]{1}$" states.txt
## 55 cd
## 56 pwd
## 57 find . -name "states.txt"
## 58 find . -name "*.jpg"
## 59 history
```
We've had our terminal open for a while so there are tons of commands in our
history! Whenever we close a terminal our recent commands are written to the
`~/.bash_history` file. Let's take a look at the beginning of this file:
```{r, engine='bash', eval=FALSE}
head -n 5 ~/.bash_history
```
```
## echo "Hello World!"