---
layout: page
title: Programming with R
subtitle: Command-Line Programs
minutes: 30
---
```{r, include = FALSE}
source("tools/chunk-options.R")
```
> ## Learning Objectives {.objectives}
>
> * Use the values of command-line arguments in a program.
> * Handle flags and files separately in a command-line program.
> * Read data from standard input in a program so that it can be used in a pipeline.
The R Console and other interactive tools like RStudio are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files.
In order to do that, we need to make our programs work like other Unix command-line tools.
For example, we may want a program that reads a data set and prints the average inflammation per patient:
~~~
$ Rscript readings.R --mean data/inflammation-01.csv
5.45
5.425
6.1
...
6.4
7.05
5.9
~~~
but we might also want to look at the minimum of the first four lines:
~~~
$ head -4 data/inflammation-01.csv | Rscript readings.R --min
~~~
or the maximum inflammations in several files one after another:
~~~
$ Rscript readings.R --max data/inflammation-*.csv
~~~
Our overall requirements are:
1. If no filename is given on the command line, read data from [standard input](reference.html#standard-input-(stdin)).
2. If one or more filenames are given, read data from them and report statistics for each file separately.
3. Use the `--min`, `--mean`, or `--max` flag to determine what statistic to print.
To make this work, we need to know how to handle command-line arguments in a program, and how to get at standard input.
We'll tackle these questions in turn below.
### Command-Line Arguments
Using the text editor of your choice, save the following line of code in a text file called `session-info.R`:
```{r echo=FALSE, engine='bash'}
cat session-info.R
```
The function `sessionInfo` outputs the version of R you are running, the type of computer you are using, and the versions of the packages that have been loaded.
This is very useful information to include when asking others for help with your R code.
Now we can run the code in the file we created from the Unix Shell using `Rscript`:
```{r engine='bash'}
Rscript session-info.R
```
> ## Tip {.callout}
>
> If that did not work, remember that you must be in the correct directory.
> You can determine which directory you are currently in using `pwd` and change to a different directory using `cd`.
> For a review, see this [lesson](https://swcarpentry.github.io/shell-novice/01-filedir.html).
Now let's create another script that does something more interesting. Write the following lines in a file named `print-args.R`:
```{r echo=FALSE, engine='bash'}
cat print-args.R
```
The function `commandArgs` extracts all the command line arguments and returns them as a vector.
The function `cat`, similar to the `cat` of the Unix Shell, outputs the contents of the variable.
Since we did not specify a filename for writing, `cat` sends the output to [standard output](reference.html#standard-output-(stdout)), which we can then pipe to other Unix commands.
Because we set the argument `sep` to `"\n"`, which is the symbol to start a new line, each element of the vector is printed on its own line.
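To see the effect of `sep` in isolation, here is a small sketch that can be pasted into the R console. The vector `example_args` is a made-up stand-in for the value returned by `commandArgs`, not real command-line input.

```r
# Stand-in for the vector returned by commandArgs(); these values are made up.
example_args <- c("first", "second", "third")

# With the default separator (a single space), all elements share one line.
cat(example_args)
cat("\n")

# With sep = "\n", each element is written on its own line.
cat(example_args, sep = "\n")
```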
Let's see what happens when we run this program in the Unix Shell:
```{r engine='bash'}
Rscript print-args.R
```
From this output, we learn that `Rscript` is just a convenience command for running R scripts.
The first argument in the vector is the path to the `R` executable.
The following are all command-line arguments that affect the behavior of R.
From the R help file:
* `--slave`: Make R run as quietly as possible
* `--no-restore`: Don't restore anything that was created during the R session
* `--file`: Run this file
* `--args`: Pass these arguments to the file being run
Thus running a file with Rscript is an easier way to run the following:
```{r engine='bash'}
R --slave --no-restore --file=print-args.R --args
```
If we run it with a few arguments, however:
```{r engine='bash'}
Rscript print-args.R first second third
```
then `commandArgs` adds each of those arguments to the vector it returns.
Since the first elements of the vector are always the same, we can tell `commandArgs` to only return the arguments that come after `--args`.
Let's update `print-args.R` and save it as `print-args-trailing.R`:
```{r echo=FALSE, engine='bash'}
cat print-args-trailing.R
```
And then run `print-args-trailing.R` from the Unix Shell:
```{r engine='bash'}
Rscript print-args-trailing.R first second third
```
Now `commandArgs` returns only the arguments that we listed after `print-args-trailing.R`.
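One way to picture what `trailingOnly = TRUE` does is to mimic it by hand. The sketch below is only an illustration: `full_args` is a hypothetical argument vector (the executable path is invented and will differ on your system), and `trailing_args` is a helper written for this example, not a function provided by R.

```r
# Hypothetical full argument vector, similar to what commandArgs() might return.
# The executable path is made up and differs between systems.
full_args <- c("/usr/lib/R/bin/exec/R", "--slave", "--no-restore",
               "--file=print-args-trailing.R", "--args",
               "first", "second", "third")

# Illustrative helper (not part of R): keep only what follows "--args".
trailing_args <- function(all_args) {
  marker <- match("--args", all_args)
  if (is.na(marker)) {
    # No "--args" marker means no trailing arguments were supplied.
    return(character(0))
  }
  all_args[-seq_len(marker)]
}

print(trailing_args(full_args))
```

In a real script there is no need for such a helper; `commandArgs(trailingOnly = TRUE)` does the work.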
With this in hand, let's build a version of `readings.R` that always prints the per-patient (per-row) mean of a single data file.
The first step is to write a function that outlines our implementation, and a placeholder for the function that does the actual work.
By convention this function is usually called `main`, though we can call it whatever we want.
Write the following code in a file called `readings-01.R`:
```{r echo=FALSE, engine='bash'}
cat readings-01.R
```
This function gets the name of the file to process from the first element returned by `commandArgs`.
Here's a simple test to run from the Unix Shell:
```{r engine='bash'}
Rscript readings-01.R data/inflammation-01.csv
```
There is no output because we have defined a function, but haven't actually called it.
Let's add a call to `main` and save it as `readings-02.R`:
```{r echo=FALSE, engine='bash'}
cat readings-02.R
```
```{r engine='bash'}
Rscript readings-02.R data/inflammation-01.csv
```
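The pattern used above, a `main` function followed by a call to it, can be sketched as follows. This is a minimal illustration and not necessarily identical to `readings-02.R`; the `args` parameter and the temporary demo file are assumptions added so the function can be tried from the R console without `Rscript`.

```r
# Sketch only: `args` is a parameter (an assumption of this example) so that
# the function can be called directly; under Rscript the default is used.
main <- function(args = commandArgs(trailingOnly = TRUE)) {
  filename <- args[1]
  dat <- read.csv(file = filename, header = FALSE)
  # One mean per row, i.e. per patient.
  mean_per_patient <- apply(dat, 1, mean)
  cat(mean_per_patient, sep = "\n")
}

# Made-up data: two patients, three days each.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("0,1,2", "2,4,6"), demo_file)

# Defining main() prints nothing; the output appears only once it is called.
main(demo_file)
```

In a script meant for `Rscript`, the last line would simply be `main()`.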
> ## Challenge - A simple command line program {.challenge}
>
> + Write a command-line program that does addition and subtraction.
> **Hint:** Every argument read from the command line is interpreted as a character [string](reference.html#string).
> You can convert from a string to a number using the function `as.numeric`.
> ```{r, engine='bash'}
> Rscript arith.R 1 + 2
> ```
>
> ```{r, engine='bash'}
> Rscript arith.R 3 - 4
> ```
>
> + What goes wrong if you try to add multiplication using `*` to the program?
>
> <!-- second-answer -->
> <!-- The * is a wildcard character in the Unix Shell. -->
> <!-- Thus all the files in the current working directory are included as arguments to arith.R. -->
>
> + Using the function `list.files` introduced in a previous [lesson](03-loops-R.html), write a command-line program, `find-pattern.R`, that lists all the files in the current directory that contain a specific pattern:
>
> ```{r, engine='bash'}
> # For example, searching for the pattern "print-args" returns the two scripts we
> # wrote earlier
> Rscript find-pattern.R print-args
> ```
>
> ```{r third-answer, include=FALSE, engine='bash'}
> cat find-pattern.R
> ```
### Handling Multiple Files
The next step is to teach our program how to handle multiple files.
Since 60 lines of output per file is a lot to page through, we'll start by using three smaller files, each of which has three days of data for two patients.
Let's investigate them from the Unix Shell:
```{r, engine='bash'}
ls data/small-*.csv
```
```{r, engine='bash'}
cat data/small-01.csv
```
```{r, engine='bash'}
Rscript readings-02.R data/small-01.csv
```
Using small data files as input also allows us to check our results more easily: here, for example, we can see that our program is calculating the mean correctly for each line, whereas we were really taking it on faith before.
This is yet another rule of programming: test the simple things first.
We want our program to process each file separately, so we need a loop that executes once for each filename.
If we specify the files on the command line, the filenames will be returned by `commandArgs(trailingOnly = TRUE)`.
We'll need to handle an unknown number of filenames, since our program could be run for any number of files.
The solution is to loop over the vector returned by `commandArgs(trailingOnly = TRUE)`.
Here's our changed program, which we'll save as `readings-03.R`:
```{r echo=FALSE, engine='bash'}
cat readings-03.R
```
and here it is in action:
```{r, engine='bash'}
Rscript readings-03.R data/small-01.csv data/small-02.csv
```
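As a rough sketch of the looping idea (not necessarily the exact contents of `readings-03.R`), the version below iterates over whatever filenames it receives. The `args` parameter and the two temporary files are assumptions made for this example. It prints with `writeLines`, which ends every line with a newline, so results from consecutive files never share a line.

```r
# Sketch only: loop over every filename given on the command line.
main <- function(args = commandArgs(trailingOnly = TRUE)) {
  for (filename in args) {
    dat <- read.csv(file = filename, header = FALSE)
    mean_per_patient <- apply(dat, 1, mean)
    # writeLines() terminates each value with a newline.
    writeLines(as.character(mean_per_patient))
  }
}

# Made-up data files: two patients, three days each.
demo_files <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
writeLines(c("0,1,2", "2,4,6"), demo_files[1])
writeLines(c("3,3,3", "5,6,7"), demo_files[2])

main(demo_files)
```

Because the loop runs once per element of `args`, the same code handles one file, many files, or none at all.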
**Note**: at this point, we have created three versions of our script called `readings-01.R`, `readings-02.R`, and `readings-03.R`.
We wouldn't do this in real life: instead, we would have one file called `readings.R` that we committed to version control every time we got an enhancement working.
For teaching, though, we need all the successive versions side by side.
> ## Challenge - A command line program with arguments {.challenge}
>
> + Write a program called `check.R` that takes the names of one or more inflammation data files as arguments and checks that all the files have the same number of rows and columns.
> What is the best way to test your program?
```{r fourth-answer, include=FALSE, engine='bash'}
cat check.R
```
```{r testing, include=FALSE, engine='bash'}
# For testing that it works.
# Should pass check:
Rscript check.R data/small-0*
Rscript check.R data/inflammation-*
# Should fail check:
Rscript check.R data/small-0* data/inflammation-*
```
### Handling Command-Line Flags
The next step is to teach our program to pay attention to the `--min`, `--mean`, and `--max` flags.
These always appear before the names of the files, so let's save the following in `readings-04.R`:
```{r echo=FALSE, engine='bash'}
cat readings-04.R
```
And we can confirm this works by running it from the Unix Shell:
```{r engine='bash'}
Rscript readings-04.R --max data/small-01.csv
```
but there are two things wrong with it:
1. `main` is too large to read comfortably.
2. If `action` isn't one of the three recognized flags, the program loads each file but does nothing with it (because none of the branches in the conditional match).
[Silent failures](reference.html#silent-failure) like this are always hard to debug.
This version pulls the processing of each file out of the loop into a function of its own.
It also checks that `action` is one of the allowed flags before doing any processing, so that the program fails fast. We'll save it as `readings-05.R`:
```{r echo=FALSE, engine='bash'}
cat readings-05.R
```
This is four lines longer than its predecessor, but broken into more digestible chunks of 8 and 12 lines.
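The two ideas in this refactoring, a separate per-file function and an early check of the flag, can be sketched like this. It is an illustration under stated assumptions rather than a copy of `readings-05.R`: the `args` parameter, the use of `switch`, and the temporary demo file are choices made for this example.

```r
# Sketch only: handle a single file for an already-validated action.
process <- function(filename, action) {
  dat <- read.csv(file = filename, header = FALSE)
  values <- switch(action,
                   "--min" = apply(dat, 1, min),
                   "--mean" = apply(dat, 1, mean),
                   "--max" = apply(dat, 1, max))
  writeLines(as.character(values))
}

main <- function(args = commandArgs(trailingOnly = TRUE)) {
  action <- args[1]
  filenames <- args[-1]
  # Fail fast: stop before reading any file if the flag is not recognized.
  stopifnot(action %in% c("--min", "--mean", "--max"))
  for (f in filenames) {
    process(f, action)
  }
}

# Made-up data: two patients, three days each.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("0,1,2", "2,4,6"), demo_file)

main(c("--max", demo_file))
```

Calling `main(c("--median", demo_file))` stops with an error instead of silently doing nothing.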
> ## Tip {.callout}
>
> R has a package named [argparse][argparse-r] that helps handle complex command-line flags (it utilizes a [Python module][argparse-py] of the same name).
> We will not cover this package in this lesson but when you start writing programs with multiple parameters you'll want to read through the package's [vignette][].
[argparse-r]: http://cran.r-project.org/web/packages/argparse/index.html
[argparse-py]: http://docs.python.org/dev/library/argparse.html
[vignette]: http://cran.r-project.org/web/packages/argparse/vignettes/argparse.pdf
> ## Challenge - Shorter command line arguments {.challenge}
>
> + Rewrite this program so that it uses `-n`, `-m`, and `-x` instead of `--min`, `--mean`, and `--max` respectively.
> Is the code easier to read?
> Is the program easier to understand?
>
> + Separately, modify the program so that if no action is specified (or an incorrect action is given), it prints a message explaining how it should be used.
```{r usage-answer, include=FALSE, engine='bash'}
cat readings-usage.R
```
### Handling Standard Input
The next thing our program has to do is read data from standard input if no filenames are given so that we can put it in a pipeline, redirect input to it, and so on.
Let's experiment in another script, which we'll save as `count-stdin.R`:
```{r echo=FALSE, engine='bash'}
cat count-stdin.R
```
This little program reads lines from the program's standard input using `file("stdin")`.
Because `file("stdin")` is a connection, we can do almost anything with it that we could do with a regular file.
In this example, we passed it as an argument to the function `readLines`, which stores each line as an element in a vector.
Let's try running it from the Unix Shell as if it were a regular command-line program:
```{r engine='bash'}
Rscript count-stdin.R < data/small-01.csv
```
Note that because we did not specify `sep = "\n"` when calling `cat`, the output is written on the same line.
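`readLines` accepts any connection, so the counting logic can be tried without a pipeline. The following is a minimal sketch: `count_lines` is a hypothetical helper written for this example (it is not one of the lesson files), and a `textConnection` stands in for standard input.

```r
# Hypothetical helper: count the lines available on a connection.
count_lines <- function(con) {
  lines <- readLines(con)
  length(lines)
}

# A text connection stands in for standard input in this demo.
demo_input <- textConnection(c("0,0,1", "0,1,2"))
n <- count_lines(demo_input)
close(demo_input)

cat("lines in demo input:", n, "\n")

# In a script run through a pipeline, the call would instead be:
# count_lines(file("stdin"))
```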
A common mistake is to try to run something that reads from standard input like this:
```{r eval=FALSE, engine='bash'}
Rscript count-stdin.R data/small-01.csv
```
i.e., to forget the `<` character that redirects the file to standard input.
In this case, there's nothing in standard input, so the program waits for someone to type something on the keyboard.
We can type some input, but R keeps running because it doesn't know when the standard input has ended.
If you ran this, you can suspend R by typing `ctrl`+`z` (technically the process is only stopped, not terminated; if you want to fully kill the process follow these [instructions][ps-kill]).
[ps-kill]: http://linux.about.com/library/cmd/blcmdl_kill.htm
We now need to rewrite the program so that it loads data from `file("stdin")` if no filenames are provided.
Luckily, `read.csv` can handle either a filename or a connection such as `file("stdin")` as its first parameter, so we don't actually need to change `process`.
That leaves `main`, which we'll update and save as `readings-06.R`:
```{r echo=FALSE, engine='bash'}
cat readings-06.R
```
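The fallback to standard input can be sketched as below. This is an illustration, not necessarily identical to `readings-06.R`: the `args` and `input` parameters are assumptions added so the function can be exercised from the R console, where a `textConnection` stands in for standard input. Under `Rscript` the defaults apply.

```r
# Sketch only: `source` may be a filename or a connection.
process <- function(source, action) {
  dat <- read.csv(file = source, header = FALSE)
  values <- switch(action,
                   "--min" = apply(dat, 1, min),
                   "--mean" = apply(dat, 1, mean),
                   "--max" = apply(dat, 1, max))
  writeLines(as.character(values))
}

main <- function(args = commandArgs(trailingOnly = TRUE),
                 input = file("stdin")) {
  action <- args[1]
  filenames <- args[-1]
  stopifnot(action %in% c("--min", "--mean", "--max"))
  if (length(filenames) == 0) {
    # No filenames given: read the data from standard input instead.
    process(input, action)
  } else {
    for (f in filenames) {
      process(f, action)
    }
  }
}

# Demo: made-up data supplied through a text connection.
demo_input <- textConnection(c("1,2,3", "4,5,6"))
main("--mean", input = demo_input)
close(demo_input)
```

Because R evaluates default arguments lazily, `file("stdin")` is only created when no filenames are supplied.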
Let's try it out.
Instead of calculating the mean inflammation of every patient, we'll only calculate the mean for the first 10 patients (rows):
```{r engine='bash'}
head data/inflammation-01.csv | Rscript readings-06.R --mean
```
And now we're done: the program does everything we set out to do.
> ## Challenge - Implementing wc in R {.challenge}
>
> + Write a program called `line-count.R` that works like the Unix `wc` command:
>     * If no filenames are given, it reports the number of lines in standard input.
>     * If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.
```{r line-count-answer, include=FALSE, engine='bash'}
cat line-count.R
```
> ## Key Points {.callout}
>
> * Use `commandArgs(trailingOnly = TRUE)` to obtain a vector of the command-line arguments that a program was run with.
> * Avoid silent failures.
> * Use `file("stdin")` to connect to a program's standard input.
> * Use `cat(vec, sep = "\n")` to write the elements of `vec` to standard output, one per line.