forked from swcarpentry/r-novice-inflammation
-
Notifications
You must be signed in to change notification settings - Fork 0
/
01-supp-read-write-csv.html
269 lines (261 loc) · 25.6 KB
/
01-supp-read-write-csv.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Programming with R</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">Programming with R</h1></a>
<h2 class="subtitle">Reading and writing .csv</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Read in a .csv, and explore the arguments of the csv reader</li>
<li>Write the altered data set to a new .csv, and explore the arguments</li>
</ul>
</div>
</section>
<p>The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), users often find it easier to save their spreadsheets in <a href="reference.html#comma-separated-values-(csv)">comma-separated values</a> files (CSV) and then use R’s built in functionality to read and manipulate the data. In this short lesson, we’ll learn how to read data from a .csv and write to a new .csv, and explore the <a href="reference.html#argument">arguments</a> that allow you read and write the data correctly for your needs.</p>
<h3 id="read-a-.csv-and-explore-the-arguments">Read a .csv and Explore the Arguments</h3>
<p>Let’s start by opening a .csv file containing information on the speeds at which cars of different colors were clocked in 45 mph zones in the four-corners states (<code>CarSpeeds.csv</code>). We will use the built in <code>read.csv(...)</code> <a href="reference.html#function-call">function call</a>, which reads the data in as a data frame, and assign the data frame to a variable (using <code><-</code>) so that it is stored in R’s memory. Then we will explore some of the basic arguments that can be supplied to the function.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co">#First, set a working directory (see lesson 'Analyzing Patient Data' for more info)</span>
<span class="kw">setwd</span>(<span class="st">'~/swc'</span>)</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co">#Import the data and look at the first six rows</span>
carSpeeds <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file=</span><span class="st">'data/car-speeds.csv'</span>)
<span class="kw">head</span>(carSpeeds) </code></pre>
<pre class="output"><code> Color Speed State
1 Blue 32 NewMexico
2 Red 45 Arizona
3 Blue 35 Colorado
4 White 34 Arizona
5 Red 25 Arizona
6 Blue 41 Arizona
</code></pre>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>The default delimiter of the <code>read.csv()</code> function is a comma, but you can use other delimiters by supplying the ‘sep’ argument to the function (e.g., typing <code>sep = ';'</code> allows a semi-colon separated file to be correctly imported -see <code>?read.csv()</code> for more information on this and other options for working with different file types).</p>
</div>
</aside>
<p>The call above will import the data, but we have not taken advantage of several handy arguments that can be helpful in loading the data in the format we want. Let’s explore some of these arguments.</p>
<h3 id="the-header-argument">The <code>header</code> Argument</h3>
<p>The default for <code>read.csv(...)</code> is to set the <code>header</code> argument to <code>TRUE</code>. This means that the first row of values in the .csv is set as header information (column names). If your data set does not have a header, set the <code>header</code> argument to <code>FALSE</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co">#The first row of the data without setting the header argument:</span>
carSpeeds[<span class="dv">1</span>,]</code></pre>
<pre class="output"><code> Color Speed State
1 Blue 32 NewMexico
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co">#The first row of the data if the header argument is set to FALSE:</span>
carSpeeds <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file=</span><span class="st">'data/car-speeds.csv'</span>, <span class="dt">header=</span><span class="ot">FALSE</span>)
carSpeeds[<span class="dv">1</span>,]</code></pre>
<pre class="output"><code> V1 V2 V3
1 Color Speed State
</code></pre>
<p>Clearly this is not the desired behavior for this data set, but it may be useful if you have a dataset without headers.</p>
<h3 id="the-stringsasfactors-argument">The <code>stringsAsFactors</code> Argument</h3>
<p>This is perhaps the most important argument in <code>read.csv()</code>, particularly if you are working with categorical data. This is because the default behavior of R is to convert character <a href="reference.html#string">string</a>s into factors, which may make it difficult to do such things as replace values. For example, let’s say we find out that the data collector was color blind, and accidentally recorded green cars as being blue. In order to correct the data set, let’s replace ‘Blue’ with ‘Green’ in the <code>$Color</code> column:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co">#Here we will use R's ifelse function, in which we provide the test phrase, the outcome if the </span>
<span class="co"># result of the test is 'TRUE', and the outcome if the result is 'FALSE'. We will also assign </span>
<span class="co"># the results to the Color column, using '<-'</span>
<span class="co">#First - reload the data with a header</span>
carSpeeds <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file=</span><span class="st">'data/car-speeds.csv'</span>)
carSpeeds$Color<-<span class="st"> </span><span class="kw">ifelse</span>(carSpeeds$Color==<span class="st">'Blue'</span>, <span class="st">'Green'</span>, carSpeeds$Color)
carSpeeds$Color</code></pre>
<pre class="output"><code> [1] "Green" "4" "Green" "5" "3" "Green" "Green" "1"
[9] "5" "3" "3" "5" "Green" "Green" "1" "3"
[17] "Green" "Green" "5" "Green" "Green" "Green" "3" "Green"
[25] "3" "3" "3" "3" "5" "Green" "3" "5"
[33] "1" "3" "1" "1" "Green" "3" "1" "3"
[41] "1" "1" "3" "3" "5" "1" "Green" "3"
[49] "3" "1" "1" "3" "5" "3" "Green" "Green"
[57] "1" "Green" "5" "1" "3" "Green" "Green" "5"
[65] "1" "3" "3" "1" "Green" "5" "Green" "3"
[73] "5" "5" "Green" "Green" "Green" "Green" "Green" "5"
[81] "1" "Green" "5" "1" "1" "3" "3" "5"
[89] "5" "5" "5" "3" "3" "3" "5" "1"
[97] "5" "1" "1" "5"
</code></pre>
<p>What happened?!? It looks like ‘Blue’ was replaced with ‘Green’, but every other color was turned into a number (as a character string, given the quote marks before and after). This is because the colors of the cars were loaded as factors, and the factor level was reported following replacement.</p>
<p>Now, let’s load the dataset using <code>stringsAsFactors=FALSE</code>, and see what happens when we try to replace ‘Blue’ with ‘Green’ in the <code>$Color</code> column:</p>
<pre class="sourceCode r"><code class="sourceCode r">carSpeeds <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file=</span><span class="st">'data/car-speeds.csv'</span>, <span class="dt">stringsAsFactors=</span><span class="ot">FALSE</span>)
carSpeeds$Color<-<span class="st"> </span><span class="kw">ifelse</span>(carSpeeds$Color==<span class="st">'Blue'</span>, <span class="st">'Green'</span>, carSpeeds$Color)
carSpeeds$Color</code></pre>
<pre class="output"><code> [1] "Green" " Red" "Green" "White" "Red" "Green" "Green" "Black"
[9] "White" "Red" "Red" "White" "Green" "Green" "Black" "Red"
[17] "Green" "Green" "White" "Green" "Green" "Green" "Red" "Green"
[25] "Red" "Red" "Red" "Red" "White" "Green" "Red" "White"
[33] "Black" "Red" "Black" "Black" "Green" "Red" "Black" "Red"
[41] "Black" "Black" "Red" "Red" "White" "Black" "Green" "Red"
[49] "Red" "Black" "Black" "Red" "White" "Red" "Green" "Green"
[57] "Black" "Green" "White" "Black" "Red" "Green" "Green" "White"
[65] "Black" "Red" "Red" "Black" "Green" "White" "Green" "Red"
[73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White"
[81] "Black" "Green" "White" "Black" "Black" "Red" "Red" "White"
[89] "White" "White" "White" "Red" "Red" "Red" "White" "Black"
[97] "White" "Black" "Black" "White"
</code></pre>
<p>That’s better!</p>
<h3 id="the-as.is-argument">The <code>as.is</code> Argument</h3>
<p>This is an extension of the <code>stringsAsFactors</code> argument, but gives you control over individual columns. For example, if we want the colors of cars imported as strings, but we want the names of the states imported as factors, we would load the data set as:</p>
<pre class="sourceCode r"><code class="sourceCode r">carSpeeds <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file=</span><span class="st">'data/car-speeds.csv'</span>, <span class="dt">as.is =</span> <span class="dv">1</span>)
<span class="co">#Note, the 1 applies as.is to the first column only</span></code></pre>
<p>Now we can see that if we try to replace ‘Blue’ with ‘Green’ in the <code>$Color</code> column everything looks fine, while trying to replace ‘Arizona’ with ‘Ohio’ in the <code>$State</code> column returns the factor numbers for the names of states that we haven’t replaced:</p>
<pre class="sourceCode r"><code class="sourceCode r">carSpeeds$Color<-<span class="st"> </span><span class="kw">ifelse</span>(carSpeeds$Color==<span class="st">'Blue'</span>, <span class="st">'Green'</span>, carSpeeds$Color)
carSpeeds$Color</code></pre>
<pre class="output"><code> [1] "Green" " Red" "Green" "White" "Red" "Green" "Green" "Black"
[9] "White" "Red" "Red" "White" "Green" "Green" "Black" "Red"
[17] "Green" "Green" "White" "Green" "Green" "Green" "Red" "Green"
[25] "Red" "Red" "Red" "Red" "White" "Green" "Red" "White"
[33] "Black" "Red" "Black" "Black" "Green" "Red" "Black" "Red"
[41] "Black" "Black" "Red" "Red" "White" "Black" "Green" "Red"
[49] "Red" "Black" "Black" "Red" "White" "Red" "Green" "Green"
[57] "Black" "Green" "White" "Black" "Red" "Green" "Green" "White"
[65] "Black" "Red" "Red" "Black" "Green" "White" "Green" "Red"
[73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White"
[81] "Black" "Green" "White" "Black" "Black" "Red" "Red" "White"
[89] "White" "White" "White" "Red" "Red" "Red" "White" "Black"
[97] "White" "Black" "Black" "White"
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">carSpeeds$State<-<span class="st"> </span><span class="kw">ifelse</span>(carSpeeds$State==<span class="st">'Arizona'</span>,<span class="st">'Ohio'</span>,carSpeeds$State)
carSpeeds$State</code></pre>
<pre class="output"><code> [1] "3" "Ohio" "2" "Ohio" "Ohio" "Ohio" "3" "2" "Ohio" "2"
[11] "4" "4" "4" "4" "4" "3" "Ohio" "3" "Ohio" "4"
[21] "4" "4" "3" "2" "2" "3" "2" "4" "2" "4"
[31] "3" "2" "2" "4" "2" "2" "3" "Ohio" "4" "2"
[41] "2" "3" "Ohio" "4" "Ohio" "2" "3" "3" "3" "2"
[51] "Ohio" "4" "4" "Ohio" "3" "2" "4" "2" "4" "4"
[61] "4" "2" "3" "2" "3" "2" "3" "Ohio" "3" "4"
[71] "4" "2" "Ohio" "4" "2" "2" "2" "Ohio" "3" "Ohio"
[81] "4" "2" "2" "Ohio" "Ohio" "Ohio" "4" "Ohio" "4" "4"
[91] "4" "Ohio" "Ohio" "3" "2" "2" "4" "3" "Ohio" "4"
</code></pre>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Challenge - Updating values in a factor</h2>
</div>
<div class="panel-body">
<p>Suppose we want to keep the colors of cars as factors for some other operations we want to perform. Write code for replacing ‘Blue’ with ‘Green’ in the <code>$Color</code> column of the cars dataset without importing the data with <code>stringsAsFactors=FALSE</code>.</p>
</div>
</section>
<h3 id="the-strip.white-argument">The <code>strip.white</code> Argument</h3>
<p>It is not uncommon for mistakes to have been made when the data were recorded, for example a space (whitespace) may have been inserted before a data value. By default this whitespace will be kept in the R environment, such that ‘ Red’ will be recognized as a different value than ‘Red’. In order to avoid this type of error, use the <code>strip.white</code> argument. Let’s see how this works by checking for the unique values in the <code>$Color</code> column of our dataset:</p>
<p>Here, the data recorder added a space before the color of the car in one of the cells:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co">#We use the built in unique() function to extract the unique colors in our dataset</span>
<span class="kw">unique</span>(carSpeeds$Color)</code></pre>
<pre class="output"><code>[1] Green Red White Red Black
Levels: Black Green Red Red White
</code></pre>
<p>Oops, we see two values for red cars.</p>
<p>Let’s try again, this time importing the data using the <code>strip.white</code> argument. NOTE - this argument must be accompanied by the <code>sep</code> argument, by which we indicate the type of delimiter in the file (the comma for most .csv files)</p>
<pre class="sourceCode r"><code class="sourceCode r">carSpeeds<-<span class="kw">read.csv</span>(<span class="dt">file=</span><span class="st">'data/car-speeds.csv'</span>,<span class="dt">stringsAsFactors=</span><span class="ot">FALSE</span>,<span class="dt">strip.white=</span><span class="ot">TRUE</span>,<span class="dt">sep=</span><span class="st">','</span>)
<span class="kw">unique</span>(carSpeeds$Color)</code></pre>
<pre class="output"><code>[1] "Blue" "Red" "White" "Black"
</code></pre>
<p>That’s better!</p>
<h3 id="write-a-new-.csv-and-explore-the-arguments">Write a New .csv and Explore the Arguments</h3>
<p>After altering our cars dataset by replacing ‘Blue’ with ‘Green’ in the <code>$Color</code> column, we now want to save the output. There are several arguments for the <code>write.csv(...)</code> <a href="reference.html#function-call">function call</a>, a few of which are particularly important for how the data are exported. Let’s explore these now.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co">#Export the data. The write.csv() function requires a minimum of two arguments, the data to be saved and the name of the output file.</span>
<span class="kw">write.csv</span>(carSpeeds, <span class="dt">file=</span><span class="st">'data/car-speeds-cleaned.csv'</span>)</code></pre>
<p>If you open the file, you’ll see that it has header names, because the data had headers within R, but that there are numbers in the first column.</p>
<p><img src="fig/01-supp-csv-with-row-nums.png" alt="csv written without row.names argument" /></p>
<h3 id="the-row.names-argument">The <code>row.names</code> Argument</h3>
<p>This argument allows us to set the names of the rows in the output data file. R’s default for this argument is <code>TRUE</code>, and since it does not know what else to name the rows for the cars data set, it resorts to using row numbers. To correct this, we can set <code>row.names</code> to <code>FALSE</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">write.csv</span>(carSpeeds, <span class="dt">file=</span><span class="st">'data/car-speeds-cleaned.csv'</span>, <span class="dt">row.names=</span><span class="ot">FALSE</span>)</code></pre>
<p>Now we see:</p>
<p><img src="fig/01-supp-csv-without-row-nums.png" alt="csv written with row.names argument" /></p>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>there is also a <code>col.names</code> argument, which can be used to set the column names for a data set without headers. If the data set already has headers (e.g., we used the <code>headers = TRUE</code> argument when importing the data) then a <code>col.names</code> argument will be ignored.</p>
</div>
</aside>
<h3 id="the-na-argument">The <code>na</code> Argument</h3>
<p>There are times when we want to specify certain values for <code>NA</code>s in the data set (e.g., we are going to pass the data to a program that only accepts -9999 as a nodata value). In this case, we want to set the <code>NA</code> value of our output file to the desired value, using the na argument. Let’s see how this works:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co">#First, replace the speed in the 3rd row with NA, by using an index (square brackets to indicate the position of the value we want to replace)</span>
carSpeeds$Speed[<span class="dv">3</span>]<-<span class="ot">NA</span>
<span class="kw">head</span>(carSpeeds)</code></pre>
<pre class="output"><code> Color Speed State
1 Blue 32 NewMexico
2 Red 45 Arizona
3 Blue NA Colorado
4 White 34 Arizona
5 Red 25 Arizona
6 Blue 41 Arizona
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">write.csv</span>(carSpeeds, <span class="dt">file=</span><span class="st">'data/car-speeds-cleaned.csv'</span>, <span class="dt">row.names=</span><span class="ot">FALSE</span>)</code></pre>
<p>Now we’ll set <code>NA</code> to -9999 when we write the new .csv file:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co">#Note - the na argument requires a string input</span>
<span class="kw">write.csv</span>(carSpeeds, <span class="dt">file=</span><span class="st">'data/car-speeds-cleaned.csv'</span>, <span class="dt">row.names=</span><span class="ot">FALSE</span>, <span class="dt">na=</span> <span class="st">'-9999'</span>)</code></pre>
<p>And we see:</p>
<p><img src="fig/01-supp-csv-with-special-NA.png" alt="csv written with -9999 as NA" /></p>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Key Points</h2>
</div>
<div class="panel-body">
<ul>
<li>Import data from a .csv file using the <code>read.csv(...)</code> function.</li>
<li>Understand some of the key arguments available for importing the data properly</li>
<li><code>header</code></li>
<li><code>stringsAsFactors</code></li>
<li><code>as.is</code></li>
<li><code>strip.white</code></li>
<li>Write data to a new .csv file using the <code>write.csv(...)</code> function</li>
<li>Understand some of the key arguments available for exporting the data properly</li>
<li><code>row.names</code> / <code>col.names</code></li>
<li><code>na</code></li>
</ul>
</div>
</aside>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Next Steps</h2>
</div>
<div class="panel-body">
<p>We have now explored data import and export from .csv files. There are several more arguments available for both the <code>read.csv(...)</code> and <code>write.csv(...)</code> functions, as well as similar options for working with data that are stored in other text formats (e.g., .txt). To find out more, use the R help commands ‘?read.csv()’ and ‘?write.csv()’.</p>
<p>In the next lessons, we’ll explore how to work with and analyze data in the R environment.</p>
</div>
</aside>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/r-novice-inflammation">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>