adding my frequently used code chunks #3
These are some code chunks that I frequently come back to when processing data for the Arctic Data Center.

#Reading in raw data
GitHub is finicky about spaces after `#` for the headers, so make sure to include them! RStudio will preview it just fine, but GitHub won't. (`#Reading` --> `# Reading`)
#Reading in raw data
##Single data file
```{r eval=FALSE}
df <- read.table("path/to/data",
```
Any reason you use `read.table` rather than `read.csv` or `read_csv`? I'm curious, but it might also be worth adding that those other options exist.
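For comparison, here is a minimal sketch of how the base R readers relate. The temp file, its contents, and the column names are made up for illustration. `read.csv()` is a wrapper around `read.table()` with comma-separated defaults; `readr::read_csv()` is only mentioned in a comment because it needs the readr package.

```r
# Write a small, hypothetical CSV to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("site,temp", "A,1.5", "B,", "C,3.0"), tmp)

# read.table() needs the header and separator spelled out.
df_table <- read.table(tmp,
                       header = TRUE,
                       sep = ",",
                       na.strings = c("", "NA"),
                       stringsAsFactors = FALSE)

# read.csv() sets header = TRUE and sep = "," by default.
df_csv <- read.csv(tmp,
                   na.strings = c("", "NA"),
                   stringsAsFactors = FALSE)

# readr::read_csv(tmp) would return a tibble instead (requires readr).
```

For a plain comma-separated file like this one, both calls should produce the same columns, so the choice is mostly about defaults and convenience.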
##Single data file
```{r eval=FALSE}
df <- read.table("path/to/data",
header=T,
```
There should be spaces around the `=` sign. Doesn't affect the code at all, but it makes it more readable (especially once your code gets long/complicated). This is our go-to reference for style: http://style.tidyverse.org/
```{r eval=FALSE}
dataList <- vector("list", length(rawPaths)) # makes an empty list with same length as file paths vector
i=0
for(i in 1:length(rawPaths)){
```
It's generally better practice to use `seq_along(rawPaths)` rather than `1:length(rawPaths)` (a habit I fall into all the time, too). When the vector is empty, `seq_along()` skips the loop instead of iterating over `c(1, 0)`, so the code fails more gracefully. See the discussion here: https://stackoverflow.com/questions/24917228/proper-way-to-loop-over-the-length-of-a-dataframe-in-r
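A quick sketch of the difference, assuming `rawPaths` happens to be empty (for example, because no files matched). The variable names other than `rawPaths` are just for illustration.

```r
rawPaths <- character(0)  # hypothetical: no files were found

# 1:length(x) counts down from 1 to 0 when x is empty,
# so a loop over it would run twice with invalid indices.
bad_index <- 1:length(rawPaths)

# seq_along(x) returns an empty integer vector, so the loop body is skipped.
good_index <- seq_along(rawPaths)

iterations <- 0
for (i in good_index) {
  iterations <- iterations + 1
}
```

With the `seq_along()` version, `iterations` stays at 0 and nothing tries to read a file that does not exist.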
Read in data using a for loop. Remember to initialize all variables that you will be using outside of the for loop.
```{r eval=FALSE}
dataList <- vector("list", length(rawPaths)) # makes an empty list with same length as file paths vector
```
Great job initializing a list! I always have to stop myself from growing vectors.
for(i in 1:length(rawPaths)){
dataList[[i]] <- read.table(rawPaths[i],
na.strings = c("", "NA"),
header=T)
It looks like the indentation is a little bit off here (though maybe it's GitHub, I'm not sure). A neat trick I learned from Bryce is to highlight your code and then use Cmd + I to fix the indentation!
```{r eval=FALSE}
header=T)
}
```
Note: list() creates an empty list of length 0. However, vector("list", length(rawPaths)) allocates a designated number of slots within the list instead of the list being constantly updated every time the for loop iterates. With a small number of iterations, the time it takes for the code to run is not noticeable. However, for a large number of iterations, not allocating space will cause the code to run very slowly.
Perhaps this reference (or something similar) is worth including in here: https://paulvanderlaken.com/2017/10/13/functional-programming-and-why-you-should-not-grow-vectors-in-r/
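To make the note concrete, here is a small sketch that contrasts the two approaches. The iteration count `n` and the squared values are placeholders; both loops build the same list, only the allocation strategy differs.

```r
n <- 1000  # hypothetical number of iterations

# Growing: the list starts empty and may be reallocated as it gets longer.
grown <- list()
for (i in seq_len(n)) {
  grown[[length(grown) + 1]] <- i^2
}

# Preallocating: the slots exist up front and are filled by index.
filled <- vector("list", n)
for (i in seq_len(n)) {
  filled[[i]] <- i^2
}

# Both approaches end with the same contents.
same_result <- identical(grown, filled)
```

Wrapping each loop in `system.time()` with a much larger `n` is one way to see the timing difference on your own machine.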
Iterate through all the rows in a data frame.
allRows is a vector containing "TRUE" and "FALSE". Each element corresponds to a row in dataFrame.
is.na(dataFrame[i,]) outputs "TRUE" if the row contains at least one blank cell, and "FALSE" otherwise.
You can use backticks (`) to indicate code within sentences in R Markdown (like we do in Slack).
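For reference, one way the row check described in that text might look. `dataFrame` and `allRows` follow the names in the diff; the sample data and `allRowsVectorized` are made up for illustration. Note that `is.na()` returns one value per cell, so this sketch assumes it is wrapped in `any()` to get a single TRUE/FALSE per row.

```r
# Hypothetical example data; NA marks a blank cell.
dataFrame <- data.frame(id = c(1, 2, 3),
                        value = c(10, NA, 30),
                        label = c("a", "b", NA),
                        stringsAsFactors = FALSE)

# Preallocate one logical per row, then fill it in a loop.
allRows <- logical(nrow(dataFrame))
for (i in seq_len(nrow(dataFrame))) {
  # any() collapses the per-cell results for row i into a single value.
  allRows[i] <- any(is.na(dataFrame[i, ]))
}

# Vectorized equivalent without a loop.
allRowsVectorized <- !complete.cases(dataFrame)
```

The loop version mirrors the text; `complete.cases()` is a shorter option when the per-row logic is just "any missing value".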
#Searching Through Strings - Dates

Use the grepl() function to search for a particular string. Since we often have to reformat dates in our data sets, searching for particular dates or times could be useful.
Perhaps this would be a good place to introduce some helpful resources. I personally like this cheatsheet: https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
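As a small illustration of `grepl()` with a regular expression, here is a sketch that flags date strings by format. The sample `dates` vector and the names `isSlashFormat` and `isIsoFormat` are assumptions, not part of the original file.

```r
# Hypothetical date strings in mixed formats.
dates <- c("03/14/16", "2016-03-15", "3/16/16", "2016-03-17")

# grepl() returns TRUE/FALSE for each element, depending on whether it matches.
# Pattern: one or two digits, slash, one or two digits, slash, two digits.
isSlashFormat <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$", dates)

# Pattern: four-digit year, dash, two-digit month, dash, two-digit day.
isIsoFormat <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", dates)

# Subset the entries that still need reformatting.
needsReformat <- dates[isSlashFormat]
```

Anchoring the pattern with `^` and `$` keeps a partial match elsewhere in the string from being counted.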
Run unique() to see what kind of formats there are.
```{r}
unique(dates)
```
Discovered the `get_dupes` function yesterday. Could be interesting to add (or at least link to)! https://cran.r-project.org/web/packages/janitor/vignettes/introduction.html
```{r}
indDates1 <- which(grepl("/16",dates))
dates[indDates1] <- format(as.POSIXct(dates[indDates1], tz = "", format="%m/%d/%y"), format = "%Y-%m-%d")
```
I like to use the `lubridate` package to work with dates. If you haven't tried it, I'd definitely recommend checking it out! https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html There are also some other date/time packages, but I'm not as familiar with them. `tibbletime` is another one that seems promising.
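A rough sketch of the same conversion, under a few assumptions: the sample `dates` vector is made up, the pattern is anchored so that a plain `"/16"` search does not also catch the 16th day of a month, and the `lubridate::mdy()` call only runs if the package is installed.

```r
# Hypothetical date strings in mixed formats.
dates <- c("03/14/16", "2016-03-15", "3/16/16")

# Select the month/day/two-digit-year entries by pattern.
indDates1 <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$", dates)

# Base R: parse with an explicit format, then write back as year-month-day strings.
dates[indDates1] <- format(as.Date(dates[indDates1], format = "%m/%d/%y"),
                           "%Y-%m-%d")

# lubridate equivalent of the parsing step, run only if the package is available.
if (requireNamespace("lubridate", quietly = TRUE)) {
  parsed <- lubridate::mdy(c("03/14/16", "3/16/16"))
}
```

Using `as.Date()` rather than `as.POSIXct()` sidesteps time zones when the values carry no time of day.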
This .Rmd file contains some code chunks that I often refer to when cleaning data.