adding my frequently used code chunks #3

Open
wants to merge 2 commits into
base: master


veeveetran

This .Rmd file contains some code chunks that I often refer to when cleaning data.


These are some code chunks that I frequently come back to when processing data for the Arctic Data Center.

#Reading in raw data
Collaborator

@isteves isteves Mar 20, 2018

GitHub is finicky about spaces after # for the headers, so make sure to include them! RStudio will preview it just fine, but GitHub won't. (`#Reading` --> `# Reading`)

#Reading in raw data
##Single data file
```{r eval=FALSE}
df <- read.table("path/to/data",
Collaborator

Any reason you use read.table rather than read.csv or read_csv? I'm curious, but it might also be worth adding that those other options exist.
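A minimal sketch of how the two base functions relate, assuming a small comma-separated file (the temporary file and its contents are made up for illustration). `read.csv` is `read.table` with CSV-friendly defaults such as `header = TRUE` and `sep = ","`; `read_csv` comes from the separate readr package and is not shown here.

```r
# Write a tiny CSV to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("site,temp", "A,1.5", "B,", "C,3.2"), tmp)

# read.table needs the header and separator spelled out
df_table <- read.table(tmp,
                       header = TRUE,
                       sep = ",",
                       na.strings = c("", "NA"),
                       stringsAsFactors = FALSE)

# read.csv already defaults to header = TRUE and sep = ","
df_csv <- read.csv(tmp,
                   na.strings = c("", "NA"),
                   stringsAsFactors = FALSE)

# Both calls should give the same data frame for a simple file like this
identical(df_table, df_csv)
```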

##Single data file
```{r eval=FALSE}
df <- read.table("path/to/data",
header=T,
Collaborator

There should be spaces around the = sign. Doesn't affect the code at all, but it makes it more readable (especially once your code gets long/complicated). This is our go-to reference for style: http://style.tidyverse.org/

```{r eval=FALSE}
dataList <- vector("list", length(rawPaths)) # makes an empty list with same length as file paths vector
i=0
for(i in 1:length(rawPaths)){
Collaborator

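It's generally better practice to use seq_along(rawPaths) rather than 1:length(rawPaths) (which I also do all the time). It allows the code to fail more gracefully: if rawPaths is empty, 1:length(rawPaths) evaluates to c(1, 0) and the loop body still runs, whereas seq_along(rawPaths) skips the loop entirely. See the discussion here: https://stackoverflow.com/questions/24917228/proper-way-to-loop-over-the-length-of-a-dataframe-in-r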
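A quick sketch of the empty-vector case, using an empty character vector as a stand-in for `rawPaths` (for example, when no files were found). The counter `n_runs` is only there for illustration.

```r
rawPaths <- character(0)  # stand-in: no file paths were found

# 1:length(x) counts DOWN from 1 to 0 when x is empty
bad_index <- 1:length(rawPaths)

# seq_along(x) returns an empty integer vector instead
good_index <- seq_along(rawPaths)

# With seq_along() the loop body never runs on empty input
n_runs <- 0
for (i in seq_along(rawPaths)) {
  n_runs <- n_runs + 1
}
```

With `1:length(rawPaths)` the same loop would run twice, first with `i = 1` and then with `i = 0`, and a call like `read.table(rawPaths[1])` would fail with a confusing error.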


Read in data using a for loop. Remember to initialize all variables that you will be using outside of the for loop.
```{r eval=FALSE}
dataList <- vector("list", length(rawPaths)) # makes an empty list with same length as file paths vector
Collaborator

Great job initializing a list! I always have to stop myself from growing vectors.

for(i in 1:length(rawPaths)){
dataList[[i]] <- read.table(rawPaths[i],
na.strings = c("", "NA"),
header=T)
Collaborator

It looks like the indentation is a little bit off here (though maybe it's GitHub, I'm not sure). A neat trick I learned from Bryce is to highlight your code and then use Cmd + I to fix the indentation!

header=T)
}
```
Note: list() creates an empty list of length 0. However, vector("list", length(rawPaths)) allocates a designated number of slots within the list up front, instead of the list being extended every time the for loop iterates. With a small number of iterations, the time it takes for the code to run is not noticeable. However, for a large number of iterations, not allocating space will cause the code to run very slowly.
Collaborator

Perhaps this reference (or something similar) is worth including in here: https://paulvanderlaken.com/2017/10/13/functional-programming-and-why-you-should-not-grow-vectors-in-r/
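To make the note concrete, here is a small sketch comparing the two patterns on a toy computation (squaring numbers stands in for reading files; `n`, `grown`, `filled`, and `squares` are illustrative names). Both loops give the same result, and the difference only shows up in run time as `n` gets large.

```r
n <- 1000

# Growing: every c() call copies the whole vector built so far
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: the slots exist up front and are filled in place
filled <- vector("numeric", n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}

# Same values either way
identical(grown, filled)

# lapply() sidesteps the question by allocating the result list itself
squares <- lapply(seq_len(n), function(i) i^2)
```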


Iterate through all the rows in a data frame.
allRows is a vector containing "TRUE" and "FALSE". Each element corresponds to a row in dataFrame.
is.na(dataFrame[i,]) outputs "TRUE" if the row contains at least one blank cell, and "FALSE" otherwise.
Collaborator

You can use ` to indicate code within sentences in R Markdown (like we do in Slack).
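For the row check described above, a minimal sketch with a made-up `dataFrame`. Note that `is.na(dataFrame[i, ])` returns one logical per cell, so this sketch assumes the result is wrapped in `any()` to get a single `TRUE`/`FALSE` per row. Blank cells only count as `NA` if the data were read with something like `na.strings = c("", "NA")`.

```r
# Hypothetical data frame with missing values in rows 2 and 3
dataFrame <- data.frame(a = c(1, NA, 3),
                        b = c("x", "y", NA),
                        stringsAsFactors = FALSE)

# Preallocate one logical per row, then fill it in
allRows <- logical(nrow(dataFrame))
for (i in seq_len(nrow(dataFrame))) {
  # any() collapses the per-cell result into one value for the row
  allRows[i] <- any(is.na(dataFrame[i, ]))
}

# Vectorised equivalent without a loop
allRowsVec <- !complete.cases(dataFrame)
```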


#Searching Through Strings - Dates

Use the grepl() function to search for a particular string. Since we often have to reformat dates in our data sets, searching for particular dates or times could be useful.
Collaborator

Perhaps this would be a good place to introduce some helpful resources. I personally like this cheatsheet: https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
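As a small illustration of why the pattern matters, here is a sketch with made-up dates. A loose pattern like `"/16"` matches any date containing that text, including one where 16 is the day, while an anchored pattern only matches a two-digit year at the end of the string.

```r
# Hypothetical mix of date formats
dates <- c("03/16/17", "01/05/16", "2016-01-05")

# Loose: matches "/16" anywhere, so March 16, 2017 is caught too
loose <- grepl("/16", dates)

# Stricter: month/day/year layout with the year 16 at the end
strict <- grepl("^[0-9]{1,2}/[0-9]{1,2}/16$", dates)
```

Here `loose` flags the first two elements, while `strict` flags only the second.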


Run unique() to see what kind of formats there are.
```{r}
unique(dates)
Collaborator

Discovered the get_dupes function yesterday. Could be interesting to add! (or at least link to) https://cran.r-project.org/web/packages/janitor/vignettes/introduction.html
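For readers without janitor installed, a rough base R sketch of the same idea: pulling out every row that has a duplicate. This is not the `get_dupes` implementation, only an approximation under the assumption that whole-row duplicates are what matter; the data frame `df` is made up.

```r
# Hypothetical data with one repeated row
df <- data.frame(id = c(1, 2, 2, 3),
                 date = c("2016-01-01", "2016-01-02",
                          "2016-01-02", "2016-01-03"),
                 stringsAsFactors = FALSE)

# duplicated() flags the second and later copies; scanning from the
# end as well also flags the first copy of each duplicated row
is_dupe <- duplicated(df) | duplicated(df, fromLast = TRUE)
dupes <- df[is_dupe, ]
```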


```{r}
indDates1 <- which(grepl("/16",dates))
dates[indDates1] <- format(as.POSIXct(dates[indDates1], tz = "", format="%m/%d/%y"), format = "%Y-%m-%d")
Collaborator

I like to use the lubridate package to work with dates. If you haven't tried it, I'd definitely recommend checking it out! https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html There are also some other date/time packages, but I'm not as familiar with them. tibbletime is another one that seems promising.
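A hedged sketch of what the conversion might look like with lubridate's `mdy()`, with a base R fallback so the chunk still runs when the package is not installed. The `dates` vector and the `is_mdy` helper are made up for illustration.

```r
# Hypothetical mix of ISO dates and month/day/two-digit-year dates
dates <- c("2016-03-14", "03/15/16", "12/01/16")

# Find the entries in m/d/yy layout
is_mdy <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$", dates)

if (requireNamespace("lubridate", quietly = TRUE)) {
  # mdy() parses month-day-year strings into Date objects
  converted <- lubridate::mdy(dates[is_mdy])
} else {
  # Base R equivalent with an explicit format string
  converted <- as.Date(dates[is_mdy], format = "%m/%d/%y")
}

# Write the converted values back in YYYY-MM-DD form
dates[is_mdy] <- format(converted, "%Y-%m-%d")
```

Working with `Date` rather than `POSIXct` avoids having to think about time zones when the values carry no time of day.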

3 participants