forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
86 lines (73 loc) · 3.68 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
# Reproducible Research: Peer Assessment 1
## Loading and preprocessing the data
```{r}
#it is assumed that 'activity.zip' file is in working directory
unzip(zipfile="activity.zip")
data <- read.csv("activity.csv", na.strings = "NA", header = TRUE)
#tranform date variable to the date data type
data$date = as.Date(as.character(data$date), "%Y-%m-%d")
```
## What is mean total number of steps taken per day?
```{r}
#calculate the number of steps taken per day
daysteps<-tapply(data$steps, data$date, sum, na.rm=TRUE)
#create histogram
par(mar=c(5,4,2,2))
hist(daysteps, xlab="Number of steps per day", main=NULL)
mean1<-mean(daysteps)
print(mean1)
median1<-median(daysteps)
print(median1)
```
The mean of total number of steps taken per day was *`r mean1`*, and the median total number of steps taken per day was *`r median1`*.
## What is the average daily activity pattern?
```{r}
#Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
data2<-aggregate(steps~interval, data=data, FUN=mean, na.rm=TRUE)
par(mar=c(5,4,2,2))
plot(data2$interval, data2$steps, type='l', ylab= "Average No of Steps", xlab= "Five-minute time interval")
#Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
maxinterval<-data2[data2$steps==max(data2$steps),'interval']
maxsteps<-data2[data2$steps==max(data2$steps),'steps']
```
The *`r maxinterval`*-th five-minute interval contained the maximal *`r maxsteps`* average number of steps.
## Imputing missing values
There are a number of days/intervals where there are missing values (coded as NA).
The presence of missing days may introduce bias into some calculations or summaries of the data.
```{r}
#Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)
NoMissing <- is.na(data$steps)
totNA <- sum(NoMissing)
```
The `r totNA` number of missing values were substitued using the mean value of particular five-minute interval.
The new dataset that is equal to the original dataset but with the missing data filled in was created.
I found [here] [1], how to do that, in easy way.
Then the histogram of the total number of steps taken each day was created and
the mean and median total number of steps taken per day was calculated.
[1]: http://stackoverflow.com/questions/20273070/function-to-impute-missing-value/
```{r}
cleandata<-data
cleandata$steps[is.na(cleandata$steps)] <- ave(cleandata$steps, cleandata$interval,
FUN = function(x)
mean(x, na.rm = TRUE))[c(which(is.na(cleandata$steps)))]
a<-tapply(cleandata$steps,cleandata$date,sum)
par(mar=c(5,4,2,2))
hist(a, xlab="Number of steps per day", main=NULL)
mean(a)
median(a)
```
We can see that these values differ from the estimates calculated in the first part of the assignment. They are both higher now. Moreover, the distribution of the data is much symmetrical now, since the median and mean values are identical.
## Are there differences in activity patterns between weekdays and weekends?
```{r results="hide"}
#since I'm using non english system I overwrite local language settings to get the days in english.
Sys.setlocale("LC_TIME", "English")
```
```{r}
cleandata$weekend<-sapply(cleandata$date,function(x){
if (weekdays(x)=="Sunday" | weekdays(x) == "Saturday") {"weekend"}
else {"weekday"}
})
data3<-aggregate(steps~interval+weekend, data=cleandata, FUN=mean)
require(lattice)
xyplot(steps~interval | weekend, type='l', data=data3, layout = c(1, 2), ylab= "Average No of Steps", xlab= "Five-minute time interval")
```