Practical Machine Learning
Lorenz Hexemer
14 August 2022
Accelerometers can be used to quantify body movement. The data used in this project were collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal of this project is to predict the manner in which they performed the exercises. To this end, after an exploratory data analysis, several machine learning models are trained and compared on their prediction performance. Cross-validation is used to estimate the out-of-sample error.
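The chunks below rely on several packages that are never loaded explicitly (presumably they were attached in the original session). A minimal setup sketch; rpart and randomForest are loaded later, where they are first used:
library(caret)      # createDataPartition, train, trainControl, confusionMatrix, multiClassSummary
library(factoextra) # fviz_eig, fviz_pca_ind, fviz_pca_var
library(ggplot2)    # bar chart in the model comparison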
full_data <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!", ""))
# drop columns that contain only missing values
full_data <- full_data[, colSums(is.na(full_data)) == 0]
# drop the first 7 metadata columns (row index, user name, timestamps, window info)
full_data <- full_data[, -c(1:7)]
# coerce all predictor columns (everything except the outcome "classe") to numeric
varNames <- names(full_data)
for (i in 1:(length(varNames)-1)) {
    full_data[,i] <- as.numeric(full_data[,i])
}
# on R >= 4.0, read.csv returns characters, so the outcome must be made a factor
full_data$classe <- as.factor(full_data$classe)
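As a quick sanity check, the cleaned data can be inspected before splitting. A minimal sketch; the expected values assume the standard pml-training.csv file from the Weight Lifting Exercise dataset:
dim(full_data)   # expected: 19622 rows, 53 columns (52 numeric predictors + classe)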
# split the data: 75% for training, 25% held out for estimating the out-of-sample error
subSamples <- createDataPartition(y=full_data$classe, p=0.75, list=FALSE)
subTraining <- full_data[subSamples, ]
subTesting <- full_data[-subSamples, ]
table(subTraining$classe)
##
## A B C D E
## 4185 2848 2567 2412 2706
First, variables that contain only NAs are removed. The variable X is also removed: since the data set is sorted by class, this row index would implicitly label the classes. In addition, the next 6 variables ("user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window" and "num_window") are removed since they do not encode information about the movements.
The 5 classes A to E correspond to the different ways in which the barbell lifts were performed. The classes are unevenly represented in the data set: class A occurs noticeably more often than the other classes.
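The imbalance can be quantified directly from the class counts shown above; a minimal sketch:
round(prop.table(table(subTraining$classe)), 3)
# class A accounts for roughly 28% of the training rows, classes B-E for about 16-19% each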
Many variables are uncorrelated and hence probably contribute independent information. No single variable, and no combination of two variables, shows a clear separation of the classes.
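One way to back up this claim is to count strongly correlated predictor pairs, e.g. with caret's findCorrelation; a sketch in which the 0.8 cutoff is an illustrative choice, not taken from the original analysis:
corMat <- cor(subTraining[, -53])                  # correlations among the 52 numeric predictors
highCor <- findCorrelation(corMat, cutoff = 0.8)   # columns with |r| > 0.8 to some other column
length(highCor)                                    # should be small if most variables are uncorrelated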
# PCA on the 52 predictors (column 53 is the outcome "classe")
pcaResult <- prcomp(subTraining[,-53], scale=TRUE)
f3 <- fviz_eig(pcaResult)                                                     # scree plot
f4 <- fviz_pca_ind(pcaResult, col.ind = subTraining$classe, geom = "point")   # observations in PC space, colored by class
f5 <- fviz_pca_var(pcaResult)                                                 # variable loadings
The first two principal components together explain only about 15% of the variance. Accordingly, the classes are not easily separable in the PCA projection (see figures in the appendix).
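The number behind the ~15% statement can be read off the PCA object directly; a minimal sketch:
explainedVar <- pcaResult$sdev^2 / sum(pcaResult$sdev^2)   # per-component share of variance
round(cumsum(explainedVar)[1:5], 3)                        # cumulative share; the first two PCs give ~0.15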
Different models are trained on the same data: a multinomial logistic regression model, decision trees with and without boosting, and a random forest. In the next section, the best model is selected.
# a single repeat of repeated k-fold cross-validation (10 folds, the caret default)
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 1,
                     classProbs = TRUE,
                     #preProcOptions = list(pcaComp = 46),
                     summaryFunction = multiClassSummary,
                     verboseIter = FALSE)   # trainControl has no "verbose" argument
# multinomial logistic regression via nnet::multinom (no family argument is needed)
logitModel <- train(classe ~ ., subTraining,
                    method = "multinom",
                    metric = "Accuracy",
                    verbose = FALSE,
                    preProcess=c("center", "scale"), #,"pca"
                    trControl = ctrl)
predictTest <- predict(logitModel, subTesting)
# confusionMatrix expects the predictions first and the reference labels second
logitConfusion <- confusionMatrix(predictTest, subTesting$classe)
logitOSPR <- mean(predictTest != subTesting$classe)   # estimated out-of-sample error
print(logitConfusion)
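The headline numbers can also be extracted programmatically from the confusionMatrix object; a short sketch:
logitConfusion$overall["Accuracy"]   # overall accuracy on the held-out split
logitOSPR                            # estimated out-of-sample error (1 - accuracy)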
As an alternative, a decision tree model is fitted.
library(rpart)
# single classification tree as a baseline
decisionTree <- rpart(classe ~ ., data=subTraining, method="class")
predictTest <- predict(decisionTree, subTesting, type = "class")
treeOSPR <- mean(predictTest != subTesting$classe)   # estimated out-of-sample error
treeConfusion <- confusionMatrix(predictTest, subTesting$classe)
print(treeConfusion)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1250 205 25 87 41
## B 40 517 32 22 61
## C 33 105 687 124 108
## D 45 72 59 513 59
## E 27 50 52 58 632
##
## Overall Statistics
##
## Accuracy : 0.7339
## 95% CI : (0.7213, 0.7462)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6618
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8961 0.5448 0.8035 0.6381 0.7014
## Specificity 0.8980 0.9608 0.9086 0.9427 0.9533
## Pos Pred Value 0.7774 0.7693 0.6500 0.6858 0.7717
## Neg Pred Value 0.9560 0.8979 0.9563 0.9300 0.9341
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2549 0.1054 0.1401 0.1046 0.1289
## Detection Prevalence 0.3279 0.1370 0.2155 0.1525 0.1670
## Balanced Accuracy 0.8970 0.7528 0.8561 0.7904 0.8274
The performance of decision trees can be improved by boosting.
# five repeats of 10-fold cross-validation for the boosted model
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     classProbs = TRUE,
                     #preProcOptions = list(pcaComp = 46),
                     summaryFunction = multiClassSummary)
# gradient-boosted classification trees (gbm)
boostedTree <- train(classe ~ ., subTraining,
                     method = "gbm",
                     metric = "Accuracy",
                     verbose = FALSE,
                     preProcess=c("center", "scale"), #,"pca"
                     trControl = ctrl)
predictTest <- predict(boostedTree, subTesting)
boostTreeOSPR <- mean(predictTest != subTesting$classe)   # estimated out-of-sample error
# note: truth is passed as the first (data) argument and predictions as the second (reference),
# so the "Prediction"/"Reference" labels in the printout below are swapped
boostTreeConfusion <- confusionMatrix(subTesting$classe, predictTest)
print(boostTreeConfusion)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1370 15 7 2 1
## B 23 896 26 3 1
## C 0 26 821 8 0
## D 0 6 27 768 3
## E 3 9 4 25 860
##
## Overall Statistics
##
## Accuracy : 0.9615
## 95% CI : (0.9557, 0.9667)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9513
##
## Mcnemar's Test P-Value : 1.997e-07
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9814 0.9412 0.9277 0.9529 0.9942
## Specificity 0.9929 0.9866 0.9915 0.9912 0.9898
## Pos Pred Value 0.9821 0.9442 0.9602 0.9552 0.9545
## Neg Pred Value 0.9926 0.9858 0.9842 0.9907 0.9988
## Prevalence 0.2847 0.1941 0.1805 0.1644 0.1764
## Detection Rate 0.2794 0.1827 0.1674 0.1566 0.1754
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9871 0.9639 0.9596 0.9720 0.9920
Finally, a random forest model is fitted on the same training data.
library(randomForest)
# classification mode is inferred from the factor outcome; "method" is not a randomForest argument
randomForestModel <- randomForest(classe ~ ., data=subTraining)
predictTest <- predict(randomForestModel, subTesting)
randForestOSPR <- mean(predictTest != subTesting$classe)   # estimated out-of-sample error
# as above, truth and predictions are swapped relative to confusionMatrix(data, reference)
randForestConfusion <- confusionMatrix(subTesting$classe, predictTest)
print(randForestConfusion)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 1 0 0 0
## B 8 940 1 0 0
## C 0 3 850 2 0
## D 0 0 12 791 1
## E 0 0 0 3 898
##
## Overall Statistics
##
## Accuracy : 0.9937
## 95% CI : (0.991, 0.9957)
## No Information Rate : 0.2859
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.992
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9943 0.9958 0.9849 0.9937 0.9989
## Specificity 0.9997 0.9977 0.9988 0.9968 0.9993
## Pos Pred Value 0.9993 0.9905 0.9942 0.9838 0.9967
## Neg Pred Value 0.9977 0.9990 0.9968 0.9988 0.9998
## Prevalence 0.2859 0.1925 0.1760 0.1623 0.1833
## Detection Rate 0.2843 0.1917 0.1733 0.1613 0.1831
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9970 0.9967 0.9918 0.9953 0.9991
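Beyond raw accuracy, the random forest also yields a variable-importance ranking, useful for interpreting the winning model. A sketch, not part of the original model selection:
imp <- importance(randomForestModel)   # MeanDecreaseGini per predictor
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)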
To compare the models, the out-of-sample error is used. It is estimated by evaluating the prediction performance on the held-out test set, which was not used for fitting.
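In code, this estimate is simply the misclassification rate on the held-out split, the pattern used for every model above:
ose <- function(pred, truth) mean(pred != truth)   # estimated out-of-sample error
# e.g. ose(predictTest, subTesting$classe) reproduces randForestOSPR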
OSPR <- data.frame(Model = c("Logistic Model", "Decision Tree", "Boosted Decision Tree", "Random Forest"),
                   OSPR = c(logitOSPR, treeOSPR, boostTreeOSPR, randForestOSPR))
ggplot(OSPR, aes(y=OSPR, x=Model)) + geom_bar(stat="identity")
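Alongside the bar chart, the same numbers can be printed as a table; a sketch using knitr:
knitr::kable(OSPR, digits = 3)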
The decision tree model did not perform better than the logistic model. Gradient boosting, however, improved the decision tree enough to achieve an out-of-sample error below 5%. The random forest model is selected for the final prediction, since its estimated out-of-sample error is below 1%.
In this section, the best model is used to predict the outcome of the 20 cases from the independent testing dataset.
# apply the same cleaning steps to the independent test set
full_data <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!", ""))
full_data <- full_data[, colSums(is.na(full_data)) == 0]
full_data <- full_data[, -c(1:7)]
varNames <- names(full_data)
for (i in 1:(length(varNames)-1)) {   # the last column is problem_id, not a predictor
    full_data[,i] <- as.numeric(full_data[,i])
}
predict(randomForestModel,full_data)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
f2   # figure from the exploratory correlation analysis (defined in a chunk not shown above)
f3   # PCA scree plot
f4   # PCA: observations in PC1/PC2, colored by class
f5   # PCA: variable loadings