Analysis and conversion of raw data set to tidy data set was done in following steps
- Extracting feature names from
features.txt
. - Since only the required features were needed from this huge list of 561 features.Required features were extracted from this list. This could have been done by either randomly searching through the whole list of features and pick columns names that have "mean" or "std" in their names using grep command. But to actually be sure of that we are picking only the necessary columns. Manually extracted the indexes of mean and time duration for time derived, eucleadian norm, fourier transform and angle derivation and use those indexes to extract required feature names and further the same indexes was used to extract required measurements from
subject_train.txt
&subject_test1.txt
This was done as follows. - timeDerivedIndexes<-c(1:6,41:46,81:86,121:126,161:166) eucleadianNormIndexes<-c(201:202,214:215,227:228,240:241,253:254) fftIndexes<-c(266:271,345:350,424:429) angleIndexes<-c(555:561) requiredIndexes<-c(timeDerivedIndexes,eucleadianNormIndexes,fftIndexes,angleIndexes)
- Required features(measurements) were extracted from test/train directory from the file
subject_train.txt
orsubject_test.txt
and were subsetted using the reqiuiredIndexes created in step 1.
- Subject id were extracted from
X_train.txt
orX_test.txt
.
- Activity corresponding to each event was taken from
y_train.txt
ory_test.txt
.
- Subjectids, features(measurements) and activity vector extracted in step 2,3,4 were column binded using
-
requiredSignals=cbind(subjectid,activityVector,requiredSignals)
- Step 2,3 and 4 were executed first for train data set and then for test data set and finally both the results were row binded using rbind command.
- Merged data obtained in previous step was sorted in the order of first by subject id and then by activity vector to have nice representation in the form of per subject per activity measurements.
- Activity labels were taken from
activity_labels.txt
and a replacement vector was created to replace activity numbers by their actual labels. -
activityLabels<-c("1"="WALKING","2"="WALKING_UPSTAIRS","3"="WALKING_DOWNSTAIRS","4"="SITTING","5"="STANDING","6"="LAYING")
- This was used to replace the column values in activity column by labels.
- Column names were converted into descriptive column names using gsub for replacing some of the patterns such as time for t frequency for f etc.
- Data was summarized using summarise_each function of dplyr package using mean as a function and grouped_by subject id.
- Finally as a last step cleaned data was written into a cleaned_data.txt file with row.names=FALSE and quote=FALSE parameter.
NOTE : The script run_analysis.R contains a function run_analysis(directory) that accepts a directory and assumes that the directory will have the structure of following form
- test
- subject_test.txt
- X_test.txt
- y_test.txt
- train
- subject_train.txt
- X_train.txt
- y_train.txt
- features.txt
- activity_labels.txt