The proliferation of affordable wearable technology, including devices like Fitbit, Jawbone Up, and Nike FuelBand, has made it easier to collect extensive data on personal activities. While these devices are often used to track the volume of physical activity, they can also provide valuable insights into the quality and performance of such activities. This project aims to predict the performance of six participants during barbell lifts, using data from accelerometers placed on their belt, forearm, arm, and dumbbell.
For more information about this dataset, please refer to the following source:
Dataset Source
(Package startup output: randomForest 4.6-14, ggplot2, lattice, and dplyr are loaded; the usual masking messages appear for `margin`, `combine`, `filter`, `lag`, `intersect`, `setdiff`, `setequal`, and `union`.)
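The messages above are produced when the analysis packages are attached; a minimal setup might look like the following (the seed value is a hypothetical choice for reproducibility):

```r
# Packages used throughout the analysis; attaching them produces
# the masking messages shown above.
library(randomForest)  # random forest implementation
library(caret)         # train(), trainControl(), confusionMatrix(); attaches ggplot2 and lattice
library(dplyr)         # data manipulation; masks filter(), lag(), combine()

set.seed(1234)         # hypothetical seed so results are reproducible
```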
To properly evaluate the performance of the model before applying it to the unseen test set, the training dataset (train_set) is split into two parts: an 80% training subset and a 20% validation subset. This split allows us to fine-tune the model using the training data while assessing its ability to generalize to new, unseen data using the validation set. By doing this, we ensure that our model is not overfitting to the training data and that its performance on the validation set provides a more accurate estimate of how it will perform on the test set.
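The split described above can be sketched with caret's `createDataPartition()`; the object name `train_set` comes from the text, while the subset names are illustrative:

```r
library(caret)

# Stratified 80/20 split on the target variable "classe",
# assuming the raw training data is already loaded as `train_set`.
set.seed(1234)
in_train   <- createDataPartition(train_set$classe, p = 0.8, list = FALSE)
training   <- train_set[in_train, ]   # 80% used for model fitting
validation <- train_set[-in_train, ]  # 20% held out for validation
```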
In this cleaning process, we reduce the number of variables from 160 to 53 by applying several steps. First, we remove columns that have the same value for all rows, as these do not provide any variability or meaningful information. Next, we eliminate columns with more than 50% missing values, ensuring the dataset remains informative and reducing noise. Finally, we discard irrelevant columns, such as timestamps and window identifiers, that are not useful for predicting the target variable, “classe.” These steps help retain only the most relevant and reliable features for model building, resulting in a more manageable and focused dataset.
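The three cleaning steps can be expressed as a short sketch; the metadata column names below are assumptions based on the dataset described in the introduction:

```r
# 1. Drop constant columns (same value in every row).
constant_cols <- sapply(training, function(x) length(unique(x)) <= 1)
training <- training[, !constant_cols]

# 2. Drop columns with more than 50% missing values.
mostly_na <- colMeans(is.na(training)) > 0.5
training <- training[, !mostly_na]

# 3. Drop timestamp and window columns that carry no predictive signal
#    (names assumed from the accelerometer dataset's conventions).
meta_cols <- c("raw_timestamp_part_1", "raw_timestamp_part_2",
               "cvtd_timestamp", "new_window", "num_window")
training <- training[, !(names(training) %in% meta_cols)]
```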
This plot illustrates the distribution of the target variable classe across different users (user_name) within the training set. Each point, jittered for clarity, represents an observation, with colors distinguishing the various classes. Count labels for each combination of user_name and classe provide a clear view of the data’s density and spread. The visualization confirms that the training data is well-balanced, with observations evenly distributed among classes and users, ensuring a robust foundation for model training and evaluation.
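A plot of this kind could be produced with ggplot2 roughly as follows; the jitter amounts and labelling choices are illustrative, not necessarily those used for the figure:

```r
library(ggplot2)
library(dplyr)

# Count observations per (user_name, classe) pair for the labels.
counts <- training %>% count(user_name, classe)

ggplot(training, aes(x = user_name, y = classe, colour = classe)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.3) +   # jittered points
  geom_label(data = counts, aes(label = n), colour = "black") +
  labs(x = "User", y = "Class",
       title = "Distribution of classe by user in the training set")
```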
Before analyzing the correlation heatmap, it’s important to examine the relationships among the numerical predictors. Because the target variable, “classe,” is categorical, the heatmap focuses on pairwise correlations between the numeric features: strongly correlated pairs carry largely redundant information. The heatmap below visualizes these correlations, allowing us to identify redundant features and reduce multicollinearity before model building.
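One minimal way to build such a heatmap, using only base R reshaping plus ggplot2 (other packages such as corrplot would work equally well):

```r
library(ggplot2)

# Correlation matrix over the numeric predictors only.
numeric_vars <- training[, sapply(training, is.numeric)]
corr_mat     <- cor(numeric_vars, use = "pairwise.complete.obs")

# Convert to long format (Var1, Var2, Freq) and draw a tile heatmap.
corr_df <- as.data.frame(as.table(corr_mat))
ggplot(corr_df, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 5),
        axis.text.y = element_text(size = 5)) +
  labs(fill = "r", title = "Pairwise correlations of numeric predictors")
```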
After cleaning the dataset, several steps were taken to reduce the number of variables. First, highly correlated variables (with a correlation above 0.9) were removed to prevent multicollinearity and improve model stability. Additionally, some redundant or irrelevant variables were discarded. As a result, the number of variables was reduced from 53 to 45, ensuring that only the most informative and independent features remain for model building. This process helps enhance the model’s performance and interpretability.
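caret's `findCorrelation()` implements exactly this kind of cutoff-based pruning; a sketch using the 0.9 threshold stated above:

```r
library(caret)

# Indices of the numeric columns and their correlation matrix.
numeric_cols <- which(sapply(training, is.numeric))
corr_mat     <- cor(training[, numeric_cols])

# findCorrelation() flags columns to drop so no pair exceeds the cutoff;
# map its indices back to positions in the full data frame.
drop_idx <- numeric_cols[findCorrelation(corr_mat, cutoff = 0.9)]
training <- training[, -drop_idx]
```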
To ensure a robust evaluation of the model’s performance, we employed 5-fold cross-validation during the training process. This technique splits the dataset into five parts, using four folds for training and one for testing in each iteration, rotating through all the folds. This approach helps assess how well the model generalizes to unseen data and minimizes the risk of overfitting.
We trained a Random Forest model using the train() function from the caret package, leveraging its built-in support for cross-validation. The model’s performance metrics, including accuracy, were averaged across the folds, providing an estimate of out-of-sample performance.
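The training setup described in the two paragraphs above can be sketched as follows; the model object name is illustrative:

```r
library(caret)

# 5-fold cross-validation: each fold serves once as the held-out part.
set.seed(1234)
ctrl <- trainControl(method = "cv", number = 5)

# Random forest on the cleaned predictors; caret averages the
# accuracy of the five folds into the reported estimate.
rf_model <- train(classe ~ ., data = training,
                  method = "rf", trControl = ctrl)

rf_model$results   # cross-validated accuracy per tuning setting
```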
The accuracy results for each fold are visualized in the plot below, showing the consistency of the model across the cross-validation iterations. This demonstrates stable performance, with minimal variation between folds, indicating a well-generalized model:
The line fluctuates only slightly between 0.992 and 0.996, indicating that the model’s performance is stable across the folds. This small variation suggests good generalization, with no signs of overfitting or underfitting, and consistent accuracy throughout the cross-validation process.
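Per-fold accuracies are stored by caret in the model's `resample` slot, so a plot like the one described above could be drawn with:

```r
library(ggplot2)

# rf_model$resample holds one row per fold (Accuracy, Kappa, Resample).
ggplot(rf_model$resample, aes(x = Resample, y = Accuracy, group = 1)) +
  geom_line() +
  geom_point() +
  labs(x = "Fold", y = "Accuracy",
       title = "Accuracy per cross-validation fold")
```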
The variable importance plot provides valuable insights into which features contribute most to the Random Forest model’s predictions. By using the varImp function, we extract the importance scores of each feature, which are then visualized in the plot. The top 10 most important features are displayed to highlight those that have the greatest impact on the model’s performance. This aggregated feature importance is calculated over the cross-validation folds, offering a more robust understanding of the relative significance of each variable. Identifying key features helps in interpreting the model and can guide further feature selection or engineering.
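The importance extraction and top-10 plot reduce to two calls:

```r
library(caret)

# varImp() aggregates importance scores from the fitted random forest;
# plot() renders a lattice dotplot of the highest-ranked features.
imp <- varImp(rf_model)
plot(imp, top = 10)
```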
## Validation Process and Results
After training our Random Forest model, it’s crucial to assess its performance on the validation set to ensure it generalizes well to new, unseen data. The model’s accuracy on the validation set provides a robust measure of its effectiveness. Below, we present the accuracy of the model on the validation set, which will help us determine if the model can reliably predict the performance of barbell lifts.
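Evaluating on the held-out 20% split is a two-step sketch, assuming the `validation` subset created during the earlier split:

```r
library(caret)

# Predict classe for the held-out observations and compare to truth;
# confusionMatrix() reports overall accuracy plus per-class statistics.
val_pred <- predict(rf_model, newdata = validation)
confusionMatrix(val_pred, validation$classe)
```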
The confusion matrix above shows how well the model performs on each class. The diagonal cells (A-A, B-B, C-C, D-D, E-E) give the number of correct predictions per class, and their counts are high, indicating strong predictive accuracy. For example, class A has 1115 correct predictions and class E has 713. Off-diagonal cells represent misclassifications, such as 5 instances where class A was predicted as class B. These errors are few, suggesting the model is generally robust, with the remaining confusion concentrated between similar classes such as C and D. Overall, the matrix indicates strong performance, and addressing these few misclassifications could further enhance accuracy.