Lab08 caret Random Forest, KNN, GLMNet

1 Knowledge Mining: caret - Random Forest, KNN, GLMNet
In this lab, we will focus on the caret package and its use for building Random Forest models. Random Forest is an ensemble learning method that constructs many decision trees and aggregates their predictions, by majority vote for classification and by averaging for regression, to produce accurate, well-generalized results. While the code in this lab demonstrates Random Forest, similar approaches can be applied with other models such as K-Nearest Neighbors (KNN) and GLMNet.
KNN is a simple, instance-based learning algorithm that works well for smaller datasets but can struggle with larger ones, whereas GLMNet applies regularization techniques to handle high-dimensional data effectively. These methods offer different advantages depending on the problem and dataset at hand.
We reference Kuhn (2008) and the caret documentation (2020) for additional insight into building predictive models.
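Because caret wraps all of these models behind a common interface, trying KNN or GLMNet is largely a matter of changing the method argument of train(). The following is a minimal sketch, not part of the lab's main workflow (it previews the iris data from Example 1; method = "glmnet" additionally requires the glmnet package to be installed):

# A minimal sketch of swapping models via caret's method argument.
# "knn" ships with caret; "glmnet" requires the glmnet package.
ctrl <- caret::trainControl(method = "cv", number = 10)
knn_model <- caret::train(Species ~ ., data = iris, method = "knn", trControl = ctrl)
glmnet_model <- caret::train(Species ~ ., data = iris, method = "glmnet", trControl = ctrl)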
1.1 Load Required Libraries

# install.packages(c("caret", "dplyr", "ggplot2", "tidyr"))
library(caret)
library(dplyr)
library(tidyr)
library(ggplot2)
2 Example 1: Random Forest Classification with the Iris Dataset
We will use the iris dataset to demonstrate the implementation of a classification model using Random Forests.
2.1 Data Preparation
First, load the iris dataset and visualize the relationship between sepal length, sepal width, and species.
# Load the iris dataset
data(iris)
# Examine the dataset
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# Data visualization
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
theme_bw()
2.2 Train/Test Split
Next, split the data into a training set (70%) and a testing set (30%).
# Split the data into training and testing sets (70% train, 30% test)
set.seed(123)
train_index <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
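The base-R sample() split above draws rows completely at random. caret also provides createDataPartition(), which produces a stratified split that preserves the class proportions of Species; a minimal optional sketch (variable names here are illustrative):

# Optional: a stratified split that keeps class frequencies balanced
set.seed(123)
strat_index <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
strat_train <- iris[strat_index, ]
strat_test <- iris[-strat_index, ]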
2.3 Train the Random Forest Model
We set up 10-fold cross-validation and train the model using the train() function from the caret package; tuneLength = 3 asks caret to evaluate three candidate values of the mtry tuning parameter.
# Set up the training control
train_control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
# Train the model
set.seed(123)
model <- caret::train(Species ~ ., data = train_data,
                      method = "rf", # Random Forest
                      trControl = train_control,
                      tuneLength = 3,
                      preProcess = c("center", "scale"))
# Print the model details
print(model)
Random Forest
105 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 95, 95, 95, 95, 93, 95, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9518182 0.9274934
3 0.9518182 0.9274934
4 0.9518182 0.9274934
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
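Before evaluating on the test set, it can be useful to inspect the fitted object. caret's varImp() and the plot() method on a train object are quick optional checks (not part of the original workflow):

# Optional checks on the tuned model
varImp(model) # ranks predictors by importance
plot(model)   # cross-validated accuracy across the mtry grid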
2.4 Evaluate the Model
Finally, predict the species for the test data and calculate the accuracy of the model.
# Make predictions on the test data
predictions <- predict(model, test_data)

# Calculate the accuracy of the model
accuracy <- mean(predictions == test_data$Species)
cat("Accuracy:", accuracy)
Accuracy: 0.9777778
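Accuracy alone can hide class-level errors. As an optional follow-up, caret's confusionMatrix() reports the full cross-tabulation along with per-class sensitivity, specificity, and Kappa:

# Per-class breakdown of test-set performance
confusionMatrix(predictions, test_data$Species)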
3 Example 2: Random Forest Regression with the mtcars Dataset
Now, we will demonstrate regression using the mtcars dataset.
3.1 Data Preparation
Load the dataset and split it similarly to the previous example.
# Load the mtcars dataset
data(mtcars)
# Examine the dataset
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Split the data into training and testing sets (70% train, 30% test)
set.seed(123)
train_index <- sample(1:nrow(mtcars), 0.7 * nrow(mtcars))
train_data <- mtcars[train_index, ]
test_data <- mtcars[-train_index, ]
3.2 Train the Random Forest Model
We now train a random forest regression model to predict the mpg variable.
# Set up the training control
train_control <- trainControl(method = "cv", number = 10)
# Train the regression model
set.seed(123)
model <- train(mpg ~ ., data = train_data,
               method = "rf",
               trControl = train_control,
               tuneLength = 3,
               preProcess = c("center", "scale"))
# Print the model details
print(model)
Random Forest
22 samples
10 predictors
Pre-processing: centered (10), scaled (10)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 19, 20, 20, 20, 20, 20, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 2.796874 0.9759378 2.437864
6 2.636784 0.9765995 2.276657
10 2.627549 0.9745273 2.270757
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 10.
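Here tuneLength = 3 let caret pick the candidate mtry values itself (2, 6, and 10 above). The grid can also be specified explicitly through the tuneGrid argument; a sketch assuming the same train_control:

# Equivalent fit with an explicit tuning grid instead of tuneLength
set.seed(123)
model_grid <- train(mpg ~ ., data = train_data,
                    method = "rf",
                    trControl = train_control,
                    tuneGrid = expand.grid(mtry = c(2, 6, 10)),
                    preProcess = c("center", "scale"))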
3.3 Evaluate the Model
Evaluate the model’s performance using the Root Mean Squared Error (RMSE) on the test data.
# Make predictions on the test data
predictions <- predict(model, test_data)

# Calculate the RMSE (Root Mean Squared Error) of the model
RMSE <- sqrt(mean((predictions - test_data$mpg)^2))
cat("RMSE:", RMSE)
RMSE: 2.003634
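As an optional cross-check on the manual calculation above, caret's postResample() computes RMSE, R-squared, and MAE together:

# RMSE, R-squared, and MAE in one call
postResample(pred = predictions, obs = test_data$mpg)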
4 Conclusion
In this lab, we demonstrated how to use the caret package to train and evaluate Random Forest models for both classification and regression tasks. By splitting the data, setting up cross-validation, and using caret’s train() function, we efficiently built models for both the iris and mtcars datasets. The results demonstrate the utility of Random Forests for predictive tasks.
While Random Forest copes well with complex data and is relatively resistant to overfitting, alternative methods like KNN and GLMNet offer their own benefits: KNN is intuitive and works well for small datasets, while GLMNet applies regularization to manage high-dimensional data. Understanding these models lets us choose the approach best suited to the problem's requirements and the dataset's structure.
5 References
- Kuhn, Max. 2008. "Building Predictive Models in R Using the caret Package." Journal of Statistical Software 28 (5): 1-26.
- Kuhn, Max, et al. 2020. "Package 'caret'." CRAN reference manual.
- caret Documentation: https://topepo.github.io/caret/