One way is to split the data, structurally or randomly, into a certain number of folds, then sequentially leave one fold out for validation and use the rest for model development. The sequence finishes when every fold has been used once for validation.
That is what I have done to check the performance of a Random Forest model on the Spanish soil database available from LUCAS.
In this case I split the training set into 4 folds.
The idea is to get 4 RMSEP values (one for every fold) and take their mean as the RMSEP I can expect when measuring CaCO3 in soil samples.
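The `cv_data` object used below is not shown in this post; a minimal sketch of how such a folded data frame could be built with the `rsample` package follows (the name `soil_data` is a placeholder for the LUCAS training set):

```r
library(rsample)
library(dplyr)
library(purrr)

# Placeholder: soil_data holds the LUCAS training set (spectra + CaCO3)
cv_split <- vfold_cv(soil_data, v = 4)

# One row per fold, with list-columns holding the train and validate frames
cv_data <- cv_split %>%
  mutate(train    = map(splits, ~training(.x)),
         validate = map(splits, ~testing(.x)))
```

With this layout, each of the 4 rows carries its own training and validation data frame, which is what the `map` calls below iterate over.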
# Build a random forest model for each fold
library(randomForest)
cv_models_rf <- cv_data %>%
  mutate(model = map(train, ~randomForest(formula = CaCO3 ~ ., data = .x,
                                          ntree = 500)))
cv_prep_rf <- cv_models_rf %>%
  mutate(
    # Extract the recorded CaCO3 for the records in the validate data frames
    validate_actual = map(validate, ~.x$CaCO3),
    # Predict CaCO3 for each validate set using its corresponding model
    validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y))
  )
library(Metrics)
# Calculate the validation RMSE for each fold
cv_eval_rf <- cv_prep_rf %>%
  mutate(validate_rmse = map2_dbl(validate_actual, validate_predicted,
                                  ~rmse(actual = .x, predicted = .y)))
# Print the validate_rmse column (one RMSE per fold)
cv_eval_rf$validate_rmse
[1] 57.52606 56.37327 53.88422 62.66018
# Mean RMSE across the 4 folds
mean(cv_eval_rf$validate_rmse)
[1] 57.61093
The results are quite good, but there is room for improvement by changing the number of trees and other tuning parameters.
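As a sketch of that tuning, the same folds could be reused over a small grid of `mtry` values (the candidate values below are hypothetical, and `cv_data` is the folded data frame built earlier):

```r
library(randomForest)
library(dplyr)
library(purrr)
library(tidyr)
library(Metrics)

# Hypothetical tuning sketch: one row per (fold, mtry) combination
cv_tune_rf <- cv_data %>%
  crossing(mtry = c(2, 5, 10)) %>%
  mutate(
    model = map2(train, mtry,
                 ~randomForest(CaCO3 ~ ., data = .x,
                               mtry = .y, ntree = 500)),
    validate_rmse = map2_dbl(model, validate,
                             ~rmse(actual = .y$CaCO3,
                                   predicted = predict(.x, .y)))
  )

# Mean cross-validated RMSE for each candidate mtry
cv_tune_rf %>%
  group_by(mtry) %>%
  summarise(mean_rmse = mean(validate_rmse))
```

The `mtry` value with the lowest mean RMSE would then be used to refit the final model on the whole training set; the same pattern works for `ntree` or any other parameter.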