21 May 2022

Using Random Forest models in Soil Near Infrared Analysis (part 1)

To check the performance of a model, we must randomly split the whole data set into a training set and a test set. The test set is kept apart and is not used for any decision about the model. So, what is the best way to tune the model?
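The post does not show this step, but a minimal sketch of the random split could look like this (the soil_data object and the 75/25 proportion are just illustrative, assuming the rsample package):

# Illustrative train/test split (object names and proportion are assumptions)
library(rsample)

set.seed(123)                                       # make the split reproducible
soil_split    <- initial_split(soil_data, prop = 0.75)  # 75 % training, 25 % test
training_data <- training(soil_split)               # used to build and tune the model
test_data     <- testing(soil_split)                # kept apart until the final check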

One way is to split the training set, structurally or at random, into a certain number of folds, and to sequentially leave one fold out for validation while the rest is used for model development. The sequence finishes when every fold has been used once for validation. That is what I have done to check the performance of a Random Forest model on the Spanish soil samples available in the LUCAS database. In this case I split the training set into 4 folds.
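The cv_data object used in the code below is not created in the post; a possible way to build the 4 folds, assuming the rsample package and the training_data frame from the sketch above, would be:

# Illustrative creation of the 4-fold structure (not shown in the original post)
library(rsample)
library(dplyr)
library(purrr)

set.seed(42)
cv_split <- vfold_cv(training_data, v = 4)          # 4 random folds
cv_data  <- cv_split %>% 
  mutate(train    = map(splits, ~training(.x)),     # 3 folds for model building
         validate = map(splits, ~testing(.x)))      # 1 fold held out for validation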

The idea is to get 4 RMSEP values (one for every fold), and to take their mean as the RMSEP value I can expect when measuring CaCO3 in soil samples.

# Build a random forest model for each fold
library(randomForest)

cv_models_rf <- cv_data %>% 
  mutate(model = map(train, ~randomForest(formula = CaCO3 ~ ., data = .x, 
                                          ntree = 500)))

cv_prep_rf <- cv_models_rf %>% 
  mutate(
    # Extract the recorded CaCO3 for the records in the validate dataframes
    validate_actual = map(validate, ~.x$CaCO3),
    # Predict CaCO3 for each validate set using its corresponding model
    validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y))
  )
library(Metrics)
# Calculate validate RMSE for each fold
cv_eval_rf <- cv_prep_rf %>% 
  mutate(validate_rmse = map2_dbl(validate_actual, 
         validate_predicted, ~rmse(actual = .x, predicted = .y)))
# Print the validate RMSE for each fold
cv_eval_rf$validate_rmse
[1] 57.52606 56.37327 53.88422 62.66018

# Mean RMSE across the 4 folds
mean(cv_eval_rf$validate_rmse)
[1] 57.61093

The results are quite good, but there is room for improvement by changing the number of trees and other tuning parameters.
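As a rough sketch (not the tuning done in this post), the same folds could be reused to compare several values of ntree and keep the one with the lowest mean RMSE; the grid values below are only illustrative:

# Illustrative tuning of the number of trees over the same 4 folds
ntree_grid <- c(100, 500, 1000)

mean_rmse_by_ntree <- map_dbl(ntree_grid, function(nt) {
  cv_data %>% 
    mutate(model = map(train, ~randomForest(CaCO3 ~ ., data = .x, ntree = nt)),
           validate_actual    = map(validate, ~.x$CaCO3),
           validate_predicted = map2(model, validate, ~predict(.x, .y)),
           validate_rmse      = map2_dbl(validate_actual, validate_predicted,
                                         ~rmse(actual = .x, predicted = .y))) %>% 
    pull(validate_rmse) %>% 
    mean()
})

tibble(ntree = ntree_grid, mean_rmse = mean_rmse_by_ntree)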
