24 May 2022

Using Random Forest models in Soil Near Infrared Analysis (part 3)

Once we have tuned the model with the cross validation and found the best value for "mtry" with the batch process, we can develop the final model we will use in routine and check its performance with the test set we have left apart. That is what we will do in part 4, in the next post.

In this one I show the code for the model and the plot of the importance of every predictor variable (wavelength) in the model. 

I compare the importance scores, obtained from the model trained on the SG second-derivative spectra, with the raw calcite spectrum.


library(randomForest)
library(caret)   # for varImp()

# Fit the random forest on the SG second-derivative training spectra,
# using the best mtry value found in the cross-validation tuning
CaCO3_rf_NIRfit <- randomForest(CaCO3 ~ ., data = CaCO3spcSG_train,
                                importance = TRUE, ntree = 500,
                                mtry = 28)

# Importance of every predictor variable (wavelength)
rfImp <- varImp(CaCO3_rf_NIRfit, scale = FALSE)

# Plot the importance scores against the wavelength axis (1110-2488 nm, 2 nm step)
matplot(seq(1110, 2488, 2), rfImp, type = "l", ylab = "Importance",
        xlab = "wavelengths", col = "blue", lwd = 2)
par(new = TRUE)
# Overplot the raw calcite spectrum on the same wavelength range
matplot(seq(1110, 2488, 2), calcite_spectrum_2nm[356:1045, ], type = "l",
        xlab = " ", ylab = " ", yaxt = "n", col = "red")
# Add legend to plot
legend("topleft",
       legend = c("Importance Scores", "Calcite spectrum"),
       col = c("blue", "red"),
       lty = 1)




23 May 2022

Using Random Forest models in Soil Near Infrared Analysis (part 2)

In this post I continue from where I left off in "Using Random Forest models in Soil Near Infrared Analysis (part 1)", where we developed a Random Forest model for carbonate (CaCO3) in soil. Once we have the models and predictions for all the folds, we can plot the actual versus predicted values for each fold:
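
A minimal sketch of one way to draw these plots with base graphics, assuming the cv_prep_rf object from part 1 (with its validate_actual and validate_predicted list-columns, and the fold labels in the id column of the rsample folds) is still in the workspace:

# Actual vs predicted CaCO3 for each of the 4 folds (one panel per fold)
par(mfrow = c(2, 2))
for (i in seq_len(nrow(cv_prep_rf))) {
  plot(cv_prep_rf$validate_actual[[i]], cv_prep_rf$validate_predicted[[i]],
       xlab = "Actual CaCO3", ylab = "Predicted CaCO3",
       main = cv_prep_rf$id[i], col = "blue")
  abline(0, 1, col = "red")   # 1:1 reference line
}
par(mfrow = c(1, 1))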



Looking at these plots, the cross validation also helps us spot possible outliers.

Can we improve the model? Yes, why not! We just need a batch process to tune the hyper-parameters (in this case the "mtry" argument). Let's build a tuning sequence from 2 to 30:

library(tidyr)   # for crossing()
# Cross every fold with each candidate value of mtry
cv_tune <- cv_data %>%
  crossing(mtry = 2:30)

Fitting the models for every mtry value, we get the mean RMSE values across the folds. The smallest value is for mtry = 28, and after that the RMSE starts increasing.
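
As a sketch (assuming the same map/map2 pattern used in part 1 and ntree = 500; the exact code may differ), the batch of models and the mean RMSE per mtry can be obtained like this:

library(randomForest)
library(dplyr)
library(purrr)
library(Metrics)

# Fit a random forest for every fold / mtry combination
cv_model_tune <- cv_tune %>%
  mutate(model = map2(train, mtry,
                      ~randomForest(CaCO3 ~ ., data = .x,
                                    mtry = .y, ntree = 500)))

# Predict every validate fold and compute its RMSE
cv_eval_tune <- cv_model_tune %>%
  mutate(
    validate_actual    = map(validate, ~.x$CaCO3),
    validate_predicted = map2(model, validate, ~predict(.x, .y)),
    validate_rmse      = map2_dbl(validate_actual, validate_predicted,
                                  ~rmse(actual = .x, predicted = .y))
  )

# Mean RMSE across the 4 folds for every value of mtry
cv_eval_tune %>%
  group_by(mtry) %>%
  summarise(mean_rmse = mean(validate_rmse))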

21 May 2022

Using Random Forest models in Soil Near Infrared Analysis (part 1)

To check the performance of a model, we must split the whole set into randomly selected training and test data sets. We must set the test set apart and not use it for any decision about the model. So, which is the best way to tune the model?

One way is to split the training set, structurally or randomly, into a certain number of folds, sequentially leaving one fold apart for validation and using the rest for model development. The sequence finishes when every fold has been used for validation. That's what I have done to check the performance of a Random Forest model on the Spanish soil database available from LUCAS. In this case I used 4 folds for the training set.

The idea is to get 4 RMSEP values (one for every fold) and take their mean as the RMSEP value I can expect when measuring CaCO3 in soil samples.
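
As a reference, a 4-fold cv_data object with "train" and "validate" list-columns can be built, for example, with the rsample package. This is only a minimal sketch; the seed and the exact calls are illustrative, and the training data frame name (CaCO3spcSG_train) is the one used in part 3:

library(rsample)
library(dplyr)
library(purrr)

set.seed(123)                                  # illustrative seed for reproducible folds
cv_split <- vfold_cv(CaCO3spcSG_train, v = 4)  # 4 folds on the training set

cv_data <- cv_split %>%
  mutate(
    train    = map(splits, ~analysis(.x)),     # data used to fit the model in each fold
    validate = map(splits, ~assessment(.x))    # held-out fold used for validation
  )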

library(randomForest)
library(dplyr)
library(purrr)

# Build a random forest model for each fold
cv_models_rf <- cv_data %>%
  mutate(model = map(train, ~randomForest(formula = CaCO3 ~ ., data = .x,
                                          ntree = 500)))

cv_prep_rf <- cv_models_rf %>%
  mutate(
    # Extract the recorded CaCO3 for the records in the validate dataframes
    validate_actual = map(validate, ~.x$CaCO3),
    # Predict CaCO3 for each validate set using its corresponding model
    validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y))
  )

library(Metrics)
# Calculate validate RMSE for each fold
cv_eval_rf <- cv_prep_rf %>%
  mutate(validate_rmse = map2_dbl(validate_actual,
                                  validate_predicted,
                                  ~rmse(actual = .x, predicted = .y)))

# Print the validate RMSE for each fold
cv_eval_rf$validate_rmse
[1] 57.52606 56.37327 53.88422 62.66018

# Mean RMSE across the 4 folds
mean(cv_eval_rf$validate_rmse)
[1] 57.61093

The results are quite good, but there is room for improvement by changing the number of trees and other tuning parameters.
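
For instance, one possible way (an untested sketch, with illustrative grid values) is to cross the folds with a grid over both mtry and ntree and fit the models with pmap():

library(tidyr)   # for crossing()
library(purrr)

# One possible tuning grid over mtry and ntree
cv_tune_grid <- cv_data %>%
  crossing(mtry = c(10, 20, 30), ntree = c(250, 500, 1000))

# Fit a random forest for every fold / mtry / ntree combination
cv_models_grid <- cv_tune_grid %>%
  mutate(model = pmap(list(train, mtry, ntree),
                      function(tr, m, nt)
                        randomForest(CaCO3 ~ ., data = tr,
                                     mtry = m, ntree = nt)))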