The “caret” package includes a data set called “tecator”, with data recorded on an Infratec analyzer for meat samples. The book “Applied Predictive Modeling” uses it as an exercise in the chapter “Linear Regression and Its Cousins”, so I'm going to use it in this post and in some coming ones.
When we develop a PLS equation with the function “plsr” from the “pls” package, the fitted object contains several components. One of them is “validation” (present when cross-validation is requested), a list that holds the cross-validated predictions of the training samples for each number of terms. Statistics calculated from these values help us choose the best model, so that we do not overfit it.
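As a minimal sketch of where these values live, the code below fits a PLS model with cross-validation and inspects the “validation” component. Several things here are my assumptions and not taken from the post: fat (the second column of “endpoints”) as the constituent, the object names “tec.dt” and “model”, the maximum of 10 terms, and the seed.

```r
library(caret)  # provides the tecator data
library(pls)    # provides plsr() and RMSEP()

data(tecator)   # loads 'absorp' (spectra) and 'endpoints' (moisture, fat, protein)

# Assumption: fat is the constituent being modelled.
# I() keeps the spectra as a single matrix column inside the data frame.
tec.dt <- data.frame(fat = endpoints[, 2], NIR.dt = I(absorp))

set.seed(1)     # cross-validation segments are random
model <- plsr(fat ~ NIR.dt, ncomp = 10, data = tec.dt, validation = "CV")

# Cross-validated predictions: samples x responses x number of terms
dim(model$validation$pred)

# Cross-validated RMSEP for each number of terms
RMSEP(model, estimate = "CV")
```

The number of terms at which the cross-validated error stops improving is a reasonable first candidate for the final model.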
Even so, keeping a random set apart for validation helps us make a better decision about the number of terms. For this we can use the function “createDataPartition” from the “caret” package. Calling “predict” with the developed model and the external validation set gives us the predictions for that set. Comparing these values with the reference values, we obtain the RMSE (using the “RMSE” function), and with it we can decide how many terms to use in the model that finally goes into routine.
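The steps above can be sketched as follows. Only “test.dt” and “NIR.dt” appear in the original code of this post; the choice of fat as the constituent, the 75/25 split, the seed, the maximum of 10 terms and the remaining object names are assumptions made for illustration.

```r
library(caret)  # createDataPartition(), RMSE(), tecator data
library(pls)    # plsr() and its predict() method

data(tecator)
# Assumption: fat (second column of 'endpoints') is the constituent.
tec.dt <- data.frame(fat = endpoints[, 2], NIR.dt = I(absorp))

set.seed(1)
# Random split, stratified on the reference values (assumed 75 % training)
in.train <- createDataPartition(tec.dt$fat, p = 0.75, list = FALSE)
train.dt <- tec.dt[in.train, ]
test.dt  <- tec.dt[-in.train, ]

model <- plsr(fat ~ NIR.dt, ncomp = 10, data = train.dt, validation = "CV")

# RMSE on the external validation set for 1 to 10 terms
rmse.test <- sapply(1:10, function(k) {
  pred <- predict(model, ncomp = k, newdata = test.dt$NIR.dt)
  RMSE(as.numeric(pred), test.dt$fat)
})
round(rmse.test, 2)
```

Looking at “rmse.test” next to the cross-validation error of the training set shows whether both agree on the number of terms.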
We normally prefer plots to judge the performance of a model, but it is the statistics (the numbers) that really tell us how good that performance is.
Using 5 terms (which seems to be the best option), we get the predictions for the training set and, in addition, the predictions for the external validation set by passing newdata = test.dt$NIR.dt to “predict”.
With these values we can plot the performance of the model.
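A minimal sketch of the 5-term predictions and of a predicted-versus-reference plot is shown below. It is not the original code of this post: fat as the constituent, the 75/25 split, the seed, the colours and the object names other than “test.dt” and “NIR.dt” are assumptions.

```r
library(caret)
library(pls)

data(tecator)
# Assumption: fat (second column of 'endpoints') is the constituent.
tec.dt <- data.frame(fat = endpoints[, 2], NIR.dt = I(absorp))

set.seed(1)
in.train <- createDataPartition(tec.dt$fat, p = 0.75, list = FALSE)
train.dt <- tec.dt[in.train, ]
test.dt  <- tec.dt[-in.train, ]

model <- plsr(fat ~ NIR.dt, ncomp = 10, data = train.dt, validation = "CV")

# Predictions with 5 terms for the training and external validation sets
train.pred <- as.numeric(predict(model, ncomp = 5, newdata = train.dt$NIR.dt))
test.pred  <- as.numeric(predict(model, ncomp = 5, newdata = test.dt$NIR.dt))

# Predicted versus reference values, with the ideal 1:1 line
plot(train.dt$fat, train.pred,
     xlab = "Reference", ylab = "Predicted", main = "PLS, 5 terms")
points(test.dt$fat, test.pred, col = "red", pch = 19)
abline(0, 1)
legend("topleft", legend = c("Training", "External validation"),
       col = c("black", "red"), pch = c(1, 19))

# The numbers behind the plot
RMSE(train.pred, train.dt$fat)
RMSE(test.pred, test.dt$fat)
```

Points that drift away from the 1:1 line in one part of the range suggest a bias there, while isolated points far from the line are outlier candidates.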
Because this parameter covers a wide range, a plot of predicted against reference values can show some areas of the range with bias, others with more random noise, and others with outliers.