Some of the constituent values are zero, so these samples must not be considered during the calibration. With a long data set we can look at the minimum and maximum values, and if the minimum is zero we can remove the samples with this value before developing the quantitative models.
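The zero check described above can be sketched in base R as follows. The data frame `soy_demo` and its values are made up for illustration only; the same idea would apply to each constituent data frame.

```r
# Hypothetical stand-in for one constituent data frame (values are invented)
soy_demo <- data.frame(
  Sample = 1:6,
  Prot   = c(35.2, 0, 36.8, 34.1, 0, 37.5)
)

# A minimum of zero tells us some samples have no reference value
range(soy_demo$Prot)

# Keep only the samples with a non-zero reference value
soy_demo_nz <- soy_demo[soy_demo$Prot != 0, ]

# The minimum is now a real protein value
range(soy_demo_nz$Prot)
```

If the column could also contain `NA`, the filter would need an extra `!is.na()` condition.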
Here are the histograms of the data sets without the samples with zeros.
We can fit a PLS regression with all the samples and afterwards remove the samples that are clearly outliers. For protein, the PLS regression would be:
library(pls)  # provides plsr()
Prot_plsr <- plsr(Prot ~ X_msc,
                  ncomp = 16, data = soy_ift_prot1,
                  validation = "LOO")
where we use LOO (leave-one-out) cross-validation.
The LOO cross-validation helps us decide the best number of terms for the regression. We can look at the validation plot, where we can see how the RMSEP decreases as the number of terms increases. Beyond a certain number of PLS terms the RMSEP stays stable or even increases, so we must not choose more terms than necessary in order not to overfit the model.
plot(Prot_plsr, "validation", estimate = "CV")
If we look at the regression summary,
we can see that the best number of terms for the regression is nine.
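The number of terms can also be read from the cross-validated RMSEP programmatically. The sketch below is only an illustration: it uses the `gasoline` data set that ships with the pls package instead of the soy data, and the names `demo_fit`, `rmsep_cv`, `n_min` and `n_onesigma` are assumptions introduced here.

```r
library(pls)

# Example NIR data shipped with pls (not the soy data of this post)
data(gasoline)

demo_fit <- plsr(octane ~ NIR, ncomp = 10, data = gasoline,
                 validation = "LOO")

# Cross-validated RMSEP; the first element is the intercept-only model
rmsep_cv <- drop(RMSEP(demo_fit, estimate = "CV")$val)

# Number of terms with the lowest RMSEP (subtract 1 for the intercept)
n_min <- unname(which.min(rmsep_cv)) - 1L

# More parsimonious choice: the smallest model within one standard
# error of the best one
n_onesigma <- selectNcomp(demo_fit, method = "onesigma", plot = FALSE)

n_min
n_onesigma
```

The absolute minimum often sits at a high number of terms, so a rule such as "onesigma" is one way to avoid adding terms that give no real improvement.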
Let's see the statistics in an XY plot; for that I am going to use a Monitor function I developed in R some time ago.
As we can see, we must remove some outliers, which are outside the action limit (numbers in red), and decide what to do with the samples that are outside the warning limit (numbers in orange).
The Monitor function sets those samples apart, so we can remove them from the data frame and recalculate.
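The internals of the Monitor function are not shown here, so the following is only a rough sketch of the remove-and-recalculate step, not that function. It assumes warning and action limits of 2 and 3 times the standard deviation of the cross-validated residuals, and it uses the `gasoline` data from the pls package; all object names are hypothetical.

```r
library(pls)

# Example NIR data shipped with pls (not the soy data of this post)
data(gasoline)

demo_fit <- plsr(octane ~ NIR, ncomp = 5, data = gasoline,
                 validation = "LOO")

n_terms <- 3  # assumed number of terms for this illustration

# Cross-validated predictions and residuals for the chosen model
cv_pred <- demo_fit$validation$pred[, 1, n_terms]
cv_res  <- gasoline$octane - cv_pred
res_sd  <- sd(cv_res)

# Assumed limits: warning at 2 SD, action at 3 SD
warning_idx <- which(abs(cv_res) > 2 * res_sd & abs(cv_res) <= 3 * res_sd)
action_idx  <- which(abs(cv_res) > 3 * res_sd)

# Drop the samples outside the action limit and recalculate
keep           <- abs(cv_res) <= 3 * res_sd
gasoline_clean <- gasoline[keep, ]
demo_refit <- plsr(octane ~ NIR, ncomp = 5, data = gasoline_clean,
                   validation = "LOO")
```

Samples listed in `warning_idx` are kept here; whether to remove them is a judgment call that depends on what is known about each sample.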