15 feb. 2018

PCR vs. PLS (part 3)

A common way to develop the regressions with PCR or PLS is with "Cross Validation", where we divide the training data set into groups that you can select depending of the number of samples: In the case you have few samples you can select a high (10 for example) value and in the case you have a lot of samples, you can select a low value (for example 4). One group is keep out and the regression is developed with all the others. We use the group keep outside to validate repeating the process as many times as validation groups.
So Iin the case of 4 groups), we obtain 4 validation statistic values as RMSEP, RSQ, Bias,... for every group we use in the development of the regression, so we can see from which term the RMSEP start to become higher, or the RMSEP values are almost flat and make not sense to add more terms to the regression.
Really the cross validation help us to avoid over-fitting, but it helps also to detect outliers (good or bad outliers we can say).
It is interesting that the samples in the training set have a certain way of sorting or randomnes . We don´t want that there are similar samples in the training and validation set, but this require inspection of the spectra previous to the development of the regression looking for neighbors, or other sources of redundancy. In the case there are similar samples we want that they stay if possible in the same  group.
In the development of the PCR, we van select the Cross Validation option and also the number of segments or groups for validation, in this case 10:

Xodd_pcr3<-pcr(Prot[odd,] ~ X_msc[odd,],validation="CV",


If we see the plot of the RMSEP for the 30 terms, we see that the in the fourth the RMSEP decrease dramatically, but after 10 it does make not sense to use those terms, so we have to select 10 or less.
An external validation (if possible with more independent samples) can help us with the decision.
A particular case of Cross Validation is the Leave One Out validation, where a sample unique sample is keep out for validation and the rest stay for the calibration. The process is repeated until all the samples have been part of the validation process. So in this case there are the same numbers of segments or groups than number of samples. This process is quite interesting when we have few samples in the training set.

About the X-Y plots (Predicted vs Reference values), it is important to interpret them in the case of cross validation. There are plots which show the predicted values vs the reference values when all the samples are part of the regression (blue dots) and those plots are not realistic, so it is better to see the plot, when every dot (sample), has a value when it is not part of the model because it is  in the validation set (red dots), this way we have a better idea of the performance of the regression for future samples.

This is not a nice tutorial set where the samples fit well, but it is what you often fine in several applications where you want to try to get the best model for a certain product parameter and instrument.

No hay comentarios:

Publicar un comentario