I´ve been working these days with PLS2 calibrations with a chemometric software called “Unscrambler” with a data set called “jam”. I said “can I develop PLS2 models with R?”.I look in the book “Introduction to Multivariate Statistical Analysis in Chemometrics”, and I got the response “Yes, we can”.
I have other posts for PLS regressions, but it is PLS1, where we have an X matrix (spectra) and we make a regression for one constituent of the Y matrix at a time. What about to make the regression for all the constituents at the same time using the whole Y matrix?. That is the purpose of PLS2.
PLS2 is recommended when there is a high correlation between the constituents.
library(chemometrics)
data(cereal)
This data is part of a set used by Varmuza et al. 2008, for other papers.
You can get a description for this data in the R help page:
Description
For 15 cereals an X and Y data set, measured on the same objects, is available. The X data are 145 infrared spectra, and the Y data are 6 chemical/technical properties (Heating value, C, H, N, Starch, Ash). Also the scaled Y data are included (mean 0, variance 1 for each column). The cereals come from 5 groups B=Barley, M=Maize, R=Rye, T=Triticale, W=Wheat.
Once loaded, take a look to the data
dim(cereal$X)
dim(cereal$Ysc)
We can have a look to the spectra, (it is already treated with SG and first derivative).
wavelengths<-seq(1126,2278,by=8)
matplot(wavelengths,t(cereal$X),lty=1,xlab="wavelengths(nm)",ylab="log(1/R)")
Now let´s run PLS2, using “mvsr”, with “LOO” (leave one out) cross validation.
cerpls2<-mvr(Ysc~X,data=cereal,method="simpls",validation="LOO")
We can see a summary of the results:
summary(cerpls2)
Now we have to take an important decision, “How many terms to choose?”.
Plots can help us with it:
plot(RMSEP(cerpls2), legendpos = "topright")
We have to select an average, and looking to the plots we can say that 7 is fine, anyway for "starch" less terms would be fine, but for the rest 6 or 7 is the correct number.