10 feb. 2018

PCR vs PLS (part 1)

The Principal Component Regression coefficients can be obtained via the Singular Value Decomposition of the math-treated X matrix: we obtain the scores of the samples by multiplying the "u" and "d" matrices (u %*% d is what we often call the T matrix), and then we regress the constituent (Y matrix) on the scores (used as independent variables) to obtain the regression coefficients of the scores.

After that, we multiply the regression coefficients of the scores by the loadings "v" (the P matrix) to obtain the regression coefficients of the principal component regression.
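As a minimal sketch of this "long way" in base R (the spectra and constituent here are simulated, so X, y and ncomp are illustrative, not the post's data):

```r
set.seed(1)
# Simulated stand-in for a math-treated (centered) spectra matrix and constituent
X <- scale(matrix(rnorm(50 * 10), 50, 10), center = TRUE, scale = FALSE)
y <- rnorm(50)

s <- svd(X)                                        # X = u %*% diag(d) %*% t(v)
ncomp <- 3
T_scores <- s$u[, 1:ncomp] %*% diag(s$d[1:ncomp])  # scores: T = u %*% d
q <- coef(lm(y ~ T_scores - 1))                    # regress y on the scores
b <- s$v[, 1:ncomp] %*% q                          # back to wavelength space: PCR coefficients
```

Multiplying X by b gives the same fitted values as the regression on the scores, which is a quick sanity check that the back-transformation is right.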

This is the long way to do a PCR, but the quicker option is to use the "pcr" function from the R "pls" package.

## The short way:
```
library(pls)
Xodd_pcr2 <- pcr(Prot[odd, ] ~ X_msc[odd, ], ncomp = 5)
matplot(wavelengths, Xodd_pcr2$coefficients[, , 5],
        lty = 1, type = "l", col = "red",
        xlab = "Wavelengths",
        ylab = "Regression Coefficients")
```
We can look at the summary:
```
summary(Xodd_pcr2)
```
```
Data:   X dimension: 327 100
        Y dimension: 327 1
Fit method: svdpc
Number of components considered: 5
TRAINING: % variance explained
             1 comps  2 comps  3 comps  4 comps  5 comps
X             99.516    99.87    99.97    99.99    99.99
Prot[odd, ]    3.934    15.34    19.31    65.45    68.17
```
We developed the PCR model with half of the total database, which we split in two (odd and even samples) in the post "Splitting spectral data into training and test sets".

As we can see, with just 1 principal component we explain more than 99.5% of the variance in X, but only a small fraction of the variance of the Y variable (protein in this case). We need more than just the first component (at least 4) to obtain a useful calibration, but the cost is that we may add noise and overfit the model.
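The "% variance explained" row for X in the summary can be reproduced directly from the singular values, since each component's share of the X variance is proportional to the square of its singular value. A small sketch with simulated data (not the post's spectra):

```r
set.seed(1)
# Illustrative centered data matrix standing in for the spectra
X <- scale(matrix(rnorm(30 * 6), 30, 6), center = TRUE, scale = FALSE)

d <- svd(X)$d
cum_var_x <- cumsum(d^2) / sum(d^2) * 100  # cumulative % of X variance per component
```

With real NIR spectra the first value of cum_var_x is typically very high (here it was 99.5%), because the spectra are strongly correlated across wavelengths.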

The external validation set (the even samples) will help us assess the performance of the model.
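The external validation step can be sketched end to end with the base-R PCR from above: fit on the "odd" samples, predict the held-out "even" samples, and compute the RMSEP. The data here are simulated, so X, y and the error you get are illustrative only:

```r
set.seed(1)
# Illustrative data standing in for the spectra (X) and constituent (y)
X <- matrix(rnorm(40 * 8), 40, 8)
y <- X %*% rnorm(8) + rnorm(40, sd = 0.1)
odd  <- seq(1, 40, by = 2)   # training half
even <- seq(2, 40, by = 2)   # external validation half

# Fit PCR on the odd samples only (center with the training means)
ncomp <- 4
mx <- colMeans(X[odd, ]); my <- mean(y[odd])
s  <- svd(scale(X[odd, ], center = mx, scale = FALSE))
T_scores <- s$u[, 1:ncomp] %*% diag(s$d[1:ncomp])
q  <- coef(lm(I(y[odd] - my) ~ T_scores - 1))
b  <- s$v[, 1:ncomp] %*% q

# Predict the even samples and compute the test-set error (RMSEP)
pred_even <- scale(X[even, ], center = mx, scale = FALSE) %*% b + my
rmsep <- sqrt(mean((y[even] - pred_even)^2))
```

Note that the even samples are centered with the means of the odd (training) samples: the validation set must never leak into the preprocessing of the model.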