19 mar. 2018

Understanding PLS Regression in a diagram

This is the first part of a diagram showing how some of the internal parts of the algorithm work. The diagram still needs to grow with new steps, adding new terms (this one covers just the first term) and the matrices "F" and "q".
The idea is to link all these diagrams to the Win ISI diagram.
Everything starts with my previous posts:

How the Weights are calculated in PLS?
How the scores are calculated in PLS?
How the loadings are calculated in PLS?

18 mar. 2018

How the Weights are calculated in PLS

We have seen how the scores and loadings are calculated when developing a PLS regression, but what about the weights?
To calculate the X scores and the loadings we need the weights, in particular the first one (to start the whole process). After this calculation the scores are obtained as projections of the spectra on the weights, and once we have the scores we calculate the loadings.

The first weight is calculated by a Singular Value Decomposition (SVD) of the cross product of X (centered and transposed) and Y, X being the spectra matrix and Y the constituent matrix.

When we do this we get (as we saw in other posts): "u", "d" and "v".
The first weight is the vector of values in "u".

Xt<-t(X_msc_centered[odd,])  # 100 x 327
Y<-as.matrix(Prot[odd,])     # 327 x 1
Xt.Y<-Xt%*%Y                 # 100 x 1
svd.XtY<-svd(Xt.Y)           # SVD of the cross product
dim(svd.XtY$u)               # 100 x 1
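The blog works in R, but the algebra is easy to check in any language. With a single Y column (the PLS1 case), the SVD of X'Y reduces to normalising X'Y to unit length. A minimal NumPy sketch with synthetic data of the same sizes as the post (all variable names here are illustrative, not from the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(327, 100))   # 327 samples x 100 wavelengths (synthetic)
X = X - X.mean(axis=0)            # column-centre the spectra
y = rng.normal(size=(327, 1))    # one constituent (e.g. protein)
y = y - y.mean()

XtY = X.T @ y                     # cross product, 100 x 1
u, d, vt = np.linalg.svd(XtY, full_matrices=False)
w1 = u[:, [0]]                    # first weight = first left singular vector

# With a single y column this is just X'y scaled to unit length
# (the sign of an SVD vector is arbitrary, hence the abs comparison):
w1_direct = XtY / np.linalg.norm(XtY)
print(np.allclose(np.abs(w1), np.abs(w1_direct)))  # -> True
```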

We continue in next post with the reconstruction of all the PLS process.

17 mar. 2018

How the loadings are calculated in PLS?

After seeing "How the scores are calculated in PLS" (in particular the scores for the first term), I now follow with "How the loadings are calculated in PLS?" (and in particular the first loading).

Loadings are calculated as regressions of every column of X on T, so in this case we regress the columns of X on "t1_pls" (calculated in "How the scores are calculated in PLS?"). We do it in matrix form, scaling the result by t(t1_pls) %*% t1_pls, which is a scalar.

Xt.t1_pls<-t(X_msc_centered[odd,]) %*% t1_pls
p1<-Xt.t1_pls / as.numeric(t(t1_pls) %*% t1_pls)   # first loading

Now we can plot the reconstructed loading with this code, together with the first loading we get from the "plsr" function of the pls package:

plot(p1, type="l", col="red", xlab="Wavelengths", ylab="Loadings Int.")
lines(Xodd_pls3$loadings[,1], lty=3)   # first loading from the plsr model

and when we compare both vectors we get TRUE for all the values.
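The scaling step can also be checked numerically: dividing X't1 by t1't1 is exactly the least-squares regression coefficient of each X column on t1. A NumPy sketch with synthetic data (names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(327, 100))
X = X - X.mean(axis=0)            # centred spectra
w1 = rng.normal(size=(100, 1))
w1 /= np.linalg.norm(w1)          # unit-length first weight
t1 = X @ w1                       # scores for the first term

# First loading: regression coefficient of each X column on t1
p1 = (X.T @ t1) / float(t1.T @ t1)

# Equivalent least-squares view: p1[j] minimises ||X[:, j] - t1 * p||
p1_ls = np.linalg.lstsq(t1, X, rcond=None)[0].T
print(np.allclose(p1, p1_ls))     # -> True
```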

How the scores are calculated in PLS

The first step in a PLS regression is to calculate the weights. As we said, there is one weight vector for every term of the regression, and we get them in the pls model element $weights; so, for the regression of the previous post, the first weight is:

Now we have to project our sample spectra onto this first weight to get the scores, which we store in a vector "t_1"; with all the vectors (t_1, t_2, ...., t_a) we build the "T" matrix of PLS scores.

Projecting the samples is a matrix multiplication:

 t_1 = X · w1

Where X is the spectra matrix treated with MSC and centered and w1 is the first weight. 

As you know, to multiply matrices the number of columns of the first matrix must be the same as the number of rows of the second matrix.
So let's check if our calculated values are the same as the PLS regression scores for the first PLS term:

dim(Xodd_pls3_w1)               # 100 x 1
dim(X_msc_centered[odd,])       # 327 x 100
t1_pls<-X_msc_centered[odd,] %*% Xodd_pls3_w1     # 327 x 1
round(t1_pls, 8) == round(Xodd_pls3$scores[,1], 8)  # compare with the plsr scores
#We get TRUE for all the values
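The dimension bookkeeping above (columns of X must match rows of w1) can be sketched in NumPy with synthetic sizes matching the post:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(327, 100))   # 327 samples x 100 wavelengths
w1 = rng.normal(size=(100, 1))    # first weight, one value per wavelength
w1 /= np.linalg.norm(w1)

# columns of X (100) match rows of w1 (100), so the product is defined
t1 = X @ w1
print(t1.shape)                   # -> (327, 1): one score per sample
```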

13 mar. 2018

PLS regression (Loadings and Weights)

I have talked about loadings in many posts, but the weights are also very important and give a lot of information, and I have not talked about them yet.
There are as many weights as loadings (one for every term)..... But how do they look? Are they similar to the loadings? How do they contribute to the model?
Surely there are many questions about the weights that I will try to understand and share with you in the next posts.
For the moment, see the first four loadings and weights compared in 4 plots with the soy meal data from the conveyor:

12 mar. 2018

Thanks for the interest in the Monitor function

Hi all, and thanks for following the posts and for your interest. I see that some of you ask where the monitor function I use in some of the posts is. I have been developing this function for a long time; I had a gap, but now I am testing it quite often and trying to make it more robust.
There is a label called Monitor Function, where I explain some of the things I found while developing it. I hope to have it ready soon to share with the R community interested in it.
Here are the four plots it gives in general; the function can also be split in four to give just one plot at a time.

PCR vs. PLS (part 7 - cont(b))

Apart from the odd and even sample sets, I have another sample set which is fully independent from the others (samples from different origins, lab values from two different labs, and spectra acquired on a different instrument); I call it "Test". We can over-plot its RMSEP for the first 10 terms of the PLS and PCR regressions, so this way we get an idea of the performance of the calibrations on the new instrument, to which we want to transfer the calibration without using any type of standardization.
So we have to add some code to the previous post and change the legend:




legend("topright", legend=c("RMSECV odd PCR",
       "RMSECV odd PLS", "RMSEP even PCR",
       "RMSEP even PLS", "RMSEP test PCR",
       "RMSEP test PLS"),
       col=1:6, lty=1)   # colours and line types assumed; match them to the six curves


Finally we get this plot which is very useful to see the performance of the models:
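The RMSEP plotted above is simply the root mean square of the prediction errors on a validation set. A minimal sketch in Python (toy lab values and predictions, just for illustration):

```python
import numpy as np

def rmsep(y_ref, y_pred):
    """Root Mean Square Error of Prediction."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_ref - y_pred) ** 2)))

# toy lab reference values vs model predictions
print(rmsep([10.0, 12.0, 11.0, 13.0], [10.5, 11.5, 11.0, 13.5]))  # -> 0.4330...
```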

11 mar. 2018

PCR vs. PLS (part 7 - cont(a))

This post continues with the plots we saw in PCR vs. PLS (part 7), where we looked at the cross validation error for the training samples (odd samples from the soy meal) with the models developed with PCR and PLS.
Now we want to check how these models perform on the even samples that we placed apart in a test set, so we overlap the RMSEP statistics for that set (for the PLS and PCR models).

#from: PCR vs. PLS (part 7)
## add extra space to right margin of plot within frame
par(mar=c(5, 4, 4, 6) + 0.1)
plot(Xodd_pls3, "validation", estimate="CV",
     main=" Corr & RMSEP PCR vs PLS")
## New code added
legend("bottomleft", legend=c("PCR-RMSECV train set",
       "PLS-RMSECV train set", "PCR-RMSEP even set",
       "PLS-RMSEP even set"),
       col=1:4, lty=1)   # colours and line types assumed; match them to the four curves


As we can see, the model with PLS performs better for the even sample set and also in cross validation.

7 mar. 2018

Variable Selection (PLS_Toolbox option)

See how the statistics can be improved (for this beer data set) with the variable selection option. Normally we tend to always use the full-spectrum options, but there are cases where we can get better validation statistics with variable selection. Doing this manually would be a lot of trial-and-error work, so an algorithm to find the variables is really useful.
I really like all the plots and figures of this software, and they can give some ideas when using R. The software seems to have a lot of chemometric tools, so it is interesting to look at other videos about it. You are welcome to give your comments if you use this software.

5 mar. 2018

PCR vs. PLS (Part 11) - Independent Validation

One of the interesting things about developing a calibration for a particular application (meat meal in this case) is to install it to work in routine and to wait a long period of time to have a validation set which is really independent, so we can get a true statistic for the application error.
If you have followed the previous posts, the calibration for meat meal was developed on a DA1650, and we used only the odd samples from the set as the training set, leaving the even samples for validation, so the calibration has almost 650 samples out of the 1300 available. The constituent we are interested in calibrating is ash, but of course the others, such as protein, fat and moisture, are interesting too.
We try to develop a robust regression, so the number of terms is important and we don't want to clean out the outlier samples too much.
The range is quite wide, because we have samples from different sources (instruments, species, lab reference values,....).
In the previous post we saw the statistics, and they could seem quite high, but let's see the validation.
The validation set is from one of the sources (one of the DA instruments, mixed meat meal, and one of the labs). The calibration samples from this source are previous to 2017, while the validation samples were acquired during 2017 and the first months of 2018. So this is an interesting set to check.
We will use the calibration (as I said) from the previous post, which uses 15 terms.
In gray we see the XY plot of the cross validation for the training set, and in blue how the new validation samples fit the regression line.

legend("topleft", legend=c("Odd 15 terms",
       "VAL-2017-MIX", "VAL-2017-MIX-RMSEP 15t=1,47"),
       col=c("grey", "blue", "blue"), pch=c(1, 1, 1), cex=1)
It is working well in routine, and I am quite happy with the result.