19 mar. 2018

Understanding PLS Regression in a diagram

This is the first part of a diagram showing how some of the inner parts of the algorithm work. The diagram still needs to grow with new steps as new terms are added (this one is just for the first term), and with the matrices "F" and "q".
The idea is to link all these diagrams to the Win ISI diagram.
Everything starts with my previous posts:

How the Weights are calculated in PLS?
How the scores are calculated in PLS ?
How the loadings are calculated in PLS?

18 mar. 2018

How the Weights are calculated in PLS

We have seen how the scores and loadings are calculated when developing a PLS regression, but what about the weights?
To calculate the X scores and the loadings we need the weights, and in particular the first one (to start the whole process). After this calculation, the scores are calculated as projections of the spectra on the weights, and once the scores are calculated we calculate the loadings.

The first weight is calculated with the Singular Value Decomposition of the cross product of X (centered and transposed) and Y, X being the spectra matrix and Y the constituent matrix.

When we do this we get (as we saw in other posts): "u", "d" and "v".
The values of the first weight are in "u".

Xt<-t(X_msc_centered[odd,])  # 100x327
Y<-as.matrix(Prot[odd,])     # 327x1
Xt.Y<-Xt%*%Y                 # 100x1
svd.XtY<-svd(Xt.Y)           # SVD of the cross product
dim(svd.XtY$u) # 100x1
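
As a quick check (a minimal sketch, assuming "Xodd_pls3_w1" holds the first weight extracted from the "plsr" model of the following posts), the SVD vector and the plsr weight should match up to the sign:

w1_svd<-svd.XtY$u[,1]
all.equal(abs(as.numeric(w1_svd)), abs(as.numeric(Xodd_pls3_w1)))  # expected TRUE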

We continue in the next post with the reconstruction of the whole PLS process.

17 mar. 2018

How the loadings are calculated in PLS?

After seeing "How the scores are calculated in the PLS" (in particular the scores for the first term), now I follow with "How the loadings are calculated in the PLS?" (and in particular the first loading.

Loadings are calculated as regressions of every column of X on T, so in this case we regress X on "t1_pls" (calculated in "How the scores are calculated in PLS?"). We do it in matrix form and scale the result dividing by t(t1_pls) %*% t1_pls, which is a scalar: p1 = (Xt · t1_pls) / (t1_pls' · t1_pls).

Xt.t1_pls<-t(X_msc_centered[odd,]) %*% t1_pls
p1_pls<-Xt.t1_pls/as.numeric(t(t1_pls)%*%t1_pls)   # divide by the scalar t1't1

Now we can plot the reconstructed loading with this code and compare it with the first loading we get from the "plsr" function of the pls package:

plot(p1_pls, type="l", col="red", xlab="Wavelengths", ylab="Loadings Int.")

Comparing the reconstructed loading with the one from "plsr", we get TRUE for all the values.

How the scores are calculated in PLS

The first step in the PLS regression is to calculate the weights; as we said, there is one weight vector for every term of the regression, and we get them in the pls model object (the "loading.weights" component). So, in the case of the regression of the previous post, the first weight is:

Now we have to project our sample spectra onto this first weight to get the scores, which we store in a vector "t_1"; with all the vectors (t_1, t_2, ..., t_a) we will get the "T" matrix of PLS scores.

Projecting the samples is a matrix multiplication:

t_1 = X · w1

where X is the spectra matrix (treated with MSC and centered) and w1 is the first weight.

As you know, to multiply matrices the number of columns of the first matrix must be the same as the number of rows of the second matrix.
So let's check if our calculated values are the same as the PLS regression scores for the first PLS term:

dim(X_msc_centered[odd,])       # 327x100
dim(Xodd_pls3_w1)               # 100x1
t1_pls<-X_msc_centered[odd,] %*% Xodd_pls3_w1     # 327x1
# We get TRUE for all the values when comparing with the plsr scores
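
That comparison can be written as a minimal sketch (assuming the model object "Xodd_pls3" from the previous post):

t1_from_model<-Xodd_pls3$scores[,1]   # first column of the plsr score matrix
all.equal(as.numeric(t1_pls), as.numeric(t1_from_model))  # expected TRUE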

13 mar. 2018

PLS regression (Loadings and Weights)

I have talked about loadings in many posts, but the weights are also very important and give a lot of information, and I have not talked about them yet.
There are as many weights as loadings (one for every term)..... But... what do they look like? Are they similar to the loadings? How do they contribute to the model?
Surely there are many questions about the weights that I will try to understand and share with you in the next posts.
For the moment, see the first four loadings and weights compared in 4 plots with the soy meal data in the conveyor:

12 mar. 2018

Thanks for the interest in the Monitor function

Hi all, and thanks for following the posts and for your interest. I see that some of you ask where the monitor function I use in some of the posts is. I have been developing this function for a long time; I had a gap in time, but now I am testing it quite often and trying to make it more robust.
There is a label called Monitor Function, where I explain some things I learned while developing it. I hope to have it ready soon to share with the R community interested in it.
Here are the four plots it gives in general; the function can also be split in four to give just one plot at a time.

PCR vs. PLS (part 7 - cont(b))

Apart from the Odd and Even sample sets, I have another sample set which is fully independent from the others (samples from different origins, lab values from two different labs, and acquired in a different instrument) that I call "Test". We can over-plot its RMSEP for the first 10 terms of the PLS and PCR regressions; this way we have an idea of the performance of the calibrations in the new instrument, to which we want to transfer the calibration without using any type of standardization.
So we have to add some code to the previous post and change the legend:




legend("topright",legend=c("RMSECV odd PCR",

       "RMSECV odd PLS","RMSEP even PCR",
       "RMSEP even PLS","RMSEP test PCR",
       "RMSEP test PLS"),


Finally we get this plot, which is very useful to see the performance of the models:

11 mar. 2018

PCR vs. PLS (part 7 - cont(a))

This post continues with the plots we saw in PCR vs. PLS (part 7), where we looked at the cross validation error for the training samples (odd samples from the soy meal) with the models developed with PCR and PLS.
Now we want to check how these models perform with the even samples that we had placed apart in a test set.
So we overlap the RMSEP statistics for the test set (for the PLS and PCR models).

#from: PCR vs. PLS (part 7)
## add extra space to right margin of plot within frame
par(mar=c(5, 4, 4, 6) + 0.1)
plot(Xodd_pls3,"validation", estimate="CV",
     main=" Corr & RMSEP PCR vs PLS")
## New code added
legend("bottomleft",legend=c("PCR-RMSECV train set",
       "PLS-RMSECV train set","PCR-RMSEP even set",
       "PLS-RMSEP even set"),
       col=1:4, lty=1, cex=0.8)  # colours and styles assumed; match them to the plot


As we can see, the model with PLS performs better for the even sample set and also for the cross validation.

7 mar. 2018

Variable Selection (PLS_Toolbox option)

See how the statistics can be improved (for this beer data set) with the variable selection option. Normally we tend to always use the full spectrum, but there are cases where we can get better validation statistics with variable selection. Doing this manually would be a lot of trial-and-error work, so an algorithm to find the variables is really useful.
I really like all the plots and figures of this software, and they can give some ideas when using R. The software seems to have a lot of chemometric tools, so it is interesting to look at other videos about it. You are welcome to give your comments if you use this software.

5 mar. 2018

PCR vs. PLS (Part 11) - Independent Validation

One of the interesting things about developing a calibration for a particular application (meat meal in this case) is to install it to work in routine and wait a long period of time to have a validation set which is really independent, so we can get a true statistic for the application error.
If you have followed the previous posts, the calibration for meat meal was developed on a DA1650, and we used only the odd samples from the set as the training set, leaving the even samples for validation, so the calibration has almost 650 samples out of the 1300 possible. The constituent we are interested in calibrating is ash, but of course the others (protein, fat and moisture) are interesting as well.
We try to develop a robust regression, so the number of terms is important and we don't want to clean out the outlier samples too much.
The range is quite wide, because we have samples from different sources (instruments, species, lab reference values, ...).
In the previous post we saw the statistics, and they could seem quite high, but let's see the validation.
The validation set comes from one of the sources (one of the DA instruments, mixed meat meal, and one of the labs). The calibration samples from this source are previous to 2017, while the validation samples were acquired during 2017 and the first months of 2018. So this is an interesting set to check.
We will use the calibration (as I said) from the previous post, which uses 15 terms.
In gray we see the XY plot of the cross validation for the training set, and in blue how the new validation samples fit the regression line.
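
A minimal sketch of that over-plot ("cv_pred_15t", "val_ref" and "val_pred_15t" are assumed names, not from the original post), to which the legend below is added:

plot(Ash[odd,], cv_pred_15t, col="grey", xlab="Reference", ylab="Predicted")
points(val_ref, val_pred_15t, col="blue")   # samples acquired during 2017-2018
abline(0,1)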

legend("topleft",legend=c("Odd 15 terms",

"VAL-2017-MIX"," VAL-2017-MIX-RMSEP 15t=1,47"),
col=c("grey","blue","blue"),pch=c(1,1,1), cex=1,
It is working well in routine, and I am quite happy with the result.

Advanced data analysis with R. Module II: Machine Learning

There has been great interest in the first video of this series, so you can continue with this second one, where several practical examples of the use of R in different applications are shown. The video is quite entertaining and I think it will increase the interest in R of those of you who are getting started with this software.
Curious what he says about the model developed in China.
Let's hope they don't export it.
If you have not seen the first video, you can check the previous post,
or watch it directly on Channel 9.
Go ahead and start using R.

3 mar. 2018

PCR vs. PLS (part 10) Meat Meal

For a meat meal application I proceed the same way as in the case of the soy meal in the Infratec. This time the instrument is a DA1650 NIR with a wavelength range from 1100 to 1650 nm. The spectra were math-treated with SNV, Detrend and first derivative, with data points every 2 nm. The parameter of interest for this post is ash.
Two calibration approaches were tried, with PCR and PLS regressions. The spectra have been sorted by ash reference value, so in this plot we can see all the samples treated with SNV-DT and first derivative.
legend("bottomleft",legend=c("0-10% Ash",

       "10-20% Ash","20-30% Ash","30-40% Ash"),

       lty=c(1,1), cex=1)
I split the samples into odd and even, the same as in the other examples, and keep the even samples outside for external validation. The odd samples were used to develop the PCR and PLS regressions with CV using 10 groups. Notice that the external validation with the even samples is like using a CV with 2 groups.
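
As a minimal sketch (assuming "X_sndt" holds the SNV-DT + first-derivative spectra and "Ash" the reference values), the two models can be fitted like this:

library(pls)
# PLS and PCR on the odd meat meal samples, cross validated with 10 segments
MM_pls<-plsr(Ash[odd,] ~ X_sndt[odd,], ncomp=20, validation="CV", segments=10)
MM_pcr<-pcr(Ash[odd,] ~ X_sndt[odd,], ncomp=20, validation="CV", segments=10)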
Cross Validation (CV) helps us decide how many terms to use for the development of the regression by looking at the CV RMSEP graph, which in this case is:
grid(nx = 30, ny = 6, col = "lightgray", lty = "dotted",
     lwd = par("lwd"), equilogs = TRUE)
legend("bottomleft",legend=c("PCR RMSEP","PLS RMSEP"),
       col=c("blue","red"), lty=1, cex=1)  # colours assumed; match them to the plot
The graph is really confusing in this case for the PCR, and we get worse results than with the PLSR. In the case of the PLSR, we see that 15 terms can be enough, but we can also try with fewer (10 for example) on the even validation set.


As we can see, 15 terms work better than 10 for the even samples.

I have to say that in these sets there are samples from different origins, labs, etc., and outliers have not been removed, so there is room for improvement.

26 feb. 2018

PCR vs. PLS (part 9)

One way to understand the high correlation of the 6th term with the ash parameter is to check the 6th term itself. The terms are spectra-like vectors which reconstruct the original spectra and are included in the P matrix.
The original spectra are treated with a first derivative, where a peak maximum is converted into a zero crossing, so it is difficult to see easily what is going on. We have to find the wavelengths to which the zero crossings correspond.
We can check in the loading vector at which wavelengths the values go from negative to positive (or vice versa), or use the identify function in R, as in this case, to find the data points of the zero crossings and then check which wavelengths those data points correspond to.
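
A minimal sketch of that step ("loading_6" is an assumed name for the 6th loading vector):

plot(loading_6, type="l")
idx<-identify(seq_along(loading_6), loading_6)   # click near the zero crossings
1100+2*(idx-1)   # data-point index to wavelength: 1100 nm start, 2 nm step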

Data point:  39    80    85    113   132   145   148   182   193   200   214   217
nm:          1176  1258  1268  1324  1362  1388  1394  1462  1484  1498  1526  1532
The idea of this post is not scientific research, of course, but we can check the bibliography to get an idea of what these bands are related to, in order to understand our model better.
In this case, 1484 and 1498 nm are bands normally assigned to cellulose.

The original spectra, treated with the first derivative, are:

Remember from the previous post that the ash content is 0 to 10% for the blue spectra, 10 to 20% for green, 20 to 30% for orange, 30 to 40% for red, and grey is for samples with no ash reference value.
In this set we have mixed meat meal, pork, poultry, ...

25 feb. 2018

PCR vs. PLS (Part 8)

Trying to understand the meat meal data set better, I gave the blue color to the samples from 0 to 10% ash, green to samples from 10% to 20%, orange to samples from 20 to 30% and red to samples from 30 to 40%.
If we develop the PCA and apply the Mahalanobis Distance to different planes we get:

We can continue with the rest of the combinations, trying to see which one gives the best correlation, or in this case the clearest pattern with the colors.
Developing the PCR regression, we get the explained variance plot, which can help us see which PC has the highest correlation with the ash constituent:
It is clear that the 6th PC has the highest explanation for ash, so let's see the plane:
Ash is not an easy parameter to calibrate, and a high number of components is necessary for a calibration, but this one has the highest contribution in a PC regression in this case. There are outliers that have not been removed yet; as soon as they are removed, the correlation will improve a little.
> cor(NIR_princomp$scores[1:1280,6],Ash[1:1280,])
[1] 0.5487329
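
To check all the components at once, a quick loop over the scores (same objects as above) can be used as a sketch:

# correlation of each of the first 10 PC scores with the ash reference values
sapply(1:10, function(i) cor(NIR_princomp$scores[1:1280,i], Ash[1:1280,]))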
This is really a high correlation for this parameter.


Machine learning with R

21 feb. 2018

Analysis of milk powder adulteration by NIR

There is not much documentation about the discriminant methods of Win ISI 4, so I will try to post what I find so that you can access it from the blog. In this case the new methods are used to detect melamine adulteration in milk powder, and you can download the document from this link.
The article is translated into Spanish.

20 feb. 2018

PCR vs. PLS (part 7)

We have used PLS with this data set in the post series "Analyzing Soy Meal in transmittance", so in these coming posts I just want to add a comparison between the regressions developed with PLS and with PCR.
The best way to do it is to compare some plots, so in this post I show the comparison of the PCR and PLS regressions with the same number of terms, the same calibration set (odd samples), and the same validation during the development of the model (cross validation).
The plot shows how (as expected) the first terms are more correlated with the constituent of interest (protein in this case) for PLS than for PCR, so the RMSEP values are better; but as soon as more terms are added they are almost the same (a little better for PLS).
## add extra space to right margin of plot within frame
par(mar=c(5, 4, 4, 6) + 0.1)
plot(Xodd_pls3,"validation", estimate="CV",
     main=" Corr & RMSEP PCR vs PLS")
## Plot the second plot and put axis scale on right
par(new=TRUE)
plot(terms,Xodd_pls3_cor,  xlab="", ylab="", ylim=c(0,1),
     axes=FALSE, type="l",lty=4,lwd=2, col="red")
par(new=TRUE)
plot(terms,Xodd_pcr3_cor,  xlab="", ylab="", ylim=c(0,1),
     axes=FALSE,type="l",lty=4,lwd=2, col="blue")
## a little farther out (line=4) to make room for labels
axis(4, ylim=c(0,1), col="black",col.axis="black",las=1)
legend("bottomright",legend=c("RMSEP PLS", "RMSEP PCR",
       "Corr PLS","Corr PCR"),
       col=c("black","black","red","blue"), lty=c(1,1,4,4))  # styles assumed
A reason for the improvement with PLS can be seen in the regression coefficients (with 5 terms), where the PLS coefficients (in red) are sharper and can give better estimations (we will try to analyze this in coming posts).

18 feb. 2018

PCR vs. PLS (Part 6)

After checking the RMSEP and the correlation for different numbers of terms in the PCR I get this plot, using this code:
par(mar=c(5, 4, 4, 6) + 0.1)
# first plot: CV and external-validation RMSEP ("CV_rmsep" and "ExtVal_rmsep" are assumed names)
plot(terms, CV_rmsep, type="b",
     xlim=c(1,10),lwd=2,xlab="Nº Terms",
     main=("CV vs Ext Validation & Corr"))
lines(terms, ExtVal_rmsep, type="b", col="red")
## Plot the second plot and put axis scale on right
par(new=TRUE)
plot(terms, Test_corr, pch=15,  xlab="", ylab="",
     ylim=c(0,1),axes=FALSE, type="b", col="green")
## a little farther out (line=4) to make room for labels
axis(4, ylim=c(0,1), col="green",
     col.axis="green", las=1)
legend("topright",legend=c("CV", "Ext Val","Corr"),
       col=c("black","red","green"), pch=c(1,1,15))  # styles assumed
## In the plot, 5 seems to be the best option for the number of terms
Now we can compare the results in the XY plot, as we saw in the post "PCR vs. PLS (part 5)", to draw conclusions.
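
The table passed to the Monitor function below can be built with a sketch like this ("X_msc_test" and "Prot_test" are assumed names for the test-set spectra and reference values):

# predictions with 5 terms for the external test set
pred_test_5t<-predict(Xodd_pcr3, ncomp=5, newdata=X_msc_test)
pred_vs_ref_test_5t<-data.frame(Ref=Prot_test, Pred=drop(pred_test_5t))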
The statistics for this external validation set with 5 terms are (Monitor function):

> monitor10c24xyplot(pred_vs_ref_test_5t)
Validation Samples  = 25 
Reference Mean  = 45.6236 
Predicted Mean  = 46.38495 
RMSEP    : 1.046828 
Bias     : -0.761354 
SEP      : 0.7332779 
Corr     : 0.9131366 
RSQ      : 0.8338185 
Slope    : 0.7805282 
Intercept: 9.418837 
RER      : 8.115895   Poor 
RPD      : 2.428305   Fair 
Residual Std Dev is : 0.633808 
    ***Slope/Intercept adjustment is recommended***
BCL(+/-): 0.3020428 
***Bias adjustment is not necessary***
Without any adjustment and using  SEP as std dev 
the residual distribution is:
Residuals into 68% prob (+/- 1SEP)      = 15 = 60 %
Residuals into 95% prob (+/- 2SEP)      = 21 = 84 %
Residuals into 99.5% prob (+/- 3SEP)    = 23 = 92 %
Residuals outside 99.5% prob (+/- 3SEP) =  2 =  8 %
With S/I correction and using Sres as standard deviation,
the Residual Distribution would be:
Residuals into 68% prob (+/- 1Sres)     = 20 = 80 %
Residuals into 95% prob (+/- 2Sres)     = 24 = 96 %
Residuals into 99.5% prob (+/- 3Sres)   = 25 = 100 %
Residuals outside 99.5% prob (> 3Sres)  =  0 =  0 %

17 feb. 2018

PCR vs. PLS (part 5)

It's time for an external validation with an independent test set. The samples in the calibration are soy meal scanned on a NIT instrument (850 to 1050 nm), but some years had passed without testing the calibration with new samples, so I had the occasion to scan new samples on a new instrument (from a new generation). Of course the path length for transmission must be configured to the same value and the sample must be representative.
The validation set has 25 samples, where we tried to cover as wide a range as possible. The reference values come mainly from two laboratories, but there are two samples from a third laboratory.
We developed in the last posts a PCR calibration where we saw that with 10 terms we obtained the best results, but maybe this value is high if we want the calibration to be more transferable. The test validation will help us to check this.
If we don't consider the lab origin, and we over-plot the predictions on the XY calibration plot (including cross validation predictions), we get this plot:
Some samples fit quite well, others have a bias, and others show random error. We want to understand better what is going on, so we color every sample depending on the lab which provided the reference value.

legend("topleft",legend=c("Lab1", "Lab2","Lab3"),

Well, we can draw some conclusions from this plot, but we need to check the predictions with different numbers of terms to see if the Standard Error of Prediction (with 10 terms the RMSEP is 1.65) decreases and the RSQ (0.593 with 10 terms) increases.

16 feb. 2018

PCR vs. PLS (part 4)

In the post "Splitting spectral data into training and test sets"  , we kept apart the even samples for a validation, so it is the time after the calculations in post (PCR vs PLS part 3) to see if the performance of the model, with this even validation sample set, is satisfactory.
First we can over-plot the predictions over the plot of the blue and red samples we saw in post (PCR vs PLS part 3), and the first impression is that they fit quite well, so the performance seems to be as expected during the development of the model.

legend("topleft",legend=c("Odd", "Odd CV","Even"),


The prediction error for the Even Test Set is:

[1]  0.9616
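
That value can be reproduced with a sketch like this (the number of terms and the "even" index are assumptions based on the previous posts):

# predict the even samples and compute the RMSEP
pred_even<-predict(Xodd_pcr3, ncomp=10, newdata=X_msc[even,])
sqrt(mean((drop(pred_even)-Prot[even,])^2))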

Probably this "even" sample set is not really an independent set, so we need to go a step farther and check the model with a really independent set with samples taken in a different instrument, with reference values from different laboratories,.....This will be the argument of the next part in these series about PCR and PLS regressions.

15 feb. 2018

Introduction to the dplyr R package

PCR vs. PLS (part 3)

A common way to develop the regressions with PCR or PLS is with "Cross Validation", where we divide the training data set into groups; the number of groups can be selected depending on the number of samples. If you have few samples you can select a high value (10 for example), and if you have a lot of samples you can select a low value (4 for example). One group is kept out and the regression is developed with all the others; we use the group kept outside to validate, repeating the process as many times as there are validation groups.
So, in the case of 4 groups, we obtain 4 validation statistic values (RMSEP, RSQ, Bias, ...) for every group used in the development of the regression, so we can see from which term the RMSEP starts to become higher, or where the RMSEP values become almost flat and it makes no sense to add more terms to the regression.
Cross validation really helps us avoid over-fitting, but it also helps to detect outliers (good or bad outliers, we could say).
It is important that the samples in the training set have a certain sorting or randomness. We don't want similar samples in both the training and the validation groups, but this requires inspecting the spectra before developing the regression, looking for neighbors or other sources of redundancy. If there are similar samples, we want them to stay, if possible, in the same group.
In the development of the PCR we can select the Cross Validation option and also the number of segments or groups for validation, in this case 10:

Xodd_pcr3<-pcr(Prot[odd,] ~ X_msc[odd,],ncomp=30,
               validation="CV",segments=10)


If we look at the plot of the RMSEP for the 30 terms, we see that up to the fourth term the RMSEP decreases dramatically, but after 10 it makes no sense to use more terms, so we have to select 10 or fewer.
An external validation (if possible with more independent samples) can help us with the decision.
A particular case of Cross Validation is the Leave One Out validation, where a single sample is kept out for validation and the rest stay in the calibration. The process is repeated until every sample has been part of the validation, so in this case there are as many segments or groups as samples. This option is quite interesting when we have few samples in the training set.
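
A minimal sketch of the leave-one-out option with the same model as above:

# leave-one-out cross validation in the pls package
Xodd_pcr_loo<-pcr(Prot[odd,] ~ X_msc[odd,], ncomp=30, validation="LOO")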

About the X-Y plots (predicted vs. reference values), it is important to interpret them correctly in the case of cross validation. There are plots which show the predicted vs. the reference values when all the samples are part of the regression (blue dots), and those plots are not realistic. It is better to look at the plot where every dot (sample) has a value obtained when it was not part of the model because it was in the validation group (red dots); this way we have a better idea of the performance of the regression for future samples.

This is not a nice tutorial set where the samples fit well, but it is what you often find in real applications where you want to get the best model for a certain product, parameter and instrument.

eRum 2018 (European R users meeting)

The European R Users Meeting, eRum, is an international conference that aims at integrating users of the R language living in Europe. Although the useR! conference series serves similar goals, as it alternates between Europe and the USA (and more recently Australia in 2018), we decided to start another conference series in the years when useR! is outside of Europe.

12 feb. 2018

To consider when validating with LOCAL

I have talked in other posts about LOCAL calibration. Once the LOCAL model is placed in routine and time passes, it is time for validation.
LOCAL may have several products merged in the unique database from which we developed the calibration, so it is important to label every sample in the validation set with a value (in this case 1: Black, 2: Red and 3: Green) corresponding to the 3 different types of pork meat meal.
The validation simply compares the reference values from the laboratory with the predicted values from our near-infrared instrument, and the best way to represent it is an XY plot. The most important constituents in meat meal are: moisture, protein, fat and ash.
A first look at the general XY plots gives an idea of how it works:
The plot tells us that we have to split the validation set into 3 validation sets, one for every class, to calculate the statistics for each one.
In the case of Protein we will get this validation:

For the Green product:
For the Red Product:
For the Black Product:

As we can see, we have a significant bias for product 1, but not for the other two products, so we can proceed with the validation adjustments (3 products linked to a LOCAL database, one of them bias-adjusted until the calibration is updated).
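
As a minimal sketch of that per-product check ("val_table" with columns "class", "Reference" and "Predicted" is an assumed structure, not from the original post):

# bias and SEP for every product class in the validation table
by(val_table, val_table$class,
   function(d) c(Bias=mean(d$Predicted-d$Reference),
                 SEP=sd(d$Predicted-d$Reference)))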
Plots with R

11 feb. 2018

PCR vs. PLS (part 2)

Continue from post: "PCR vs PLS (part 1)":

As we add more terms to the model (PCR or PLS), the Standard Error of Calibration decreases and decreases; for this reason it is necessary to have a validation method to decide how many terms to keep.
Cross Validation or an external validation set are some of the options.
The "pls" R package has the function "RMSEP", which gives us the standard error of the calibration, so we can get an idea of which terms are important for the calibration.
Continuing with the post PCR vs PLS (part 1), once we have the model we can check the Standard Error of Calibration with the 5 terms used in the model.

RMSEP(Xodd_pcr, estimate="train")
(Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps  
      1.985        1.946        1.827        1.783        1.167        1.120 
As we can see, the error decreases all the time, so we can be tempted to use 5 or even more terms in the model.
One of the values of the pcr function output is "fitted.values", which is an array with the predictions depending on the number of terms.
As we can see in the RMSEP values, 4 seems to be a good number to choose, because there is a big jump in the RMSEP from PC term 3 to PC term 4 (1.783 to 1.167), so we can keep the predicted values with this number of terms and compare them with the reference values to calculate other statistics.
Anyway, these statistics must be considered optimistic, and we have to wait for the validation statistics (50% of the samples, in the even positions, taken apart in the post "Splitting Samples into a Calibration and a Validation sets").

Statistics with the odd samples (the ones used to develop the calibration):
## The reference values are compared with the predicted ones.
## We create a table with the sample number,
## reference and predicted values
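
A sketch of that table ("fitted.values" is the component of the pcr object; the 4-term slice follows the choice above):

pred_odd_4t<-Xodd_pcr$fitted.values[,,4]   # predictions with 4 terms
pred_vs_ref<-data.frame(Ref=Prot[odd,], Pred=pred_odd_4t)
head(pred_vs_ref)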

Now using the Monitor function I have developed:

We can see other statistics such as the RSQ, Bias, etc.

Now we make the evaluation with the "Even" samples taken apart for the validation:

pred_vs_ref_val[is.na(pred_vs_ref_val)] <- 0  # change NA to 0 for the Monitor function

As we can see, in this case the RMSEP is a validation statistic, so it can be considered the Standard Error of Validation, and its value is almost the same as the RMSEP for calibration:
RMSEP Calibration ......... 1.17
RMSEP Validation .......... 1.13