20 May 2018

Monitoring LOCAL calibrations (part 2)

Following on from the previous post: when monitoring LOCAL equations we have to consider several points, such as:
  1. What is the reference method, and what is the laboratory error for that method? For example, the reference method for protein can be Dumas or Kjeldahl.
  2. Do you have data from several laboratories or from just one? Which reference method does every lab use, and what is its laboratory error?
  3. If we have data in the Monitor from both methods, or from different laboratories, split them apart and see which SEP we get for each. Check also the SEP corrected by the bias.
  4. Check the performance of the SEP and other statistics for every species. In meat meal, for example, you have pork, mix, poultry, feather, ...
Surely you can extract more knowledge from the data:
  1. Do we have more error with one method than another, or with one lab than another?
  2. Does a reference method perform better for one particular species than for another?
  3. ................
Draw conclusions from these and other thoughts.
For all this, it is important to organize the data well and use it properly. Don't forget that we are comparing reference versus predicted values, and that the prediction depends on the model itself apart from the reference values, so we have to be sure that our model is not overfitted or wrongly developed.
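As a sketch of how these checks might be organized in R (the data frame, column names and error structure here are invented for illustration), we can split the monitor data by laboratory and compute the bias, the SEP, and the bias-corrected SEP for each group:

```r
set.seed(42)
# Toy monitor data: reference vs. predicted protein for two hypothetical labs
monitor <- data.frame(
  lab = rep(c("Lab1", "Lab2"), each = 50),
  ref = rnorm(100, mean = 45, sd = 2)
)
monitor$pred <- monitor$ref + rnorm(100, mean = 0.3, sd = 0.5)

sep_stats <- function(ref, pred) {
  res  <- pred - ref
  bias <- mean(res)
  sep  <- sqrt(sum(res^2) / length(res))                 # uncorrected SEP (RMSEP)
  sepc <- sqrt(sum((res - bias)^2) / (length(res) - 1))  # SEP corrected by the bias
  c(bias = bias, SEP = sep, SEP_bias_corr = sepc)
}

# One row of statistics per laboratory
t(sapply(split(monitor, monitor$lab),
         function(d) sep_stats(d$ref, d$pred)))
```

The same `split()` approach works for splitting by reference method or by species instead of by laboratory.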

19 May 2018

Monitoring LOCAL calibrations

Databases used to develop LOCAL calibrations normally have a high quantity of spectra with lab values, but we have to take care of them, adding new sources of variance. This way we make them more robust, and the standard errors of prediction (SEP) decrease when we validate with future independent validation sets.
This was the case with a meat meal LOCAL database updated with 100 samples with protein values and with new sources of variation such as laboratories, reference methods (Kjeldahl, Dumas), providers, new instruments, ...
After the LOCAL database update, a set of new samples was received with reference values, and I predicted these values with the Monitor function in Win ISI using the LOCAL database before (blue dots) and after the LOCAL update (red dots).
The errors decrease in an important way, especially for some types of samples, when validating with the new set (samples acquired after the updated LOCAL calibration was installed in the routine software), so even if we already have spectra from this instrument, these labs, ..., this set has to be considered an independent set.
I don't give details of the statistics, but the picture shows the same samples predicted with the LOCAL calibration without the update (in blue) and with the updated LOCAL calibration (in red). The improvement in the prediction for some samples is important, so the idea is to add these new samples and continue monitoring the LOCAL database with future validation sets.

17 May 2018

Nils Foss died on 16-May-2018

Sad news from FOSS: Nils Foss (the founder of FOSS) died yesterday.
He was a good example of an entrepreneur.

15 May 2018

Random selection before calibration.

Following a course about Machine Learning with R, I realized the importance of the random selection of the samples for the development of the models.
R has good tools to select the samples randomly, and doing so was common practice during all the exercises in the course.
I do the same here with the soy meal samples I have used for several posts, so we can compare the results.
The idea of random selection is to make the model robust against the bias which we see quite often when validating the models with independent samples.
We can see if the number of terms selected changes, or if we get similar results to the previous selection, which used a structured selection of odd and even samples.
Random selection order is also important for a better cross validation.
Here is the code and preparation of the sample sets for the development of the models.

##################### SPLITTING SAMPLES RANDOMLY #############
#In this case we need first the dataframe "soy_ift_conv"
#Split the data into 65% training and 35% test
set.seed(1234)   # fix the seed so the random split is reproducible
train_idx <- sample(nrow(soy_ift_conv), round(0.65 * nrow(soy_ift_conv)))
training  <- soy_ift_conv[train_idx, ]
test      <- soy_ift_conv[-train_idx, ]

#Plotting Training and Validation sets overlapped.
#(gray = validation/test samples, blue = training samples)


We see in gray the validation samples selected and in blue the training samples.


2 May 2018

First approach to ANN calibrations

This is a first approach to developing ANN calibrations with the soy meal data from the Infratec, and it is really promising.
I follow this code and plot the results:

#We build a dataframe with the constituent values
#and with the spectra math treated with MSC
Y <- as.matrix(sm_ift[, 2])       # the Y matrix holds the protein values
which(is.na(Y), arr.ind = TRUE)   # sample 462 has an NA value, so we remove it
X <- as.matrix(sm_ift[, 6:105])   # the X matrix holds the NIR data
X <- X[-462, ]; Y <- Y[-462, , drop = FALSE]
##PRINCIPAL COMPONENTS ANALYSIS using package "chemometrics" (NIPALS)

#' Split the data into 65% training and 35% test
set.seed(1234)
idx <- sample(nrow(X), round(0.65 * nrow(X)))

#' Create a neural network model with 3 hidden
#' nodes of Y~. using the training data.
#' Store this model as 'model'
library(nnet)   # one possible package choice; the original code was not shown
model <- nnet(X[idx, ], Y[idx], size = 3, linout = TRUE)

#' Use this model to estimate the Y values of the test data
pred <- predict(model, X[-idx, ])

#' Calculate the MSE of the model on the test data and output
#' it using the print or cat functions
mse <- mean((Y[-idx] - pred)^2)
cat("The MSE of the model on the test data is: ", mse, "\n")
#The MSE of the model on the test data is:  0.9314747

The loss function in this case is the MSE (Mean Squared Error).
More practice in the coming days.

30 Apr 2018

Validation with LOCAL calibrations

When developing a LOCAL calibration in Win ISI, we use an input file and a library file with the idea of selecting the best possible model, so the selection of a well-representative input (let's call it the validation set) is very important for success in the development of the model.
So the model is conditioned by the input file: if we had chosen another input file we could have gotten another model which performs differently, so the need for a test set is obvious to check how the model performs with new data.
It is important to keep this in mind, so one proposal would be to divide (randomly) the data into three sets: 60% for training, 20% for input or validation, and another 20% for testing.
There are other ways to sort the data in order to select these three sets (time, seasons, species, ...). One thing is clear: some of the models developed will perform better than others, so you can keep several of them, check them when you have new data, and use a statistic such as the MSE (Mean Squared Error) to compare them.
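A minimal sketch of that random 60/20/20 split in R, using a simulated data frame in place of the real one:

```r
set.seed(123)
df <- data.frame(matrix(rnorm(200 * 10), 200, 10))  # 200 hypothetical samples

n   <- nrow(df)
idx <- sample(n)   # random permutation of the row indices
train <- df[idx[1:round(0.60 * n)], ]
input <- df[idx[(round(0.60 * n) + 1):round(0.80 * n)], ]
test  <- df[idx[(round(0.80 * n) + 1):n], ]

c(train = nrow(train), input = nrow(input), test = nrow(test))
# train: 120, input: 40, test: 40
```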

19 Apr 2018

Projections over the planes in PC space

This is a 3D plot showing the orthogonal projections onto the plane formed by the first and second PCs calculated with SVD. Once projected onto the plane, the projections are projected again onto the new axes (the terms or loadings) to calculate the score for every PC.
The plot can be moved with the mouse, so find the best view.

The data is centered, so we see samples on both sides of the plane.
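A small sketch of those projections using SVD on simulated centered data (all names here are toy examples); the residual of each sample is orthogonal to the PC plane:

```r
set.seed(1)
X  <- scale(matrix(rnorm(30 * 5), 30, 5), scale = FALSE)  # centered toy data
sv <- svd(X)

T2     <- X %*% sv$v[, 1:2]      # scores: projections on the first two PCs
X_proj <- T2 %*% t(sv$v[, 1:2])  # orthogonal projection of each sample onto the plane

round(max(abs((X - X_proj) %*% sv$v[, 1:2])), 10)  # 0: the residual is orthogonal to the plane
```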

17 Apr 2018

Covariance plot and Mahalanobis ellipse

In the previous post we saw the correlation plot, and in this one the covariance matrix plot; we just have to change the function "cor" to "cov" in the code.
The covariance matrix is used frequently in chemometrics, as in this case to define the ellipse for the Mahalanobis distance, using the covariance between the two axes (one for the 964 nm wavelength and the other for the linear combination of the 1022 and 902 nm wavelengths that we have seen in another recent post).
We can do this easily with the package chemometrics and this code:



to get this plot:
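The post relies on the chemometrics package (its drawMahal function) for the plot; as a hedged base-R sketch of the underlying computation, the ellipse at a given Mahalanobis distance can be drawn directly from the covariance matrix (toy data standing in for the two absorbance axes):

```r
set.seed(2)
x1 <- rnorm(200)
x2 <- 0.9 * x1 + rnorm(200, sd = 0.3)  # correlated, as neighboring wavelengths are
X  <- cbind(x1, x2)

S   <- cov(X)                      # the covariance matrix defines the ellipse
ctr <- colMeans(X)
r   <- sqrt(qchisq(0.975, df = 2)) # 97.5% Mahalanobis radius

# Points on the ellipse: center + r * t(chol(S)) %*% (unit circle)
theta <- seq(0, 2 * pi, length.out = 200)
circ  <- rbind(cos(theta), sin(theta))
ell   <- t(ctr + r * t(chol(S)) %*% circ)

plot(X, col = "green", xlab = "x1", ylab = "x2")
lines(ell, col = "blue", lwd = 2)
```

Every point on `ell` sits at the same squared Mahalanobis distance `r^2` from the center, which is exactly what the ellipse in the plot represents.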

16 Apr 2018

Correlation Matrix plot

In the last posts we have been talking about the wavelength space and its high collinearity, because we want to select wavelengths with little correlation between them in order to develop a model.
For this task we can check the correlation matrix, which is easier to inspect as a plot than as numbers. This is the plot for the soy meal samples in transmittance, using the 100 wavelengths from 850 to 1048 nm in steps of 2 nm, so the correlation matrix is a 100 x 100 symmetric matrix with ones on the diagonal, as can be seen in the plot.
The red line is the correlation of the 962 nm wavelength with all the rest, including itself (1 in this case). The vertical blue lines mark the wavelengths at 1022, 902 and 962 nm used in the recent posts.
See the Correlation matrix plot and code: 
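The code itself was shown as an image; a base-R sketch of such a correlation-matrix plot, with simulated smooth spectra standing in for the real soy meal data, could look like this:

```r
set.seed(3)
# Simulate 50 smooth "spectra" over the 100 data points (850-1048 nm, step 2 nm)
wl <- seq(850, 1048, by = 2)
X  <- t(sapply(1:50, function(i) cumsum(rnorm(100))))

C <- cor(X)   # 100 x 100, symmetric, ones on the diagonal
image(wl, wl, C, xlab = "Wavelength (nm)", ylab = "Wavelength (nm)",
      main = "Correlation matrix")

# Correlation of the 962 nm wavelength with all the rest (the red line)
plot(wl, C[, which(wl == 962)], type = "l", col = "red",
     xlab = "Wavelength (nm)", ylab = "Correlation")
abline(v = c(1022, 902, 962), col = "blue")
```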





14 Apr 2018

Ellipses and ellipsoids in the wavelength space

Let's look at the plane from the previous post:

We can see how the dots form something like an ellipse, and this is characteristic when plotting some wavelengths versus others.

In this case we see the ellipse in a plane, but we can also find ellipsoids in 3D or more dimensions.

13 Apr 2018

Regression planes in R (wavelength space)

We have seen the high correlation between wavelengths in previous posts, and how we can reduce the wavelength space with lower-dimension plots.

Take the case of the 3 wavelengths selected in the previous post at 1022, 964 and 902 nm.

We can use the wavelength at 1022 nm as the dependent variable and calculate an MLR regression to predict the absorbances at this wavelength as a linear combination of the absorbances at 902 and 964 nm, calculating a regression plane:

  #1022nm     datapoint 87
  # 902nm     datapoint 27
  # 964nm     datapoint 58
x1 <- X_msc[, 27]   # 902 nm  (definitions reconstructed from the axis labels)
x2 <- X_msc[, 58]   # 964 nm
x3 <- X_msc[, 87]   # 1022 nm
fit <- lm(x3 ~ x1 + x2)   # the regression plane

library(scatterplot3d)
s3d <- scatterplot3d(x1, x2, x3, pch = 16, highlight.3d = TRUE,
                     angle = 330, xlab = "902 nm",
                     ylab = "964 nm", zlab = "1022 nm")
s3d$plane3d(fit, lty.box = "solid")

We can see the plane in the new regression-plane plot:

8 Apr 2018

Intercorrelation between constituents and wavelengths

In the post "Linear combinations to improve correlation" we selected three wavelengths where we normally find overtones for protein, fat and fiber (cellulose), and we found a linear combination of these three wavelengths that improves the correlation with the protein constituent.
In this one we continue with the study of this data set. We have not just the protein constituent; we also have the values for moisture, fiber and fat in the "Y" matrix, and we want to check the inter-correlation between the constituents.
If we have NA values for some sample we won't get the correlation value, and this is the case here, so we first remove the samples with NA values and then calculate the correlation between the constituents:
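A sketch of that NA handling with a toy "Y" matrix (the real code was shown only as an image; these numbers are simulated):

```r
set.seed(4)
# Toy Y matrix of constituents with one missing value
Y <- data.frame(Protein  = rnorm(20, 47, 2),
                Moisture = rnorm(20, 12, 1))
Y$Fiber <- 50 - 0.8 * Y$Protein + rnorm(20, sd = 0.5)  # negatively correlated
Y$Protein[5] <- NA   # one sample without a protein value

round(cor(Y), 4)              # NA propagates into the Protein row and column
Yc <- Y[complete.cases(Y), ]  # remove samples with NA values
round(cor(Yc), 4)             # now all correlations are defined
```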


           Protein     Moisture     Fiber       Fat
Protein   1.0000000  0.15406257 -0.83770097 -0.16767249
Moisture  0.1540626  1.00000000 -0.16143400 -0.09022824
Fiber    -0.8377010 -0.16143400  1.00000000  0.07874072
Fat      -0.1676725 -0.09022824  0.07874072  1.00000000

We find a high negative correlation between fiber and protein, but this is normal in the case of soy meal.
But do we also have a high correlation between the absorbances at the wavelengths we associate with protein and fiber?

#1022nm  datapoint 87       Protein or Oil
#1008nm  datapoint 80       Oil or Water
# 996nm                     Oil or Water
# 902nm  datapoint 27       Cellulose      
# 964nm  datapoint 58       CH2 Oil       

x1 <- X_msc[, 87]   # 1022 nm
x2 <- X_msc[, 27]   #  902 nm
cor(x1, x2)
# [1] -0.9957298
plot(x1, x2, xlab = "1022 nm  Protein?", ylab = "902 nm Fibre?",
     col = "green", main = "cor(X_msc[,87],X_msc[,27])")

7 Apr 2018

Linear combinations to improve correlation

With the data from soy meal on the IFT conveyor, I select three wavelengths for this demo:
# 1022nm     datapoint 87     Protein or Oil  
#  902nm     datapoint 27     Cellulose        
#  964nm     datapoint 58     CH2 Oil


We have the values for Protein for these spectra.

Protein <- Prot

Let's see the wavelengths in the mean-centered, MSC-treated spectra.


Now see how the correlation becomes better in the case of the 4th plot, where the X axis is a linear combination of the other 3 wavelengths.
We have worked with this data before with PLS and PCR, and what we have done here is an MLR approach. An intercept value will place the data on the same scale.
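The idea can be sketched with simulated absorbances (all names and coefficients here are invented): each single wavelength correlates only moderately with protein, while the MLR-fitted linear combination correlates at least as well as the best single wavelength:

```r
set.seed(5)
n <- 100
Protein <- rnorm(n, 47, 2)
# Simulated absorbances: each wavelength mixes protein information and noise
x1022 <-  0.020 * Protein + rnorm(n, sd = 0.02)
x902  <- -0.015 * Protein + rnorm(n, sd = 0.02)
x964  <-  0.010 * Protein + rnorm(n, sd = 0.02)

cor(x1022, Protein)   # correlation of a single wavelength

# MLR: a linear combination of the three wavelengths (the intercept sets the scale)
fit  <- lm(Protein ~ x1022 + x902 + x964)
comb <- fitted(fit)
cor(comb, Protein)    # at least as high as any single wavelength
```

This is guaranteed by least squares: the multiple R-squared of the MLR fit can never be below the squared correlation of any single predictor.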

5 Apr 2018

X spectra matrix redundancy

We know about the redundancy in the columns of the wavelength matrix X, where there is a high correlation between many of the wavelengths, so we can create new matrices that represent the linear combinations of the X matrix.
In this picture I select 3 of the 100 wavelengths, and we can see how two of them are highly correlated, so the information can be represented in a plane instead of in a cube: we change the space from 3 variables to just 2.
Columns X1 and X3 have a high correlation, so they are dependent vectors because they go in the same direction. X2 goes in a different direction, so it is independent of X1 and X3.

2 Apr 2018

Linear Regressions to calculate the Loadings

We saw in a previous post how the scores are calculated for every PLS term once we have the weights, so we start with t1 (the scores on the first PLS term), and afterwards we calculate the loadings as regressions of X on T. The loadings are therefore spectra whose values are "b" coefficients instead of absorbances, so they give us an idea of the importance of every wavelength in the term.
Every spectrum has 100 wavelengths (850 to 1048 nm every 2 nm), so 850 is data point 1, 852 is data point 2, ..., and 1048 is data point 100.
So we can regress every column of the X matrix (with its math treatment applied) on the calculated score t1 to get the value of every data point in the first loading.
The way to do it is:

lm(X_msc_centered[odd,10]~t1_pls)     #b=-1.247e-01
lm(X_msc_centered[odd,50]~t1_pls)     #b=-6.269e-02
lm(X_msc_centered[odd,90]~t1_pls)     #b= 1.348e-01

We can check afterwards that these values are the same as those calculated by the PLS algorithm.
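That equivalence (each element of the first loading equals the slope of the regression of the corresponding X column on t1) can be checked on simulated data; the names here are toy examples:

```r
set.seed(6)
Xc <- scale(matrix(rnorm(50 * 10), 50, 10), scale = FALSE)  # centered X
t1 <- Xc %*% rnorm(10)                                      # a score vector

# Loading by the matrix formula: p1 = X't1 / (t1't1)
p1 <- crossprod(Xc, t1) / as.numeric(crossprod(t1))

# Loading element by element, as regression slopes of the X columns on t1
b <- sapply(1:10, function(j) coef(lm(Xc[, j] ~ t1))[2])

max(abs(p1 - b))   # ~0: both ways give the same loading
```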

25 Mar 2018

What condition must have the "loadings" in PLSR?

For the PLS loadings we do not have the orthogonality condition that applies to the weights or the scores, even though they come from orthogonal scores.
So the product of the matrices Pt.P is not a diagonal matrix, and this is one of the things that makes PLS regression special and different from principal components regression.

Let´s check:

 > round(t(p.matrix) %*% p.matrix,4)
              Xodd_pls3_ld1 Xodd_pls3_ld2 Xodd_pls3_ld3 Xodd_pls3_ld4
Xodd_pls3_ld1        1.0132       -0.1151        0.0000        0.0000
Xodd_pls3_ld2       -0.1151        1.2019       -0.4493        0.0000
Xodd_pls3_ld3        0.0000       -0.4493        1.8783       -0.9372
Xodd_pls3_ld4        0.0000        0.0000       -0.9372        1.0134

We can see the off-diagonal values (shown in red in the original output) that break the orthogonality condition, and how loadings 2 and 3 are especially involved; this is because in this case loadings 1 and 4 are very similar to weights 1 and 4, while 2 and 3 are quite different.

So loadings 1 and 4 are orthogonal to each other, but in general there is no such condition for the loadings.
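These conditions (together with the two from the related posts: orthonormal weights and orthogonal scores) can be verified numerically with a minimal PLS1 (NIPALS-style) sketch on simulated data; this is my own toy implementation, not the code behind the posts:

```r
set.seed(1)
# Simulated data: 40 spectra x 20 wavelengths, one constituent
X  <- matrix(rnorm(40 * 20), 40, 20)
y  <- rnorm(40)
Xc <- scale(X, scale = FALSE)
yc <- y - mean(y)

W <- P <- matrix(0, 20, 4)   # weights and loadings
Tm <- matrix(0, 40, 4)       # scores
for (a in 1:4) {
  w  <- crossprod(Xc, yc); w <- w / sqrt(sum(w^2))   # weight, normalized to length 1
  tt <- Xc %*% w                                     # scores: projections of X on w
  p  <- crossprod(Xc, tt) / sum(tt^2)                # loading: regression of X on tt
  Xc <- Xc - tt %*% t(p)                             # deflate X
  yc <- yc - tt %*% crossprod(tt, yc) / sum(tt^2)    # deflate y
  W[, a] <- w; Tm[, a] <- tt; P[, a] <- p
}

round(crossprod(W), 4)   # identity matrix: weights are orthonormal
round(crossprod(Tm), 4)  # diagonal: scores are orthogonal
round(crossprod(P), 4)   # NOT diagonal: loadings have no orthogonality condition
```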

23 Mar 2018

Gilbert Strang Lesson: "Singular Value Decomposition (SVD)"

I really like the way this professor explains the lessons. We are using SVD quite a lot in R, for example to calculate the weights we have been talking about often lately. Enjoy the lesson, and I wish all the readers a nice weekend.

What condition must have the scores in PLSR?

We have seen that the weights are orthogonal in PLSR ("What condition must have the weights in PLSR?"), so the scores, which are projections of the spectra over the weights, must be orthogonal as well.

Let's do the same operations as in "What condition must have the weights in PLSR?":

round(t(t.matrix) %*% t.matrix,4)

         [,1]   [,2]   [,3]   [,4]
[1,] 146.3176 0.0000 0.0000 0.0000
[2,]   0.0000 0.4371 0.0000 0.0000
[3,]   0.0000 0.0000 0.0634 0.0000
[4,]   0.0000 0.0000 0.0000 0.0639
As we see, we obtain a diagonal matrix, so the orthogonality condition holds.
But the difference from the weights is that the weights, apart from being orthogonal, are orthonormal, because they are normalized to length one.
> round(t(w.matrix) %*% w.matrix,4)
             Xodd_pls3_w1 Xodd_pls3_w2 Xodd_pls3_w3 Xodd_pls3_w4
Xodd_pls3_w1            1            0            0            0
Xodd_pls3_w2            0            1            0            0
Xodd_pls3_w3            0            0            1            0
Xodd_pls3_w4            0            0            0            1

22 Mar 2018

What condition must have the weights in PLSR?

Well, this can be an easy question: "they are orthogonal to each other". Anyway, I just want to check it, and the simple way is to combine all the weights in a matrix and to multiply:

(Weight.matrix transposed)  * Weight.matrix

and the result must be the identity matrix. So let's check it:

> round(t(w.matrix) %*% w.matrix,4)
             Xodd_pls3_w1 Xodd_pls3_w2 Xodd_pls3_w3 Xodd_pls3_w4
Xodd_pls3_w1            1            0            0            0
Xodd_pls3_w2            0            1            0            0
Xodd_pls3_w3            0            0            1            0
Xodd_pls3_w4            0            0            0            1
So the orthogonality condition holds.

19 Mar 2018

Understanding PLS Regression in a diagram

This is the first part of a diagram of how some of the inner parts of the algorithm work. The diagram still needs to grow with new steps, adding new terms (this is just for the first term) and the matrices "F" and "q".
The idea is to link all these diagrams to the Win ISI diagram.
Everything starts with my previous posts:

How the Weights are calculated in PLS?
How the scores are calculated in PLS ?
How the loadings are calculated in PLS?

18 Mar 2018

How the Weights are calculated in PLS

We have seen how the scores and loadings are calculated when developing a PLS regression, but what about the weights?
To calculate the X scores and the loadings we need the weights, in particular the first one (to start the whole process). After this calculation the scores are computed as projections of the spectra on the weights, and once the scores are calculated we calculate the loadings.

The first weight is calculated by a singular value decomposition of the cross product of X (centered and transposed) and Y, X being the spectra matrix and Y the constituent matrix.

When we do this we get (as we saw in other posts) "u", "d" and "v".
The values of the first weight are in "u".

Xt <- t(X_msc_centered[odd,])   # 100 x 327
Y  <- as.matrix(Prot[odd,])     # 327 x 1
Xt.Y <- Xt %*% Y                # 100 x 1
svd.XtY <- svd(Xt.Y)            # SVD of the cross product
dim(svd.XtY$u)                  # 100 x 1

We continue in next post with the reconstruction of all the PLS process.
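Since Xt.Y has a single column here, its SVD simply normalizes that column (up to sign); a self-contained sketch with toy data:

```r
set.seed(7)
Xc  <- scale(matrix(rnorm(40 * 8), 40, 8), scale = FALSE)  # centered toy spectra
y   <- rnorm(40)                                           # toy constituent
XtY <- crossprod(Xc, y - mean(y))                          # 8 x 1 cross product

w1 <- svd(XtY)$u   # the first weight: XtY normalized to length one (up to sign)

sum(w1^2)                                          # 1: unit length
max(abs(abs(w1) - abs(XtY) / sqrt(sum(XtY^2))))    # ~0
```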

17 Mar 2018

How the loadings are calculated in PLS?

After seeing "How the scores are calculated in PLS?" (in particular the scores for the first term), I now follow with "How the loadings are calculated in PLS?" (and in particular the first loading).

Loadings are calculated as regressions of every column of X on T, so in this case we regress X on "t1_pls" (calculated in "How the scores are calculated in PLS?"). We do it in a matrix way, scaling by t(t1_pls) %*% t1_pls, which is a scalar.

Xt.t1_pls <- t(X_msc_centered[odd,]) %*% t1_pls
p1 <- Xt.t1_pls / as.numeric(t(t1_pls) %*% t1_pls)   # scaled by the scalar t(t1_pls) %*% t1_pls

Now we can plot the reconstructed loading with this code and compare it with the first loading we get from the "plsr" function of the pls package:

plot(p1, type = "l",   # plot call reconstructed from the surviving arguments
     col = "red", xlab = "Wavelengths", ylab = "Loadings Int.")

When we compare the values, we get TRUE for all of them.

How the scores are calculated in PLS

The first step in PLS regression is to calculate the weights. As we said, there is one weight vector for every term of the regression, and we get them in the pls argument $weights; in the case of the regression from the previous post, the first weight is:

Now we have to project our sample spectra over this first weight to get the scores, which we store in a vector "t_1". With all the vectors (t_1, t_2, ..., t_a) we will get the "T" matrix of PLS scores.

Projecting the samples is a matrix multiplication:

 t_1= X.w1

where X is the spectra matrix (MSC-treated and centered) and w1 is the first weight.

As you know, to multiply matrices the number of columns of the first matrix must be the same as the number of rows of the second matrix.
So let's check that our calculated values are the same as the PLS regression scores for the first PLS term:

dim(Xodd_pls3_w1)          # 100 x 1
dim(X_msc_centered[odd,])  # 327 x 100
t1_pls <- X_msc_centered[odd,] %*% Xodd_pls3_w1
# Comparing with the pls-package scores, we get TRUE for all the values

13 Mar 2018

PLS regression (Loadings and Weights)

I have talked about loadings in many posts, but the weights are very important and give a lot of information, and I have not talked about them yet.
There are as many weights as loadings (one for every term)..... But how do they look? Are they similar to the loadings? How do they contribute to the model?
Surely there are many questions about the weights that I will try to understand and share with you in the next posts.
For the moment, see the first four loadings and weights compared in 4 plots with the soy meal data on the conveyor:

12 Mar 2018

Thanks for the interest in the Monitor function

Hi all, and thanks for following the posts and for your interest. I see that some of you ask where the monitor function I use in some of the posts comes from. I have been developing this function for a long time; there was a gap, but now I am testing it quite often and trying to make it more robust.
There is a label called Monitor Function, where I explain some things I did while developing it. I hope to have it ready soon to share with the R community interested in it.
Here are the four plots it gives in general; the function can also be split in four to show just one plot.

PCR vs. PLS (part 7 - cont(b))

Apart from the odd and even sample sets, I have another sample set which is fully independent from the others (samples from different origins, lab values from two different labs, and spectra acquired on a different instrument), and I call it "Test". We can over-plot its RMSEP for the first 10 terms for the PLS and PCR regressions; this way we get an idea of the performance of the calibrations on the new instrument, to which we want to transfer the calibration without using any type of standardization.
So we have to add some code to the previous post and change the legend:




legend("topright", legend = c("RMSECV odd PCR",
       "RMSECV odd PLS", "RMSEP even PCR",
       "RMSEP even PLS", "RMSEP test PCR",
       "RMSEP test PLS"))


Finally we get this plot which is very useful to see the performance of the models:

11 Mar 2018

PCR vs. PLS (part 7 - cont(a))

This post continues with the plots we saw in PCR vs. PLS (part 7), where we looked at the cross-validation error for the training samples (the odd samples from the soy meal) with the models developed with PCR and PLS.
Now we want to check how these models perform with the even samples that we had set apart in a test set.
So we overlap the RMSEP statistics for the test set (for the PLS and PCR models).

#from: PCR vs. PLS (part 7)
## add extra space to right margin of plot within frame
par(mar=c(5, 4, 4, 6) + 0.1)

plot(Xodd_pls3, "validation", estimate = "CV",
     main = " Corr & RMSEP PCR vs PLS")
## New code added






legend("bottomleft", legend = c("PCR-RMSECV train set",
       "PLS-RMSECV train set", "PCR-RMSEP even set",
       "PLS-RMSEP even set"))


As we can see, the model with PLS performs better for the even sample set and also in the cross validation.

7 Mar 2018

Variable Selection (PLS_Toolbox option)

See how the statistics can be improved (for this beer data set) with the variable selection option. Normally we tend to use the full-spectrum options, but there can be cases where we get better validation statistics with variable selection. Doing this manually would mean a lot of trial-and-error work, so an algorithm to find the variables is really useful.
I really like all the plots and figures of this software, and they can give some ideas for working in R. The software seems to have a lot of chemometric tools, so it is interesting to look at other videos about it. You are welcome to leave your comments if you use this software.

5 Mar 2018

PCR vs. PLS (Part 11) - Independent Validation

One of the interesting things about developing a calibration for a particular application (meat meal in this case) is to install it to work in routine and to wait for a long period of time to have a validation set which is really independent, so we can get a true statistic for the application error.
If you have followed the previous posts, the calibration for meat meal was developed on a DA1650, and we used only the odd samples from the data set as a training set, leaving the even samples for validation, so the calibration has almost 650 samples of the 1300 possible. The constituent we are most interested in calibrating is ash, but the others, such as protein, fat and moisture, are of course interesting as well.
We try to develop a robust regression, so the number of terms is important and we don't want to remove too many outlier samples.
The range is quite wide, because we have samples from different sources (instruments, species, lab reference values, ...).
In the previous post we saw the statistics, and they could seem quite high, but let's see the validation.
The validation set is from one of the sources (one of the DA instruments, mixed meat meal, and one of the labs). The samples from this source in the calibration are previous to 2017, while the validation samples were acquired during 2017 and the first months of 2018. So this is an interesting set to check.
We will use the calibration (as I said) from the previous post, which uses 15 terms.
In gray we see the XY plot of the cross validation for the training set, and in blue how the new validation samples fit the regression line.

legend("topleft", legend = c("Odd 15 terms",
       "VAL-2017-MIX", "VAL-2017-MIX-RMSEP 15t=1.47"),
       col = c("grey", "blue", "blue"), pch = c(1, 1, 1), cex = 1)
It is working well in routine, and I am quite happy with the result.