R & Chemometrics: enero 2018

31 ene 2018

Trying to understand the Loadings

The PLS regression in R, give the values of the loadings. In previous posts we decide to use 4 terms, so we have 4 loadings, one for every term.

See that the first term explain almost all the variability in X, while the rest explain quite few. This does not mean that we don´t need them and they are important.

The first loading looks like an spectrum of a sample, so can be related and affected by the scatter. The others look more similar to the regression coefficients we have seen in the previous post.

So we have lo look to the different loadings shapes carefully.

plot(Prot_plsr,"loadings",comps=1:4,legendpos="top",

lty=c(1,2,4,5),col=c(1,2,4,5),ylim=c(-0.3,0.7))

We can compare individually every loading with the regression coefficients, and we can see how the third and fourth loadings seem to contribute to the regression coefficients.

par(mfrow=c(2,2))
plot(Prot_plsr,"loadings",comps=2,legendpos="top",

lty=c(2),col=c(2),ylim=c(-0.3,0.7))
plot(Prot_plsr,"loadings",comps=3,legendpos="top",

lty=c(4),col=c(4),ylim=c(-0.3,0.7))
plot(Prot_plsr,"loadings",comps=4,legendpos="top",

lty=c(5),col=c(5),ylim=c(-0.3,0.7))
plot(coefficients[,,4],type="l")

30 ene 2018

This was useR!2017 - Use R 2018 is coming

Registration and talk proposals open Monday for useR!2018

Trying to understand the regression coefficients

When developing the regression protein in R, one of the values obtained are the regression coefficients which are used to calculate the protein value for every new sample. There is a regression coefficient for every wavelength, so we can make a plot for the regression coefficients which maybe help us to understand how the calibration is working, but is not always like this and sometimes are very difficult to understand or almost impossible.

In the case of the regression make (with 4 terms) for the soy meal in transmittance, we can see the coefficient values like this.

attach(Prot_plsr_r1)
coefficients[,,4]
wavelengths<-as.matrix(seq(850,1048,by=2))
matplot(wavelengths,coefficients[,,4],type="l",
xlab="wavelengths",ylab="coefficients")

We can look for the peaks:

library(quantmod)
findPeaks(coefficients[,,4])

The results are that we have peaks at:

soy_ift_prot1r1$X_msc.....880nm  (Starch band)

soy_ift_prot1r1$X_msc.....910nm  (Protein band)

soy_ift_prot1r1$X_msc.....968nm  (Fat band)

soy_ift_prot1r1$X_msc....1028nm  (Protein / Fat band)
                       

The results make some sense and make me a little more confident that the regression will work fine in routine.

29 ene 2018

Analyzing soy meal in transmittance (part 10 and last)

This is the last post about analyzing soy meal in transmittance. The last option we make was to make a special selection from the original database to see if we get some improvements in the performance of the calibration, but with this option appears some slope, so I prefer the first option were the statistics were quite good but with a Bias.

With all the database the statistics for the PLS regression (with 4 terms) are:
RMSEP..........1,12
SEP............0,529 (Prediction error bias corrected)
RSQ............0,898
and the XY plot is:

With the samples selected with the approach described in the part 9, the statistics are:
RMSEP.......1,12
SEP.........0,633 (Prediction error bias corrected)
Sres........0,547 (Prediction error Slope/Intercept corrected)
RSQ.........0,898
and the XY plot is:

As we can see the first option is the best, and gives quite aceptable errors for protein in intact soy meal in transmittance.

And what about if we calibrate with Win ISI

and validate with a model with the 5 terms in Win ISI:

Samples used for Statistics           27
Slope                                 0.871
Intercept                             5.283
Bias                                  -0.703
SEP                                    0.927
SEP(C)                                 0.616 (Bias corrected)
RSQ                                    0.864

as we can see there are some differences, but the statistics are quite similar and again we have a bias, and an acceptable error.

As a first step we can work with the model adjusted until new samples from the new instrument, be added to the database an the model is recalculated.

Hope you all these posts and let me know if you need more details in the comments.

Analyzing soy meal in transmittance (part 9)

Analyzing soy meal in transmittance (part 8)

Analyzing soy meal in transmittance (part 7)

Analyzing soy meal in transmittance (part 6)

Analyzing soy meal in transmittance (part 5)

Analyzing soy meal in transmittance (part 4)

Analyzing soy meal in transmittance (part 3)

Analyzing soy meal in transmittance (part 2)

Analyzing soy meal in transmittance (part 1)

Analyzing Soy meal in transmittance (part 9)

In the case of the Soy meal, we have a validation sample set (from an Infratec Nova), and we get the predictions from a calibration sample set from an Infratec 1241. As we saw in "Analyzing Soy meal in transmittance (part 8)" , the predictions are fine, but we have a bias and the idea was to merge the samples from the Infratec Nova with the samples from the Infratec 1241 and to develop a calibration.

But before that we want to see another option, and for that we are going to use an option from Win ISI: "Select local samples from a product file".

With this option we project the validation samples in the PC space we got from the calibration samples, and we search for neighbors into a certain cutoff.

Those samples will be take them apart into a file, to develop an exclusive calibration for the validation samples.

In this case I am going to use a cutoff of 0,2 (Mahalanobis distance).

From the 657 samples, 527 were selected and exported to R to develop the calibration. There are some clear outliers, but there is a high improve in the statistics.

Now we remove the samples with number in red, because they are out of the action limits and recalculate.

>soy_ift_prot_sel1_r1<-soy_ift_prot_sel1[-c(495,102,83,270,74),]

>Prot_plsr_sel_r1<-plsr(soy_ift_prot_sel1$Prot~soy_ift_prot_sel1 $Xsel_msc,ncomp=16,data=soy_ift_prot_sel1,validation = "LOO")

>predictions_sel_r1<-(Prot_plsr_sel_r1$fitted.values[,,9])
>soy_ift_prot_sel2_r1<-cbind(soy_ift_prot_sel1_r1$Sample,soy_ift_prot_sel1_r1$Prot,predictions_sel_r1)

>monitor_prot_sel<-monitor10c24xyplot(soy_ift_prot_sel2_r1)

As you can see there is an improvement in the calibration statistics with this selection, but is it an improvement in the validation. We will see it in the next post.

28 ene 2018

Analyzing Soy meal in transmittance (part 8)

Continuing from the post "Analyzing Soy meal in transmittance (part 7)", we are going to remove the four samples which are out of the action limit (residual higher than 3.RMSEP) , and to recalculate the model.
soy_ift_prot1r1<-soy_ift_prot1[-c(183,107,108,267),]
Prot_plsr_r1<- plsr(soy_ift_prot1r1$Prot~soy_ift_prot1r1$X_msc,
                    ncomp = 16,data =soy_ift_prot1,
                   validation = "LOO")
summary(Prot_plsr_r1)
predictions<-(Prot_plsr_r1$fitted.values[,,9])
soy_ift_prot2r1<-cbind(soy_ift_prot1r1$Sample,
                       soy_ift_prot1r1$Prot,
                       predictions)
monitor_prot_r1<-monitor10c24xyplot(soy_ift_prot2r1)
With this code we get the new X-Y plot and the new statistics, and finally we are going to keep this model.

I don´t consider necessary to remove more samples, and the Monitor function give us the distribution of the residuals into the different regions:

  Residuals into 68%   prob (+/- 1SEP)    = 459     % = 70.72419 
  Residuals into 95%   prob (+/- 2SEP)    = 611     % = 94.14484 
  Residuals into 99.5% prob (+/- 3SEP)    = 646     % = 99.53775 
  Residuals outside 99.5% prob (+/- 3SEP) = 3       % = 0.4622496 

As we can see we get a Gaussian distribution for the residuals.

This calibration was done with data from an IFT1241.There is a new instrument called Infratec NOVA, and an exercise has been done in order to check if the calibration developed in an Infratec 1241 can be used in routine in an Infratec NOVA. With this purpose a set of external validation samples had been analyzed in an Infratec NOVA using the same transmittance path length than in the Infratec.

Once the samples had ben analyzed the spectra has been exported and as reference values we add the predicted values obtained in NIR reflectance instruments calibrated with values for the official reference methods.

We will use this data to check the model or adjust it if necessary. We can validate using different number of terms to see if the model is overfitted for this external data set, and we will se that this is the case and that the best results are for 4 terms, but there is a bias due probably to dome differences in the instruments itself.

With 4 terms the validation (with the Monitor function) is:

We see the actual values in red, and that a Bias adjustment is recommended, so with the bias adjustment we would see the yellow dots.
As we can see we have a bias, but the error with the Bias corrected is quite good (SEP=0,529).
If we add more terms the statistics are not so good like this, so maybe the best option is to add this samples to the data base and recalibrate to add the new variability to the model.

27 ene 2018

Analazing Soy meal in transmitance (part 7)

In the values of the constituents, we have some values with zeros, so these values must not be considered during the calibration. If we have a long data set we can look for the minimum and maximum values, and if the minimum is zero we can remove the samples with this value to develop the quantitative models.

Here are the histograms of the data sets without the samples with zeros.

We can make a PLS regression with all the samples and after to remove the outliers we found clear. So for the protein the PLS regression would be:

Prot_plsr<- plsr(soy_ift_prot1$Prot~soy_ift_prot1$X_msc,
ncomp = 16,data =soy_ift_prot1,
validation = "LOO")

where we use the LOO (leave one out) validation.

The LOO cross validation, will help us to decide which is the best number of terms to choose for the regression, so we can look to one of the explained variance plot, where we can see how the RMSEP decrease as the number of terms increase, but there will be a certain number of PLS terms where the RMSEP stay stable or even increase, so we must nor choose more terms than necessary in order not to over fit the model.

plot(Prot_plsr,"validation",estimate="CV")

If we look to the regression summary,

summary(Prot_plsr)

we can see that the best number of terms for the regression is nine.

Let´s see the statistics in a XY plot, and for it I am going to use a Monitor function I developed in R some time ago.

predictions<-(Prot_plsr$fitted.values[,,7])

soy_ift_prot2<-cbind(soy_ift_prot1$Sample,

soy_ift_prot1$Prot,

predictions)
monitor10c24xyplot(soy_ift_prot2)

As we can see we must remove some outliers, which are out of the action limit (numbers in red), and decide what to do with the samples are out of the warning limit (numbers in orange).

The Monitor function take apart those sample, so we can remove them from the data frame and recalculate.

26 ene 2018

White paper about Artificial Neural Networks

White paper:

Artificial Neural Networks and Near Infrared Spectroscopy - A case study on protein content in whole wheat grain

25 ene 2018

Quantitative quick view (Soy meal analyzed in transmitance)

During this tutorial about the analysis of Soy meal in transmitance with an Infratec, we will see also the quantitative analysis, once we have seen the spectra characteristics.

This is just a quick look to a quantitative analysis to see where the extreme spectra samples appear in the XY plots of the quantitative analysis.

As we can see in some of the plots appear the sample "298" as an outlier (specially in Moisture). There are clearly chemical outliers with a high residual, due probably, to a mistake in the lab value.

Soy meal is normally analyzed by NIR instruments and not by NIT instruments, but we will continue with this work to see what statistics we can obtain from the equations and from a validation.

24 ene 2018

Analyzing Soy meal in transmitance (part 6)

A good way to see the variance explained by the PCs is a 2D plot, where we see the projection of the scores over the PC terms, so it is a way to see in which PC term we can see a discrimination, or outliers.
In the case of the soy meal data, we can see the distribution of the scores in the plane formed by the first and second principal components.

and now imagine projecting the dots o perpendicularly over the axes (PCs), and this projections are the perpendicular dots of the next plot for the first and second PCs. In the projections of the first PC we see clearly out the samples 298 and 296, and if we would make a zoom of the second PC projections we would see clearly out the samples 373 and 298.

As we can see in this plot the whole variance of the data is explained by the first three PCs.

23 ene 2018

Analyzing Soy meal in transmitance (Part 5)

Under all these post about Analyzing Soy meal in transmitance there is the excuse to work with "R" and to see a lot of chemometric functions which the available R packages offer.
So continuing with this this is the fifth post about it.

We are more use to see the Mahalanobis distance with ellipses, so let see the same as in the previous post with the "drawMahal" function of the Chemometric package.
First we use the Nipals algorithm to calculate the score matrix T and the loading matrix P with the X matrix with the Math treatment MSC (Multiple Scatter Correction).
X_msc_nipals<-nipals(X_msc,a=2)
T_msc<-X_msc_nipals$T
P_msc<-X_msc_nipals$P
drawMahal(T_msc,center=apply(T_msc,2,mean),
          covariance=cov(T_msc),
          quantile=0.975,col="blue",
          xlab="PC1",ylab="PC2")
identify(T_msc)

As we can see we have the same outliers, but we see them in a different way.

22 ene 2018

Analyzing Soy meal in transmitance (Part 4)

This is the fourth of the posts about analyzing soy meal unground in an Infratec, adding the sample directly to the conveyor in the same way that we do with wheat or barley. The range wavelength in the Infratec is from 850 to 1050nm, in steps of 2 nm, so we have a total of 100 data points.

Continue from posts:
"Analyzing Soy meal in transmitance (part 3)"
"Analyzing Soy meal in transmitance (part 2)"
"Analyzing Soy meal in transmitance (part 1)"

I am use to look for outliers using the Mahalanobis distance (MD), which is based in the scores values for the samples in the Principal Component Space.

There are several packages in R, to see the value of the MD, and one of them is the package "Chemometrics", so we load this package and run sam script wit the values we have get from the previous post.

We can fit to ablines to configure the MD, one for Warning with a vaue of 3.00 and another for the Action with a value of 4.00. The line for warning is orange and for action is orange.

library(chemometrics)
X_msc_pca<-princomp(X_msc,cor=TRUE)
res<-pcaDiagplot(X_msc,X_msc_pca,a=2)
plot(res$SDist,ylim=c(0,8),ylab="score distance",

xlab="sample number")
identify(res$SDist)
abline(h=4,col="red")
abline(h=3,col="orange")

and we get this plots, where we can see how samples 298 and 373 appear as outliers, specially for their high scores values in the PC number 2 as we have seen in the previous posts.

21 ene 2018

Analyzing Soy Meal in transmitance (part 3)

This is the third of the posts about analyzing soy meal unground in an Infratec, adding the sample directly to the conveyor in the same way that we do with wheat or barley. The range wavelength in the Infratec is from 850 to 1050nm, in steps of 2 nm, so we have a total of 100 data points.

Continue from post: "Analyzing Soy meal in transmitance (part 2)"

We can change the scale of the loading plots we have seen in the last post in order to over-plot them with the scores of the samples in the plane formed by the first two principal components.

So first step is to change the scale of the plot in the left to the scale on the right:

Now if we use the biplot function, we can see the scores and the loadings in the same plot:
biplot(X_msc_prcomp)

we see that all the samples are quite grouped in this plane, but there are samples (373 and 298) which are out of the group, specially in the direction of the second PC.

we want to keep these samples in a set called "extremes", to check them apart.

extremes<-c(298,373)
X_msc_extr<-scale(X_msc,scale=FALSE)[extremes,]
matplot(wavelengths,t(X_msc_extr),type="l",
xlab="Wavelengths (nm)",ylab="Intensity (mean-scaled")

Now we can compare these samples with all the samples in the data set (including the extremes).

We can identify clearly sample 298 in the two plots, but 373 is not easy to see on the left plot, but it is in the direction of sample 298, so we can see the constituents values of this samples to have an idea why they are extremes in the second PC.

20 ene 2018

Analyzing Soy meal in transmitance (part 2)

This is the second of the posts about analyzing soy meal unground in an Infratec, adding the sample directly to the conveyor in the same way that we do with wheat or barley. The range wavelength in the Infratec is from 850 to 1050nm, in steps of 2 nm, so we have a total of 100 data points.

Continue from post: "Analyzing Soy meal in transmitance (part 1)"

Once we have decided one of the math treatments to work, we can apply a Principal Components analysis to the data. This way we can understand better the structure of the data.
X_msc_prcomp<-prcomp(X_msc)

This way we obtain two importan matrices, the score matrix and the loadings matrix (We have been talking about this matrices in other posts).
In this post we are going to check the loadings that we can see graphically in two ways: as spectra or in the Principal Component space.
If we want to se them as spectra (first three loadings), run this script in R:
>matplot(wavelengths,X_msc_prcomp$center,type="l",
xlab="wavelengths",ylab="transmitance")

Or we can see them in the Principal Components space, were we can see the range of variation

and this is the script for this last plot:

par(mfrow=c(1,2),pty="s")
offset<-c(0,0.09) # to create space for labels
plot(X_msc_prcomp$rotation[,1:2],

type="l",xlim=range(X_msc_prcomp$rotation[,1])+offset,
xlab="PC1",ylab="PC2")
#identify(X_msc_prcomp$rotation[,1:2])
points(X_msc_prcomp$rotation[c(33,64,100),1:2],pos=4)
text(X_msc_prcomp$rotation[c(33,64,100),1:2],pos=4,

labels=paste(c(914,976,1050),"nm"))

offset<-c(-0.05,0.25) # to create space for labels
plot(X_msc_prcomp$rotation[,2:3],type="l",

     xlim=range(X_msc_prcomp$rotation[,1])+offset,
     xlab="PC2",ylab="PC3")
#identify(X_msc_prcomp$rotation[,2:3])
points(X_msc_prcomp$rotation[c(33,64,100),2:3],pos=4)
text(X_msc_prcomp$rotation[c(33,64,100),2:3],pos=4,
     labels=paste(c(914,976,1050),"nm"))

19 ene 2018

Implementing Good Product in Win ISI Diagram

New features Good Product added to Win ISI Diagram.

See Video in LinkedIn

Video in iPad

https://www.linkedin.com/feed/update/urn:li:activity:6357682248438943744

18 ene 2018

Analyzing Soy meal in transmitance (part 1)

One of the common applications in NIR analysis is the measure of soy meal, to predict Moisture, Protein, Fat and Fiber. As we know, Protein is the most important parameter and it is important to get an accurate prediction.

What about to measure soy meal in a transmittance instrument like Infratec?. Infratec has a smaller range, but this range (850 to 1050 nm) penetrate most into the sample, so we can measure in transmittance with a certain path length to avoid saturation. With this purpose, a certain number of samples with known reference value for the parameters was analyzed in the instrument, putting the soy meal unground and directly into the conveyor.

Spectra of the samples was export in a spectra file and lab values added.

Spectra file was export into R software as raw spectra, and a multiple scatter correction was added.

X<-as.matrix(sm_ift[,6:105])
wavelengths<-as.matrix(seq(850,1048,by=2))
matplot(wavelengths,t(X),type="l",
xlab="wavelengths",ylab="transmitance")

#Math Treatments
#Multiple Scatter Correction
library(pls)
X_msc<-msc(X)
matplot(wavelengths,t(X_msc),type="l",
xlab="wavelengths",ylab="transmitance")

#Mean Centering
#We can check in wich areas is the variation of the data.
X_msc_mc<-scale(X_msc,scale=FALSE)

#We can see the variation at every wavelength with a boxplot.
boxplot(X_msc)