16 ago. 2018

Checking the slopes in Validation Groups

This is an study to develop calibrations for meat in a reflectance instrument from 1100 to 1650 nm. Normally meat are measured in transmitance but this is an approach to do it in reflectance.
I have just 64 samples with fat laboratory data. I split the spectra into 4 sets of 16 samples and merge 3 of leaving the other three for external validation. So I have 48 samples for training and 16 for validation and I can develop four calibrations and validate with 4 external sets.
Considering that we have few samples are in the training set, I have to use few terms. The SEPs for external or Cross Validation are quite high , but the idea here is to see the changes in the slope for the four validation sets.
The reason is that we have few samples and the slope value will stabilize as soon as more samples are included into the calibration and validation sets.
To improve the SEP we have to check the sample presentation method for this product and the procedure to obtain the laboratory reference method.

3 ago. 2018

Monitoring the performance with the histogram

NIR can be used to detect levels food additives and check if they are in the right limits.
In this cases there are several types of doughs, and they use two levels of additive concentration depending on the type. So we have always the same reference data.
A calibration is developed and we have new data to validate. NIR will give  results which I expect to be covering the reference value with a Gauss distribution.
Using the Monitor function I can see the prediction distribution vs. the reference distribution and check if the expectations are fine.

In the case of the higher concentration is fine, and in the lower concentration is skewed (that is why the S/I adjust is suggested).This can be a first approach to continue with this application with mor accurate reference values.

13 jul. 2018

Validating Resemble Model with Monitor function

Continuing with this post evaluating the LOCAL model developed in Resemble. This time I use the Monitor function (one of the Monitor functions I am developing).

I create different subsets from the validation sample set for the different categories, In this case is for one time of puppies, and I am evaluating the moisture content. We can see that there are two outliers that increase the SEP, so we have to see if we remove this samples for some reasons.

Let´s validate first with this type of puppy subset and check the statistics:

> val1.moi.pup1<-subset(val1.moi,ID1u_moi=="PUPPY-1")
> val1.moi.pup1<-cbind.data.frame(val1.moi.pup1$Sample.u_moi, 
+                                 val1.moi.pup1$Yu_moi, 
+                                 val1.moi.pup1$predicted.moi.local)
> monitor10c24xyplot(val1.moi.pup1)

Samples with the Sample IDs 463 y 456 are out of the action limits and the monitor function shows their position in the table:
$ResWarning [1] id ref pred res <0 rows> (or 0-length row.names) $ResAction id ref pred res 34 456 3.7 7.793351 -4.093351 32 463 4.9 7.881543 -2.981543

Now we can remove this samples knowing their position and recalculate:

6 jul. 2018

Plots in Resemble (part 2)

Good results for the prediction of the validation samples (Xu, Yu) for protein. This is the XY plot where we can see in different colors the classes of the validation samples (different types of petfood). the SEP is 0.88 (without removing outliers) . Defining the data frame by classes will allow us to see the SEP for every class so we can check which class needs for more samples in the training database (Xr, Yr) or to check for other reasons.


El error SEP de las muestras de validación de proteína es   :  0.887 
El R cuadrado para las muestras de validación de proteína es:  0.962 

29 jun. 2018

Plots in Resemble (Part 1)

Resemble allow a certain number of plots which are very useful for your works or personal papers. In this case I use the same sets than of the previous post and I plot the PCA scores, where I can see the training matrix (Xr) scores and the validation matrix (Xu) scores overlapped.
Validation set is a 35% (randomly selected) of the whole sample population obtained from a long time period.
We can see how the validation samples cover more or less the space of the training samples.

> par(mfrow=c(1,2))
> plot(local.mbl, g = "pca", pcs=c(1,2)) 
> plot(local.mbl, g = "pca", pcs=c(1,3))

28 jun. 2018

Using correlation in LOCAL (Resemble package)

> ctrl.mbl <- mblControl(sm = "cor",
                  pcSelection = list("cumvar", 0.999),
                  valMethod = "NNv",
                  scaled = FALSE, center = TRUE)
 > local.mbl <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
mblCtrl = ctrl.mbl,
                   dissUsage = "none",
                   k = seq(40, 150, by = 10),
                   pls.c = c(5, 15),
                   method = "wapls1")

Predicting sample: 1  ----------
Predicting sample: 2  ----------
Predicting sample: 3  ----------
Predicting sample: 4  ----------
Predicting sample: 5  ----------
> plot(predicted.local,Yu)


This time I use correlation (as Win ISI use) and try to find the best number of samples to select for the LOCAL algorithm with a sequence. 

As we can see the predictions improve with more samples in the calibration (red dots), maybe could be better win more samples by at the end of the plot it start to stabilize.

27 jun. 2018

LOCAL Calibrations with Resemble

Really interesting the Resemble package so I am trying to work and understand it better even to help me in the case of Win ISI LOCAL calibrations.
We can get predictions for different combinations of local selected samples for the calibration to predict the unknown, so we can see the best option. We use a certain number of terms (min. and max.) and a weighted average  is calculated.
In this case I use an external validation set of petfood Xu with Reference data (protein) Yu, and I want to know the statistics (RMSE and R square) for the case of 90 local samples selected:

predicted.local <- as.numeric(Yu_anl$Nearest_neighbours_90)
> rmse.local <- sqrt(mean((Yu - predicted.local)^2))
> R2.local <- cor(Yu, predicted.local)^2
> R2.local
[1] 0.9507232
> rmse.local
[1] 1.163304
There are a lot of options to explore, so I will continue checking this package.

26 jun. 2018

Memory based learning methods and tools (Resamble / LOCAL)

This is the link to this presentation which help us to understand the concept of LOCAL that we will treat during next posts with the "Resamble package" and we have treated and we will continue with LOCAL in Win ISI.

Developing LOCAL calibrations with R

We can use also LOCAL in R with the Resemble package. I am testing the package these days with a set of petfood spectra (with protein reference values) imported from Win ISI with SNV and a second derivative math treatment. After, I select 65% for training and the rest for test.
The get predictions process of Resemble allow a configuration to check for the better number of sample or factors for the better prediction, so there are a lot of options and functions to check in this package.
This is a plot of the results for a standard configuration from the reference manual, that I would try to go more deep into, trying to find the best configuration.

ctrl <- mblControl(sm = "pls",
                   pcSelection = list("opc", 40),
                   valMethod = c("NNv"),
                   scaled = FALSE, center = TRUE)

ex1 <- mbl(Yr = Yr, Xr = Xr, Yu = NULL, Xu = Xu,
           mblCtrl = ctrl,
           distUsage = "predictors",
           k = seq(30, 150, 15),
           method = "wapls1",
           pls.c= c(7, 20))


Clearly seems that some of the configurations have overfitting, but I am just starting to learn the package so more post will come up giving my progress with this package.

14 jun. 2018

Precission and Accuracy LAB vs NIR

Finally happy with this plot trying to explain the precision and accuracy of the laboratory vs. the NIR predictions for a set of samples and subsamples. I will explain more detail of this plots in coming posts.

6 jun. 2018

Subsample Predictions Boxplot

This is a boxplot where there are four subsamples of meat meal predictions. A representative of a certain batch has been divided in four subsamples and analyzed in a NIR. So we get four predictions, one for every subsamples, so the  the boxplot gives an idea of the variance in the predictions for every sample based on their subsamples.
The colors are because the subsamples had been send to two different labs, so one is represented by one color. Colors had certain transparency because in some cases, two samples went to a lab and two to the other, in other cases the four subsamples went to the same lab and even in some cases three to one lab and one to another.
All these studies give an idea of the complexity of the meat meal product.

3 jun. 2018

Average and subsample residuals

In order to understand better the performance of a model, different blind subsamples of a sample had been sent to a laboratory, so in some cases we have the lab values of four subsamples of a sample and in other cases two subsamples of a sample. There are two cases with only one subsample.
For every subsample we calculate the average for the lab, and the average for the predictions, to get the red dot residuals.
We have also the residual of every subsample vs its prediction and those are the gray dots.
The plot (with R) gives a nice information about the performance of the model and how the average performs better in most cases than the individual subsamples.
We can see the warning (2.SEL) and action limits (3.SEL), and how the predictions for the average fall into the warning limits.

31 may. 2018

Comparing Residuals an Lab Error

One important point before  develop a calibration is to know the laboratory error. This error, known as SEL (Standard Error Laboratory).
This error change depending of the product, because it can be very homogenous or heterogeneous, so in the first case the lab error is lower than in the second case.
In the case of meat meal the error are higher than for other products and this is a case where I am working these days and I want to share in this posts.
A certain number of samples (n) well homogenized had been divided into two or four subsamples and had been send to a laboratory for the Dumas Protein analysis. After receive the results, a formula used for this case, based on the standard deviation of the subsamples and in the number of samples, gives the laboratory error for protein in meat meal.
The result is 1,3 . Probably some of you can think that this is a high value, but really, the meat meal product is quite complex and I think is normal.
Every subsample that went to the lab has been analyze in a NIR, and the spectra of the subsamples studied apart with the statistic RMS that shows that the samples were well homogenized.
Now I have the option to average the predictions of the NIR predictions (case 1), or to average the spectra of the subsamples, predict it with the model and get the predicted result (case 2). I use in this case the option 1 and plot a residual plot with the residuals of the average predictions subtracted from the lab average value:

Blue line is +/- “ 1.SEL”, yellow +/- “ 2.SEL” and red +/- “ 3.SEL”.


20 may. 2018

Monitoring LOCAL calibrations (part 2)

Following with the previous post when monitoring LOCAL equations we have to consider several points. Some of them could be.
  1. Which is the reference method and what is the laboratory error for the reference method. Ex:Reference method for protein can be from Dumas or from Kjeldahl .
  2. Do you have data from several laboratories or from just one?. Which reference method use every lab and what is their laboratory error. 
  3. If we have data in the Monitor from both methods, or from different laboratories, split them apart and see which error  SEP do we have. Check also the SEP corrected by the Bias.
  4. Check the performance of the SEP and other statistics for every specie. In meat meal for example you have pork, mix, poultry, feather,....
Sure you can find more knowledge about the data:
  1. Do we have more error with one method than another, with one lab than another?.
  2. Does perform a reference method better for one particular specie than another?.
  3. ................................
Make conclusions from these and other thoughts.
For all these, is important to organize well the data and use it fine. Don´t forget that we are comparing reference versus predicted, and that the prediction depends of the model itself apart from the reference values, so we have to be sure that our model is not overfitted or wrong developed.

19 may. 2018

Monitoring LOCAL calibrations

Databases to develop Local calibrations has normally a high quantity of spectra with lab values, but we have to take care of  them adding new sources of variance. This way we make them more robust and the standard prediction errors (SEP) decrease when we validate with future independent validation sets.
This was the case with a meat meal local database updated with 100 samples with protein values, and with new source of variation as: laboratories,  reference methods (Kj, Dumas), providers, new instruments,...
After the LOCAL database update, a set of new samples was received with reference values and I have predicted this values with the Monitor function in Win ISI with the LOCAL database before (blue dots) and after the LOCAL update (red dots).
The errors decrease , specially for some types of samples, in an important way when validating with the new set of samples (new samples acquired after the updated Local calibration was installed in the routine software), so even if we have spectra from this instrument, labs, ...., this set has to be considered as an independent set.
I don´t give details of the statistics but this picture show the same samples predicted with the LOCAL calibration without update (in blue), and predicted with the LOCAL calibration update (in red), the improvement for the prediction for some samples is important, so the idea is to add this new samples and continuing monitoring the LOCAL database with future validation sets.

17 may. 2018

Nils Foss died on 16-May-2018

Sad news from FOSS: Nils Foss (Foss founder) died yesterday.
He was an good example of entrepreneur.

15 may. 2018

Random selection before calibration.

Following a course about Machine Learning with R, I realize of the importance of the random selection of the samples for the development of the models.
R has good tools to select the samples randomly and to do it was a common practice during all the exercises in the course.
I do it, in the case with the soy meal samples I have used for several posts, so we will compare the results.
The idea of random selection is to make the model robust against the bias which we see quite often when validating the models with independent samples.
We can see if the number of the terms selected change or if we get similar results to the previous selection using an structure selection of odd and even samples.
Random selection order is important also for a better cross validation.
Here is the code and preparation of the sample sets for the development of the models.

##################### SPLITTING SAMPLES RANDOMLY #############  
#In this case we need first the dataframe "soy_ift_conv"
#Split the data into 65% training and 35% test

#Plotting Training and Validation sets overlapped.


We see in gray the validation samples selected and in blue the training samples.


2 may. 2018

First approach to ANN calibrations

This is a first approach to develop ANN calibrations with the soymeal data from Infratec and it is really promising.
I follow this code and plot the results:

#We build a dataframe with the constituent values
#and with the spectra math treated with MSC
Y<-as.matrix(sm_ift[,2])        # La matriz Y son los datos de proteina.
which(is.na(Y), arr.ind=TRUE)   # La 462 tiene por valor NA y la quitamos
X<-as.matrix(sm_ift[,6:105]) # La matriz X son los datos NIR
##PRINCIPAL COMPONENTS ANALYSIS using package "Chemometrics" NIPALS)

#' Split the data into 65% training and 35% test

#' Create an neural network model with 3 hidden
#' nodes of Y~. using the training data.
#' Store this model as 'model'

#' Use this model to estimate the Y values of the test data
#' Calculate the MSE of the model on the test data and output
#' it using the print or cat functions
cat("The MSE of the model on the test data is: ",mse,"\n")
#The MSE of the model on the test data is:  0.9314747

The loss function in this case is MSE (Mean Square Error).
More practice these coming days

30 abr. 2018

Validation with LOCAL calibrations

When developing a LOCAL calibration in Win ISI, we use an input file and a Library file with the idea the idea to select the best possible model, so the selection of a well representative input (let´s call validation set) is very important to have success in the development of the model.
So the model is conditioned to the input file, if we have choose another input file we could have get another model which performs different, so the need of a Test Set is obvious to check how the model performs with new data.
It is important to have this in mind, so one proposal would be to divide (randomly) the data in three sets: 60% for training, 20% for Input or Validation, and another 20% for testing.
There are other ways to sort the data in order to select these three Sets (time, seasons, species,...). One thing is clear, some of the models developed will perform better than others, so you can keep several of them and you can check this when you have new data and use an statistic an MSE (Mean Square Error) to compare them.

19 abr. 2018

Projections over the planes in PC space

This is a plot in 3D to see the orthogonal projections over the plane formed by first and second PC calculated with SVD. Once projected over the plane, projections arte projected again on the new axis (all them: terms or loadings,.....) to calculate the score for every PC.
Plot can be moved with the mouse so find the better view.

Data is centered so we see samples on both sides of the plane.

17 abr. 2018

Covariance plot and Mahalanobis ellipse

In the previous post we have seen  the correlation plot and in this one the covariance plot  matrix, we just have to change the function "cor" for "cov" in the code.
Covariance  matrix is used frequently in chemometrics, like in this case to define the ellipse for the Mahalanobis distance using the covariance between the two axes (one for the 964 wavelength and the other the linear combination of wavelengths 1022 and 902 that we have seen in other recent post.
We can do this easily with the package chemometrics and this code:



to get this plot:

16 abr. 2018

Correlation Matrix plot

In the last posts we are talking about the wavelength space due to its high collinearity, because we want to select wavelengths with few correlation between them in order to develop a model.
In this task we can check the correlation matrix, which is better to check in a plot than with numbers. This is the plot for the soy meal samples in transmitance using the 100 wavelengths from 850 to 1048 nm in steps of 2 nm, so the correlation matrix is a 100.100 diagonal and symmetric matrix as can be seen in the plot.
The red line is the correlation of the 962 nm wavelength with all the rest including itself (1 in this case). The vertical blue lines are the wavelengths at 1022,902 and 962 used in the recent posts.
See the Correlation matrix plot and code: 





14 abr. 2018

ellipses and ellipsoids in the wavelength space

Let´s look to the plane from the previous post:

We can see how the dots form like an ellipse, and this is a characteristic when plotting some wavelengths versus others.

In this case we see the ellipse in a plane but we can see them as well as ellipsoids in 3D or more dimensions.

13 abr. 2018

Regression planes in R (wavelength space)

We have seen the high correlation between wavelengths in previous posts and how we can reduce the wavelength space from lower dimension plots. 

In the case of the 3 wavelengths selected in the previous post at 1022, 964 and 902 nm. 

We can use the wavelength at 1022 nm as the dependent variable and to calculate a MLR regresión to predict the absorbances at this wavelength as a linear combination of the absorbances at 902 and 964 nm calculating a regression plane;

    #1022nm     datapoint 87    
  # 902nm     datapoint 27  
  # 964nm     datapoint 58   

s3d<-scatterplot3d(x1,x2,x3,pch=16,highlight.3d = TRUE,
                   angle=330,xlab="902 nm",
                   ylab="964 nm",zlab="1022 nm")       
s3d$plane3d(fit,lty.box = "solid")

We can see the plane looking to the new regression plane plot:

8 abr. 2018

Intercorrelation between constituents and wavelengths

In the post "Linear Combinations to improve correlation" we selected three wavelengths where normally we found overtones for Protein, Fat and Fiber (Cellulose), and we found a linear combination of these three wavelengths to improve the correlation with the Protein constituent.
In this one we continue with the study of this data set. We have not just the Protein constituent, we have the values for Moisture, Fiber and Fat as well in the "Y" matrix and we want to check the inter-correlation between the constituents.
In the case we have NA values for some sample we won´t get the correlation value, and this is the case, so we remove first the samples with NA values and calculate the correlation between the constituents:


           Protein     Moisture     Fiber       Fat
Protein   1.0000000  0.15406257 -0.83770097 -0.16767249
Moisture  0.1540626  1.00000000 -0.16143400 -0.09022824
Fiber    -0.8377010 -0.16143400  1.00000000  0.07874072
Fat      -0.1676725 -0.09022824  0.07874072  1.00000000

We see how we find a high negative correlation between Fiber and Protein, but this is normal in the case of soy meal.
But do we have high correlation between the absorbance of the wavelengths we associate with Protein and Fiber.

#1022nm  datapoint 87       Protein or Oil
#1008nm  datapoint 80       Oil or Water
# 996nm                     Oil or Water
# 902nm  datapoint 27       Cellulose      
# 964nm  datapoint 58       CH2 Oil       

> cor(X_msc[,87],X_msc[,27])
  [1] -0.9957298
plot(x1,x2,xlab="1022 nm  Protein?",ylab="902 nm Fibre?",
     col="green",main = "cor(X_msc[,87],X_msc[,27])")

7 abr. 2018

Linear combinations to improve correlation

With the data from soy meal on IFT conveyor, I select three wavelengths for this demo:
# 1022nm     datapoint 87     Protein or Oil  
#  902nm     datapoint 27     Cellulose        
#  964nm     datapoint 58     CH2 Oil


We have the values for Protein for these spectra.

Protein <- Prot

Let´s see the wavelengths in the mean centered MSC treated spectra


now see how the correlation becomes better in the case of the 4td plot where the X axis is a linear combination of the other 3 wavelengths
We have worked with this data before with PLS and PCR and what we have done here is a MLR approach. An intercept value will place the date on the same scale.

5 abr. 2018

X spectra matrix redundancy

We know about the redundancy in the columns of the wavelength matrix X, where there is a high correlation between many of the wavelength so we should create new matrices to  represent the linear combinations of the X matrix.
In this picture, I select 3 wavelengths  of the 100 and we can see how two of them are highly correlated and the information can be represented in a plane indeed in a cube, so we change the space from 3 variables to just 2.
Columns X1 and X3 has a high correlations so they are dependent vectors because they go in the same direction. X2 goes in a different  direction so it is independent  from X1 or X3.