20 may. 2018

Monitoring LOCAL calibrations (part 2)

Following with the previous post when monitoring LOCAL equations we have to consider several points. Some of them could be.
  1. Which is the reference method and what is the laboratory error for the reference method. Ex:Reference method for protein can be from Dumas or from Kjeldahl .
  2. Do you have data from several laboratories or from just one?. Which reference method use every lab and what is their laboratory error. 
  3. If we have data in the Monitor from both methods, or from different laboratories, split them apart and see which error  SEP do we have. Check also the SEP corrected by the Bias.
  4. Check the performance of the SEP and other statistics for every specie. In meat meal for example you have pork, mix, poultry, feather,....
Sure you can find more knowledge about the data:
  1. Do we have more error with one method than another, with one lab than another?.
  2. Does perform a reference method better for one particular specie than another?.
  3. ................................
Make conclusions from these and other thoughts.
For all these, is important to organize well the data and use it fine. Don´t forget that we are comparing reference versus predicted, and that the prediction depends of the model itself apart from the reference values, so we have to be sure that our model is not overfitted or wrong developed.

19 may. 2018

Monitoring LOCAL calibrations

Databases to develop Local calibrations has normally a high quantity of spectra with lab values, but we have to take care of  them adding new sources of variance. This way we make them more robust and the standard prediction errors (SEP) decrease when we validate with future independent validation sets.
This was the case with a meat meal local database updated with 100 samples with protein values, and with new source of variation as: laboratories,  reference methods (Kj, Dumas), providers, new instruments,...
After the LOCAL database update, a set of new samples was received with reference values and I have predicted this values with the Monitor function in Win ISI with the LOCAL database before (blue dots) and after the LOCAL update (red dots).
The errors decrease , specially for some types of samples, in an important way when validating with the new set of samples (new samples acquired after the updated Local calibration was installed in the routine software), so even if we have spectra from this instrument, labs, ...., this set has to be considered as an independent set.
I don´t give details of the statistics but this picture show the same samples predicted with the LOCAL calibration without update (in blue), and predicted with the LOCAL calibration update (in red), the improvement for the prediction for some samples is important, so the idea is to add this new samples and continuing monitoring the LOCAL database with future validation sets.

17 may. 2018

Nils Foss died on 16-May-2018

Sad news from FOSS: Nils Foss (Foss founder) died yesterday.
He was an good example of entrepreneur.

15 may. 2018

Random selection before calibration.

Following a course about Machine Learning with R, I realize of the importance of the random selection of the samples for the development of the models.
R has good tools to select the samples randomly and to do it was a common practice during all the exercises in the course.
I do it, in the case with the soy meal samples I have used for several posts, so we will compare the results.
The idea of random selection is to make the model robust against the bias which we see quite often when validating the models with independent samples.
We can see if the number of the terms selected change or if we get similar results to the previous selection using an structure selection of odd and even samples.
Random selection order is important also for a better cross validation.
Here is the code and preparation of the sample sets for the development of the models.

##################### SPLITTING SAMPLES RANDOMLY #############  
#In this case we need first the dataframe "soy_ift_conv"
#Split the data into 65% training and 35% test

#Plotting Training and Validation sets overlapped.


We see in gray the validation samples selected and in blue the training samples.


2 may. 2018

First approach to ANN calibrations

This is a first approach to develop ANN calibrations with the soymeal data from Infratec and it is really promising.
I follow this code and plot the results:

#We build a dataframe with the constituent values
#and with the spectra math treated with MSC
Y<-as.matrix(sm_ift[,2])        # La matriz Y son los datos de proteina.
which(is.na(Y), arr.ind=TRUE)   # La 462 tiene por valor NA y la quitamos
X<-as.matrix(sm_ift[,6:105]) # La matriz X son los datos NIR
##PRINCIPAL COMPONENTS ANALYSIS using package "Chemometrics" NIPALS)

#' Split the data into 65% training and 35% test

#' Create an neural network model with 3 hidden
#' nodes of Y~. using the training data.
#' Store this model as 'model'

#' Use this model to estimate the Y values of the test data
#' Calculate the MSE of the model on the test data and output
#' it using the print or cat functions
cat("The MSE of the model on the test data is: ",mse,"\n")
#The MSE of the model on the test data is:  0.9314747

The loss function in this case is MSE (Mean Square Error).
More practice these coming days

30 abr. 2018

Validation with LOCAL calibrations

When developing a LOCAL calibration in Win ISI, we use an input file and a Library file with the idea the idea to select the best possible model, so the selection of a well representative input (let´s call validation set) is very important to have success in the development of the model.
So the model is conditioned to the input file, if we have choose another input file we could have get another model which performs different, so the need of a Test Set is obvious to check how the model performs with new data.
It is important to have this in mind, so one proposal would be to divide (randomly) the data in three sets: 60% for training, 20% for Input or Validation, and another 20% for testing.
There are other ways to sort the data in order to select these three Sets (time, seasons, species,...). One thing is clear, some of the models developed will perform better than others, so you can keep several of them and you can check this when you have new data and use an statistic an MSE (Mean Square Error) to compare them.

19 abr. 2018

Projections over the planes in PC space

This is a plot in 3D to see the orthogonal projections over the plane formed by first and second PC calculated with SVD. Once projected over the plane, projections arte projected again on the new axis (all them: terms or loadings,.....) to calculate the score for every PC.
Plot can be moved with the mouse so find the better view.

Data is centered so we see samples on both sides of the plane.

17 abr. 2018

Covariance plot and Mahalanobis ellipse

In the previous post we have seen  the correlation plot and in this one the covariance plot  matrix, we just have to change the function "cor" for "cov" in the code.
Covariance  matrix is used frequently in chemometrics, like in this case to define the ellipse for the Mahalanobis distance using the covariance between the two axes (one for the 964 wavelength and the other the linear combination of wavelengths 1022 and 902 that we have seen in other recent post.
We can do this easily with the package chemometrics and this code:



to get this plot:

16 abr. 2018

Correlation Matrix plot

In the last posts we are talking about the wavelength space due to its high collinearity, because we want to select wavelengths with few correlation between them in order to develop a model.
In this task we can check the correlation matrix, which is better to check in a plot than with numbers. This is the plot for the soy meal samples in transmitance using the 100 wavelengths from 850 to 1048 nm in steps of 2 nm, so the correlation matrix is a 100.100 diagonal and symmetric matrix as can be seen in the plot.
The red line is the correlation of the 962 nm wavelength with all the rest including itself (1 in this case). The vertical blue lines are the wavelengths at 1022,902 and 962 used in the recent posts.
See the Correlation matrix plot and code: