14 oct. 2018

CARET: Splitting Based on the Predictors

I'm practicing with CARET, and the best way to do it is to follow the tutorials on its web page. This time I look at how we can split the data with Caret:

4.2: Splitting Based on the Predictors

Read it and try to understand the concept.
I tried to write the code to reproduce the plot in that figure, and more or less I managed to do it:

library(caret)
library(mlbench)      # for the BostonHousing data
data(BostonHousing)

testing <- scale(BostonHousing[, c("age", "nox")])
set.seed(5)
## A random sample of 5 data points
startSet <- sample(1:dim(testing)[1], 5)
samplePool <- testing[-startSet, ]
start <- testing[startSet, ]
newSamp <- maxDissim(start, samplePool, n = 20)
newSamp <- samplePool[newSamp, ]
rownames(newSamp) <- c(1:20)

plot(samplePool[, 1], samplePool[, 2], pch = 20,
     col = "grey", xlim = c(-2.500, 1.400),
     ylim = c(-1.600, 2.900), xlab = "age", ylab = "nox")
par(new = TRUE)
plot(start[, 1], start[, 2], pch = "S",
     col = "red", xlim = c(-2.500, 1.400),
     ylim = c(-1.600, 2.900), xlab = "", ylab = "", cex = 1.3, font = 2)
par(new = TRUE)
plot(newSamp[, 1], newSamp[, 2], col = "blue", xlim = c(-2.500, 1.400),
     ylim = c(-1.600, 2.900), xlab = "", ylab = "")
text(newSamp[, 1], newSamp[, 2],
     labels = rownames(newSamp), cex = 1.3, font = 2)

The samples chosen are different because of the random order. See how the distribution of the chosen samples covers the structure of the data.

10 oct. 2018

PCA with Caret

In this post we test different types of Principal Component Analysis with different packages. This time I use Caret.

I use the same Tecator meat data which comes with the package. The spectra are treated with MSC (Multiplicative Scatter Correction) and I represent the score plane for the two components chosen by the PCA processing:

absorp_pca <- preProcess(absorpTrainMSC,
                         method = c("center", "scale", "pca"),
                         thresh = 0.95)
PC_scores_train <- predict(absorp_pca, absorpTrainMSC)
plot(PC_scores_train[, 1], PC_scores_train[, 2], col = "blue",
     xlim = c(-15, 11), ylim = c(-20, 11),
     xlab = "PC1", ylab = "PC2")
PC_scores_test <- predict(absorp_pca, absorpTestMSC)
par(new = TRUE)
plot(PC_scores_test[, 1], PC_scores_test[, 2], col = "red",
     xlim = c(-15, 11), ylim = c(-20, 11),
     xlab = "", ylab = "")


Now we get the plot of the scores for the training set in blue and for the test set in red.
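As a quick cross-check of the thresh = 0.95 setting, a minimal sketch with base R prcomp (reusing the absorpTrainMSC matrix from above) shows how many components are needed to reach 95% of the variance:

# Cumulative explained variance of the centered and scaled MSC spectra
pca <- prcomp(absorpTrainMSC, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
which(cum_var >= 0.95)[1]   # first component reaching 95% of the variance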




9 oct. 2018

Playing with CARET and TECATOR MEAT DATA

###### LOADING TECATOR DATA ###################################
library(caret)
library(pls)
data(tecator)
#' Loading the tecator data gives us two matrices:
#'   - The spectra matrix "absorp" (raw spectra).
#'     We also create another matrix with the MSC math treatment:
absorpMSC <- msc(absorp)
#'   - The constituents matrix "endpoints" (Moisture, Fat & Protein).
set.seed(930)
# We can add names to the columns with the wavelength values.
wavelengths <- as.matrix(seq(850, 1048, by = 2))
colnames(absorp) <- wavelengths
colnames(endpoints) <- c("Moisture", "Fat", "Protein")
#' We will model the protein content data and create a data partition
#' leaving 3/4 for the training set and 1/4 for the validation set.
#' With createDataPartition we generate a selection of sample positions
#' in a random order, to then take these samples out of the absorp and
#' endpoints matrices.
###### SPLITTING THE DATA #####################################
trainMeats <- createDataPartition(endpoints[,3], p = 3/4)
#' Now we select the corresponding training and validation matrices
#' with the raw and MSC-treated spectra
absorpTrain  <- absorp[trainMeats[[1]], ]
absorpTrainMSC<-as.matrix(absorpMSC[trainMeats[[1]], ])
absorpTest   <- absorp[-trainMeats[[1]], ]
absorpTestMSC   <- as.matrix(absorpMSC[-trainMeats[[1]], ])
#########  RAW SCAN SPECTRA  ##################################################
matplot(wavelengths,t(absorpTrain),type="l",
        xlab="wavelengths",ylab="Absorbance",col="blue")
par(new=TRUE)
matplot(wavelengths,t(absorpTest),type="l",
        xlab="",ylab="",col="green")
#########  MSC SCAN SPECTRA  ##################################################
matplot(wavelengths, t(absorpTrainMSC), type = "l", xlab = "wavelengths",
        ylab = "transmittance", ylim = c(min(absorpTrainMSC) - 0.1,
                                         max(absorpTrainMSC) + 0.1),
        col = "blue")
par(new = TRUE)
matplot(wavelengths, t(absorpTestMSC), type = "l", xlab = "wavelengths",
        ylab = "transmittance", ylim = c(min(absorpTrainMSC) - 0.1,
                                         max(absorpTrainMSC) + 0.1),
        col = "green")
#' and from the endpoints matrix for every constituent
moistureTrain <- endpoints[trainMeats[[1]], 1]
fatTrain <- endpoints[trainMeats[[1]], 2]
proteinTrain <- endpoints[trainMeats[[1]], 3]
# The rest of the samples go to the Validation Set
moistureTest <- endpoints[-trainMeats[[1]],1]
fatTest <- endpoints[-trainMeats[[1]], 2]
proteinTest  <- endpoints[-trainMeats[[1]], 3]
#We can combine these two matrices:
  # For Protein
trainDataProt<-cbind(proteinTrain,absorpTrain)          #Protein Raw Training
testDataProt<-cbind(proteinTest,absorpTest)             #Protein Raw Test
trainDataProtMSC<-cbind(proteinTrain,absorpTrainMSC)    #Protein MSC Training
testDataProtMSC<-cbind(proteinTest,absorpTestMSC)       #Protein MSC Test
  #For Fat
trainDataFat<-cbind(fatTrain,absorpTrain)               #Fat Raw Training
testDataFat<-cbind(fatTest,absorpTest)                  #Fat Raw Test
trainDataFatMSC<-cbind(fatTrain,absorpTrainMSC)         #Fat MSC Training
testDataFatMSC<-cbind(fatTest,absorpTestMSC)            #Fat MSC Test
  #For Moisture
trainDataMoi<-cbind(moistureTrain,absorpTrain)          #Moisture Raw Training
testDataMoi<-cbind(moistureTest,absorpTest)             #Moisture Raw Test
trainDataMoiMSC<-cbind(moistureTrain,absorpTrainMSC)    #Moisture MSC Training
testDataMoiMSC<-cbind(moistureTest,absorpTestMSC)       #Moisture MSC Test
#####  BUILDING THE MODELS ####################################
#####  MODELS FOR MOISTURE
model_moi_raw <- train(moistureTrain~.,data=trainDataMoi, method = "pls",
               scale = TRUE,
               trControl = trainControl("cv", number = 10),
               tuneLength = 20)
model_moi_msc <- train(moistureTrain~.,data=trainDataMoiMSC, method = "pls",
               scale = TRUE,
               trControl = trainControl("cv", number = 10),
               tuneLength = 20)
#####  MODELS FOR FAT
model_fat_raw <- train(fatTrain~.,data=trainDataFat, method = "pls",
                       scale = TRUE,
                       trControl = trainControl("cv", number = 10),
                       tuneLength = 20)
model_fat_msc <- train(fatTrain~.,data=trainDataFatMSC, method = "pls",
                       scale = TRUE,
                       trControl = trainControl("cv", number = 10),
                       tuneLength = 20)
#####  MODELS FOR PROTEIN
model_prot_raw <- train(proteinTrain~.,data=trainDataProt, method = "pls",
                   scale = TRUE,
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 20)
model_prot_msc <- train(proteinTrain~.,data=trainDataProtMSC, method = "pls",
                   scale = TRUE,
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 20)
######  PREDICTIONS ########################################
## PROTEIN PREDICTIONS
pred_prot_test_raw <- predict(model_prot_raw,testDataProt)
pred_prot_test_msc <- predict(model_prot_msc,testDataProtMSC)
## FAT PREDICTIONS
pred_fat_test_raw <- predict(model_fat_raw,testDataFat)
pred_fat_test_msc <- predict(model_fat_msc,testDataFatMSC)
## MOISTURE PREDICTIONS
pred_moi_test_raw <- predict(model_moi_raw,testDataMoi)
pred_moi_test_msc <- predict(model_moi_msc,testDataMoiMSC)
## PREPARING DATA FOR MONITOR FUNCTION
compare<-cbind(moistureTest,pred_moi_test_raw,pred_moi_test_msc,
               fatTest,pred_fat_test_raw,pred_fat_test_msc,
               proteinTest,pred_prot_test_raw,pred_prot_test_msc)
ID<-seq_len(nrow(compare))
compare<-cbind(ID,compare)
#### MONITORING AND STATISTICS  #########################
monitor10c26_003(compare[,c(1,2,3)])   # Moisture: reference vs raw model
monitor10c26_003(compare[,c(1,2,4)])   # Moisture: reference vs MSC model
monitor10c26_003(compare[,c(1,5,6)])   # Fat: reference vs raw model
monitor10c26_003(compare[,c(1,5,7)])   # Fat: reference vs MSC model
monitor10c26_003(compare[,c(1,8,9)])   # Protein: reference vs raw model
monitor10c26_003(compare[,c(1,8,10)])  # Protein: reference vs MSC model
#' For Moisture and Fat there is an improvement using the model
#' with the MSC math treatment.
#' For Protein the results are almost the same, but with the
#' raw spectra there is a certain slope and intercept problem,
#' and if corrected there is an improvement in the statistics.
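Since monitor10c26_003 is a custom function of mine, here is a minimal sketch (a hypothetical helper, base R only) of the basic statistics it reports, applied to the moisture columns of the compare matrix:

val_stats <- function(ref, pred) {
  res  <- ref - pred
  bias <- mean(res)
  c(RMSEP = sqrt(mean(res^2)),
    Bias  = bias,
    SEP   = sqrt(sum((res - bias)^2) / (length(res) - 1)))
}
val_stats(compare[, 2], compare[, 3])   # Moisture: reference vs raw model
val_stats(compare[, 2], compare[, 4])   # Moisture: reference vs MSC model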


3 oct. 2018

PC Regressions with CARET

I started to use Caret and I will continue using it, so I have to try a lot of things in R to become familiar with it.
 
In Caret there is a data set (data(tecator)) from a Tecator instrument for meat analysis, working in transmittance in the range from 850 to 1050 nm, with a total of 100 data points.
 
The parameters are Moisture, Fat and Protein. You can play around with this data to become familiar with Caret, so I tried to create a quick regression with PCR.
 
Caret lets us prepare the training and testing data in a random order and train the model with several kinds of cross validation. So I wrote some code, with the help of the available Caret documentation.

library(caret)
data(tecator)
set.seed(930)
colnames(absorp) <- paste0("x", 1:ncol(absorp))
## We will model the protein content data
trainMeats <- createDataPartition(endpoints[,3], p = 3/4)
absorpTrain  <- absorp[trainMeats[[1]], ]
proteinTrain <- endpoints[trainMeats[[1]], 3]
absorpTest   <- absorp[-trainMeats[[1]], ]
proteinTest  <- endpoints[-trainMeats[[1]], 3]
trainData<-cbind(proteinTrain,absorpTrain)
testData<-cbind(proteinTest,absorpTest)

model <- train(proteinTrain~.,data=trainData, method = "pcr",
               scale = TRUE,
               trControl = trainControl("cv", number = 10),
               tuneLength = 10)

names(model)
model$bestTune
summary(model$finalModel)
predictions <- predict(model,testData)
plot(predictions,proteinTest)
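
To put numbers to that plot, Caret's postResample function gives the RMSE, R squared and MAE on the test set (a small optional addition to the code above):

postResample(pred = predictions, obs = proteinTest)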


With this code we get plots and statistics. This is my first step into Caret, which I would like to go really deep into, so I hope to write more posts about this subject.

14 sept. 2018

Monitoring Validations (Case 001)

I use R regularly to study the validation of different equations; in this case it is an equation for cereals (barley, wheat, rye, corn, oat, triticale, ...). The Monitor function in this case compares the starch values from an instrument considered the Master (Y axis) with those from another considered the Host (X axis).
 
The idea is to check if there are differences important enough to take an action: adjust the bias or the slope and intercept, or even consider standardizing the instruments.
 
In this case the Monitor function gives a warning to check if there are groups or extreme samples which could recommend the adjustment of slope and intercept.
 
 
And indeed there is a gap with two groups of samples, so we have to consider what is happening in this case: we have a group of barley samples with lower starch values and a group with the wheat and corn samples with higher starch values.
 
In order to evaluate this better we have to make subsets, check what is going on with the prediction statistics by group, and proceed in the best way.
 


30 ago. 2018

Comparing Posteriors: Estimating Practical Differences Between Models


It is not the first time Max Kuhn appears in this blog, this time with a lecture (at the last New York R Conference) with advice on how to estimate which is the best model based on its statistics in R. We can surely get good advice to find the best possible model for our data sets.

29 ago. 2018

2018 New York R Conference Highlights


On April 20 this year the New York R Conference was held with great success.
Just look at the great atmosphere in the video of the conference.

16 ago. 2018

Checking the slopes in Validation Groups

This is a study to develop calibrations for meat on a reflectance instrument from 1100 to 1650 nm. Normally meat is measured in transmittance, but this is an approach to do it in reflectance.
 
I have just 64 samples with fat laboratory data. I split the spectra into 4 sets of 16 samples and merge 3 of them, leaving the remaining one for external validation. So I have 48 samples for training and 16 for validation, and I can develop four calibrations and validate each with its own external set.
 
Considering that we have few samples in the training set, I have to use few terms. The SEPs for external or cross validation are quite high, but the idea here is to see the changes in the slope for the four validation sets.
 
The reason is that we have few samples, and the slope value will stabilize as soon as more samples are included in the calibration and validation sets. A sketch of the procedure is shown below.
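
A minimal sketch of this split-and-check idea, assuming a hypothetical spectra matrix X and fat reference vector fat (and using 5 PLS terms only as an example), could be:

library(caret)   # createFolds
library(pls)     # plsr
set.seed(123)
d <- data.frame(fat = fat, NIR = I(X))      # keep the spectra as a matrix column
folds <- createFolds(d$fat, k = 4)          # four groups of ~16 samples
slopes <- sapply(folds, function(idx) {
  fit  <- plsr(fat ~ NIR, data = d[-idx, ], ncomp = 5)
  pred <- predict(fit, newdata = d[idx, ], ncomp = 5)[, 1, 1]
  coef(lm(d$fat[idx] ~ pred))[2]            # slope of reference vs prediction
})
slopes   # one slope per external validation group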
 
 
 
To improve the SEP we have to check the sample presentation method for this product and the procedure used to obtain the laboratory reference values.

3 ago. 2018

Monitoring the performance with the histogram

NIR can be used to detect levels of food additives and check if they are within the right limits.
In this case there are several types of dough, and they use two levels of additive concentration depending on the type, so we always have the same reference values.
A calibration is developed and we have new data to validate. NIR will give results which I expect to cover the reference value with a Gaussian distribution.
Using the Monitor function I can see the prediction distribution vs. the reference distribution and check if the expectations hold.
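
Outside the Monitor function, a similar check can be sketched in base R with a histogram of the predictions around one of the two nominal levels (pred and ref_value are hypothetical names here):

hist(pred, breaks = 20, col = "grey", main = "",
     xlab = "Additive level (NIR predictions)")
abline(v = ref_value, col = "red", lwd = 2)             # nominal reference value
abline(v = mean(pred), col = "blue", lwd = 2, lty = 2)  # mean of the predictions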
 

In the case of the higher concentration the distribution is fine, and for the lower concentration it is skewed (that is why the S/I adjustment is suggested). This can be a first approach to continue with this application with more accurate reference values.

13 jul. 2018

Validating Resemble Model with Monitor function

Continuing with this post, I evaluate the LOCAL model developed in Resemble. This time I use the Monitor function (one of the Monitor functions I am developing).

I create different subsets from the validation sample set for the different categories; in this case it is one type of puppy food, and I am evaluating the moisture content. We can see that there are two outliers that increase the SEP, so we have to see whether we can remove these samples for some reason.

Let's validate first with this puppy subset and check the statistics:

> val1.moi.pup1<-subset(val1.moi,ID1u_moi=="PUPPY-1")
> val1.moi.pup1<-cbind.data.frame(val1.moi.pup1$Sample.u_moi, 
+                                 val1.moi.pup1$Yu_moi, 
+                                 val1.moi.pup1$predicted.moi.local)
 
> monitor10c24xyplot(val1.moi.pup1)



Samples with the sample IDs 463 and 456 are outside the action limits and the monitor function shows their position in the table:
 
$ResWarning
[1] id   ref  pred res
<0 rows> (or 0-length row.names)

$ResAction
    id ref     pred       res
34 456 3.7 7.793351 -4.093351
32 463 4.9 7.881543 -2.981543

Now we can remove these samples, knowing their positions, and recalculate:
val1.moi.pup1<-val1.moi.pup1[-c(32,34),]
monitor10c24xyplot(val1.moi.pup1)





6 jul. 2018

Plots in Resemble (part 2)

Good results for the prediction of the validation samples (Xu, Yu) for protein. This is the XY plot where we can see, in different colors, the classes of the validation samples (different types of petfood). The SEP is 0.88 (without removing outliers). Defining the data frame by classes will allow us to see the SEP for every class, so we can check which class needs more samples in the training database (Xr, Yr) or whether there are other reasons for its error.

plot(predicted.local,Yu_prot,col=val1.prot$ID1u_prot,lwd=2)

The SEP for the protein validation samples is       :  0.887 
The R squared for the protein validation samples is :  0.962 
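
A minimal sketch of that per-class SEP idea, reusing the Yu_prot, predicted.local and val1.prot$ID1u_prot objects from the plot above (and taking the SEP simply as the standard deviation of the residuals within each class):

residuals <- Yu_prot - predicted.local
sep_by_class <- tapply(residuals, val1.prot$ID1u_prot, sd)   # SEP per petfood class
round(sep_by_class, 3)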
 
 

29 jun. 2018

Plots in Resemble (Part 1)

Resemble allows a certain number of plots which are very useful for your work or personal papers. In this case I use the same sets as in the previous post and I plot the PCA scores, where I can see the training matrix (Xr) scores and the validation matrix (Xu) scores overlapped.
 
The validation set is 35% (randomly selected) of the whole sample population, obtained over a long time period.
 
We can see how the validation samples cover more or less the space of the training samples.

> par(mfrow=c(1,2))
> plot(local.mbl, g = "pca", pcs=c(1,2)) 
> plot(local.mbl, g = "pca", pcs=c(1,3))
 

28 jun. 2018

Using correlation in LOCAL (Resemble package)

> ctrl.mbl <- mblControl(sm = "cor",
                         pcSelection = list("cumvar", 0.999),
                         valMethod = "NNv",
                         scaled = FALSE, center = TRUE)
> local.mbl <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                   mblCtrl = ctrl.mbl,
                   dissUsage = "none",
                   k = seq(40, 150, by = 10),
                   pls.c = c(5, 15),
                   method = "wapls1")
 

Predicting sample: 1  ----------
Predicting sample: 2  ----------
Predicting sample: 3  ----------
Predicting sample: 4  ----------
Predicting sample: 5  ----------
--------------------------------
--------------------------------
 
> plot(predicted.local,Yu)

  
 

 
This time I use correlation (as Win ISI does) and try to find the best number of samples to select for the LOCAL algorithm with a sequence.

As we can see, the predictions improve with more samples in the calibration (red dots); maybe it could be better with even more samples, but at the end of the plot it starts to stabilize.
 
 
 
 
 

27 jun. 2018

LOCAL Calibrations with Resemble

The Resemble package is really interesting, so I am trying to work with it and understand it better, even to help me in the case of Win ISI LOCAL calibrations.
We can get predictions for different combinations of locally selected calibration samples to predict the unknown, so we can see the best option. We use a certain number of terms (min. and max.) and a weighted average is calculated.
 
In this case I use an external validation set of petfood, Xu, with reference data (protein) Yu, and I want to know the statistics (RMSE and R squared) for the case of 90 local samples selected:

> predicted.local <- as.numeric(Yu_anl$Nearest_neighbours_90)
> rmse.local <- sqrt(mean((Yu - predicted.local)^2))
> R2.local <- cor(Yu, predicted.local)^2
> R2.local
[1] 0.9507232
> rmse.local
[1] 1.163304
 
plot(predicted.local,Yu)
There are a lot of options to explore, so I will continue checking this package.
 
 
 

26 jun. 2018

Memory based learning methods and tools (Resemble / LOCAL)

This is the link to a presentation which helps us understand the concept of LOCAL, a concept we will treat during the next posts with the Resemble package, and which we have treated and will continue treating with LOCAL in Win ISI.

Developing LOCAL calibrations with R

We can also use LOCAL in R with the Resemble package. I am testing the package these days with a set of petfood spectra (with protein reference values) imported from Win ISI with SNV and a second-derivative math treatment. After that, I select 65% of the samples for training and the rest for testing.
 
The getPredictions process of Resemble allows a configuration to check for the best number of samples or factors for the best prediction, so there are a lot of options and functions to check in this package.
 
This is a plot of the results for a standard configuration from the reference manual, which I will try to go deeper into, trying to find the best configuration.

ctrl <- mblControl(sm = "pls",
                   pcSelection = list("opc", 40),
                   valMethod = c("NNv"),
                   scaled = FALSE, center = TRUE)


ex1 <- mbl(Yr = Yr, Xr = Xr, Yu = NULL, Xu = Xu,
           mblCtrl = ctrl,
           dissUsage = "predictors",
           k = seq(30, 150, 15),
           method = "wapls1",
           pls.c= c(7, 20))

Yu_anl<-getPredictions(ex1)


It clearly seems that some of the configurations overfit, but I am just starting to learn the package, so more posts will come up showing my progress with it.

14 jun. 2018

Precision and Accuracy: LAB vs NIR

I am finally happy with this plot, which tries to explain the precision and accuracy of the laboratory vs. the NIR predictions for a set of samples and subsamples. I will explain these plots in more detail in coming posts.

6 jun. 2018

Subsample Predictions Boxplot

This is a boxplot of the predictions for four subsamples of meat meal. A representative sample of a certain batch has been divided into four subsamples and analyzed on a NIR instrument. We get four predictions, one for every subsample, so the boxplot gives an idea of the variance in the predictions for every sample based on its subsamples.
 
The colors indicate the laboratory the subsamples were sent to (two different labs, one color each). The colors have a certain transparency because in some cases two subsamples went to one lab and two to the other, in other cases the four subsamples went to the same lab, and in some cases three went to one lab and one to the other.
 
All these studies give an idea of the complexity of the meat meal product.



3 jun. 2018

Average and subsample residuals

In order to better understand the performance of a model, different blind subsamples of each sample were sent to a laboratory, so in some cases we have the lab values of four subsamples of a sample and in other cases two. There are two cases with only one subsample.
 
For every sample we calculate the average of the lab values and the average of the predictions, to get the red-dot residuals.
 
We also have the residual of every subsample vs. its prediction, and those are the gray dots.
 
The plot (made with R) gives nice information about the performance of the model and how the average performs better in most cases than the individual subsamples.
 
We can see the warning (2·SEL) and action (3·SEL) limits, and how the predictions for the averages fall within the warning limits.
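
A minimal sketch of this kind of plot, assuming a hypothetical data frame subs with columns sample (numeric ID), lab and pred, plus a known SEL value:

subs$res <- subs$pred - subs$lab                       # residual of every subsample
avg <- aggregate(cbind(lab, pred) ~ sample, data = subs, FUN = mean)
avg$res <- avg$pred - avg$lab                          # residual of the averages
plot(subs$sample, subs$res, pch = 20, col = "grey",
     xlab = "Sample", ylab = "Residual (pred - lab)")
points(avg$sample, avg$res, pch = 19, col = "red")
abline(h = c(-2, 2) * SEL, col = "orange", lty = 2)    # warning limits (2.SEL)
abline(h = c(-3, 3) * SEL, col = "red", lty = 2)       # action limits (3.SEL)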
 
 

31 may. 2018

Comparing Residuals and Lab Error

One important point before developing a calibration is to know the laboratory error, known as SEL (Standard Error of Laboratory).
This error changes depending on the product, because it can be very homogeneous or heterogeneous, so in the first case the lab error is lower than in the second.
In the case of meat meal the errors are higher than for other products, and this is a case I am working on these days and want to share in this post.
A certain number of well-homogenized samples (n) were divided into two or four subsamples and sent to a laboratory for Dumas protein analysis. After receiving the results, a formula used for this case, based on the standard deviation of the subsamples and on the number of samples, gives the laboratory error for protein in meat meal.
The result is 1.3. Probably some of you may think this is a high value, but the meat meal product is really quite complex and I think it is normal.
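
One common way to compute the SEL is the pooled within-sample standard deviation of the subsamples; this is only a sketch of that idea (with a hypothetical data frame lab_data holding a sample ID and the protein values), since the exact formula used in this study may differ:

within_ss <- tapply(lab_data$protein, lab_data$sample,
                    function(x) sum((x - mean(x))^2))  # within-sample sums of squares
df <- tapply(lab_data$protein, lab_data$sample,
             function(x) length(x) - 1)                # degrees of freedom per sample
SEL <- sqrt(sum(within_ss) / sum(df))
SEL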
 
Every subsample that went to the lab has also been analyzed on a NIR instrument, and the spectra of the subsamples were studied separately with the RMS statistic, which shows that the samples were well homogenized.
Now I have the option to average the NIR predictions of the subsamples (case 1), or to average the spectra of the subsamples, predict the averaged spectrum with the model and get the predicted result (case 2). I use option 1 here and plot the residuals of the average predictions against the lab average values:

The blue lines are +/- 1·SEL, the yellow ones +/- 2·SEL and the red ones +/- 3·SEL.
 

 

20 may. 2018

Monitoring LOCAL calibrations (part 2)

Following on from the previous post, when monitoring LOCAL equations we have to consider several points. Some of them could be:
  1. Which is the reference method and what is its laboratory error? E.g. the reference method for protein can be Dumas or Kjeldahl.
  2. Do you have data from several laboratories or from just one? Which reference method does every lab use and what is its laboratory error?
  3. If we have data in the Monitor from both methods, or from different laboratories, split them apart and see which SEP we get. Check also the SEP corrected by the bias.
  4. Check the performance of the SEP and other statistics for every species. In meat meal, for example, you have pork, mix, poultry, feather, ...
Surely you can extract more knowledge from the data:
  1. Do we have more error with one method than another, or with one lab than another?
  2. Does a reference method perform better for one particular species than another?
  3. ...
Draw conclusions from these and other thoughts.
 
For all of this, it is important to organize the data well and use it properly. Don't forget that we are comparing reference versus predicted values, and that the prediction depends on the model itself apart from the reference values, so we have to be sure that our model is not overfitted or wrongly developed.

19 may. 2018

Monitoring LOCAL calibrations

Databases to develop LOCAL calibrations normally have a high quantity of spectra with lab values, but we have to take care of them, adding new sources of variance. This way we make them more robust and the standard errors of prediction (SEP) decrease when we validate with future independent validation sets.
 
This was the case with a meat meal LOCAL database updated with 100 samples with protein values and with new sources of variation: laboratories, reference methods (Kjeldahl, Dumas), providers, new instruments, ...
 
After the LOCAL database update, a set of new samples with reference values was received, and I have predicted these values with the Monitor function in Win ISI using the LOCAL database before (blue dots) and after the update (red dots).
 
The errors decrease in an important way, especially for some types of samples, when validating with the new set of samples (new samples acquired after the updated LOCAL calibration was installed in the routine software), so even if we have spectra from this instrument, these labs, etc., this set has to be considered as an independent set.
 
I don't give details of the statistics, but this picture shows the same samples predicted with the LOCAL calibration without the update (in blue) and with the updated LOCAL calibration (in red). The improvement in the prediction for some samples is important, so the idea is to add these new samples and continue monitoring the LOCAL database with future validation sets.




17 may. 2018

Nils Foss died on 16-May-2018


Sad news from FOSS: Nils Foss (the founder of FOSS) died yesterday.
He was a good example of an entrepreneur.
R.I.P.

15 may. 2018

Random selection before calibration

Following a course about Machine Learning with R, I realized the importance of randomly selecting the samples for the development of the models.
R has good tools to select the samples randomly, and doing it was a common practice during all the exercises in the course.
 
I do it here with the soy meal samples I have used for several posts, so we can compare the results.
 
The idea of random selection is to make the model robust against the bias which we see quite often when validating the models with independent samples.
 
We can see if the number of terms selected changes or if we get results similar to the previous selection, which used a structured selection of odd and even samples.
 
A random selection order is also important for a better cross validation.
 
Here is the code and the preparation of the sample sets for the development of the models.


##################### SPLITTING SAMPLES RANDOMLY #############
# In this case we need first the data frame "soy_ift_conv"
# Split the data into 65% training and 35% test
rndIndices <- sample(nrow(soy_ift_conv))
sepPnt <- round(.65 * nrow(soy_ift_conv))
train <- soy_ift_conv[rndIndices[1:sepPnt], ]
validation <- soy_ift_conv[rndIndices[(sepPnt + 1):length(rndIndices)], ]
# Plotting training and validation sets overlapped
matplot(wavelengths, t(train$X_msc), type = "l",
        xlab = "wavelengths", ylab = "Absorbance",
        col = "blue")
par(new = TRUE)
matplot(wavelengths, t(validation$X_msc), lty = 1,
        pch = NULL, axes = FALSE,
        type = "l", col = "gray", xlab = "", ylab = "")


We see in gray the validation samples selected and in blue the training samples.
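
As an alternative to the plain random indices above, caret's createDataPartition can make the same 65/35 split while keeping the distribution of the constituent balanced in both sets (a short sketch; the protein column name soy_ift_conv$Y is an assumption here):

library(caret)
set.seed(1234)
inTrain <- createDataPartition(soy_ift_conv$Y, p = 0.65, list = FALSE)
train      <- soy_ift_conv[inTrain, ]
validation <- soy_ift_conv[-inTrain, ]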



       
 

2 may. 2018

First approach to ANN calibrations

This is a first approach to developing ANN calibrations with the soy meal data from the Infratec, and it is really promising.
I follow this code and plot the results:

# We build a data frame with the constituent values
# and with the spectra math-treated with MSC
Sample <- sm_ift[, 1]
Y <- as.matrix(sm_ift[, 2])      # The Y matrix contains the protein values
which(is.na(Y), arr.ind = TRUE)  # Sample 462 has an NA value, so we remove it
Y <- Y[-462, ]
X <- as.matrix(sm_ift[, 6:105])  # The X matrix contains the NIR spectra
X <- X[-462, ]
library(ChemometricsWithR)
library(pls)
X <- msc(X)
##====================================================================
##PRINCIPAL COMPONENTS ANALYSIS using package "Chemometrics" NIPALS)
##====================================================================
library(chemometrics)
X_nipals<-nipals(X,a=4)
T<-X_nipals$T
T<-round(T,4)
T1<-T[,1]
T2<-T[,2]
T3<-T[,3]
T4<-T[,4]
P<-X_nipals$P
P<-round(P,4)
###################################################################

soymeal=data.frame(T1=T1,T2=T2,T3=T3,T4=T4,Y=Y)
#' Split the data into 65% training and 35% test
rndIndices=sample(nrow(soymeal))
sepPnt=round(.65*nrow(soymeal))
train=soymeal[rndIndices[1:sepPnt],]
test=soymeal[rndIndices[(sepPnt+1):length(rndIndices)],]

#' Create a neural network model with 4 hidden
#' nodes for Y~. using the training data.
#' Store this model as 'model'
library(datasets)
library(nnet)
library(graphics)
model=nnet(Y~.,train,size=4,linout=T)

#' Use this model to estimate the Y values of the test data
pre=(predict(model,test))
pre=round(pre,2)
#' Calculate the MSE of the model on the test data and output
#' it using the print or cat functions
mse=mean((test$Y-pre)^2)
cat("The MSE of the model on the test data is: ",mse,"\n")
#The MSE of the model on the test data is:  0.9314747
plot(pre,test$Y)
abline(0,1)



The loss function in this case is the MSE (Mean Squared Error).
More practice in the coming days.

30 abr. 2018

Validation with LOCAL calibrations

When developing a LOCAL calibration in Win ISI, we use an input file and a library file with the idea of selecting the best possible model, so the selection of a representative input file (let's call it the validation set) is very important for success in the development of the model.
 
So the model is conditioned by the input file; if we had chosen another input file we could have got another model which performs differently, so the need for a test set to check how the model performs with new data is obvious.
 
It is important to have this in mind, so one proposal would be to divide the data (randomly) into three sets: 60% for training, 20% for the input or validation set, and another 20% for testing, as in the sketch below.
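
A minimal sketch of that random 60/20/20 split in R (db is a hypothetical data frame with one row per sample):

set.seed(42)
n   <- nrow(db)
idx <- sample(n)                                                # random order of the samples
train_set <- db[idx[1:round(0.60 * n)], ]                       # 60% for training
input_set <- db[idx[(round(0.60 * n) + 1):round(0.80 * n)], ]   # 20% input/validation
test_set  <- db[idx[(round(0.80 * n) + 1):n], ]                 # 20% for testing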
 
There are other ways to sort the data in order to select these three sets (time, seasons, species, ...). One thing is clear: some of the models developed will perform better than others, so you can keep several of them and check this when you have new data, using a statistic such as the MSE (Mean Squared Error) to compare them.

19 abr. 2018

Projections over the planes in PC space

This is a 3D plot to see the orthogonal projections onto the plane formed by the first and second PCs calculated with SVD. Once projected onto the plane, the projections are projected again onto the new axes (the terms or loadings) to calculate the score for every PC.
The plot can be moved with the mouse to find the best view.

The data is centered, so we see samples on both sides of the plane.
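
A minimal sketch of those projections with base R SVD (assuming a numeric, already imported spectra matrix X):

Xc <- scale(X, center = TRUE, scale = FALSE)         # centered data
sv <- svd(Xc)
scores   <- sv$u %*% diag(sv$d)                      # scores (projections on every PC)
loadings <- sv$v                                     # loadings (PC directions)
proj_plane <- scores[, 1:2] %*% t(loadings[, 1:2])   # projection onto the PC1-PC2 plane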



17 abr. 2018

Covariance plot and Mahalanobis ellipse

In the previous post we saw the correlation matrix plot, and in this one we see the covariance matrix plot; we just have to change the function "cor" to "cov" in the code.
 
The covariance matrix is used frequently in chemometrics, as in this case to define the ellipse for the Mahalanobis distance, using the covariance between the two axes (one for the 964 nm wavelength and the other for the linear combination of the 1022 and 902 nm wavelengths that we have seen in another recent post).
 
We can do this easily with the package chemometrics and this code:

x3_x1x2 <- cbind(x3, x1x2)
library(chemometrics)
drawMahal(x3_x1x2, center = apply(x3_x1x2, 2, mean),
          covariance = cov(x3_x1x2),
          quantile = 0.975, col = "blue")

to get this plot:

16 abr. 2018

Correlation Matrix plot

In the last posts we have been talking about the wavelength space and its high collinearity, because we want to select wavelengths with low correlation between them in order to develop a model.
 
For this task we can check the correlation matrix, which is easier to read in a plot than as numbers. This is the plot for the soy meal samples in transmittance, using the 100 wavelengths from 850 to 1048 nm in steps of 2 nm, so the correlation matrix is a 100 x 100 symmetric matrix (with ones on the diagonal), as can be seen in the plot.
 
The red line is the correlation of the 964 nm wavelength with all the rest, including itself (1 in that case). The vertical blue lines mark the wavelengths at 1022, 902 and 964 nm used in the recent posts.
 
See the Correlation matrix plot and code: 
cor_Xcmsc <- cor(X_msc_centered)
matplot(wavelengths, t(cor_Xcmsc), type = "l",
        xlab = "wavelengths",
        ylab = "Correlation",
        col = "grey", ylim = c(-1.00, 1.00))
par(new = TRUE)
matplot(wavelengths, cor_Xcmsc[58, ], type = "l",
        xlab = "wavelengths",
        ylab = "Correlation",
        col = "red", ylim = c(-1.00, 1.00))
abline(v = 1022, col = "blue")
abline(v = 902, col = "blue")
abline(v = 964, col = "blue")