11 nov. 2018

Variable Importance in NIR "PLS" Models (CARET)

varImp is a function of the R caret package to check the importance of the variables in a regression. In the case of the model developed with sunflower seed to determine oleic acid (model_oleic), we can check which variables have more importance, and this is done in a simple step:
 
varImp_pls<-varImp(model_oleic)
 
The best way to check the result is to plot it as a spectrum:
 
matplot(wavelengths,varImp_pls$importance,type="l",
        xlab="wavelengths",
        ylab="importance",
        ylim =c(min(varImp_pls$importance)-0.1,
                max(varImp_pls$importance)+0.1),
        col="blue")
 
Running the code we obtain the importance spectrum. We can see that the zone from 1700 to 1800 nm has higher importance than the rest, due to the peaks linked to oil around 1720 and 1760 nm.
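We can also list the most important wavelengths numerically. This is a minimal sketch, assuming (as is usual for caret regression models) that the importance table has a single "Overall" column with the wavelengths as row names:

imp <- varImp_pls$importance
# wavelengths sorted by decreasing importance; the top ones should fall
# in the 1700 to 1800 nm zone discussed above
head(rownames(imp)[order(imp$Overall, decreasing = TRUE)], 10)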
 

8 nov. 2018

When can we trust the results that the NIR predicts?

This is one of the most common questions NIR users ask, and I will try to give a simplified answer.
When a calibration model is installed, it comes with statistics that indicate the calibration errors, known as:
SEC (Standard Error of Calibration) and SECV (Standard Error of Cross Validation)
 
After the equation is installed for routine use, and once we have about 20 samples for each parameter, we can calculate the SEP (Standard Error of Prediction) and compare the three errors to draw conclusions. The ideal situation is when the three error statistics are similar and low enough for the NIR application to be usable.
 
We have to keep in mind that knowing the laboratory error (SEL) will help us know the error ratio between the NIR and the laboratory, and also to make decisions on how to lower it (presenting the sample in a different way, etc.).
When we validate and calculate the SEP, we can generate other statistics to see whether the equation is behaving correctly, based on the equation we have installed. For this we need the number of samples in the installed calibration, the number of terms used, the confidence level we want, and the SEC or SECV. With these, confidence limits are generated for the bias, the slope and the SEP. If the results are within the limits, we can keep using the model with confidence in the error margins given initially.
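A minimal sketch in R of these validation statistics, with simulated reference and predicted vectors (the data are simulated, and the form of the bias limit is an assumption on my part; consult ISO 12099 for the exact formulas):

set.seed(1)
reference <- rnorm(20, mean = 30, sd = 4)        # simulated laboratory values
predicted <- reference + rnorm(20, 0, 2.2)       # simulated NIR predictions
res  <- reference - predicted
n    <- length(res)
bias <- mean(res)                                # systematic error
SEP  <- sqrt(sum((res - bias)^2) / (n - 1))      # standard error of prediction
# Approximate two-sided bias confidence limit, scaled by the SECV of the
# installed calibration (assumed form; see the standard for the exact limits):
SECV  <- 2.21                                    # taken from the example below
t_val <- qt(0.975, df = n - 1)
bias_limits <- c(-1, 1) * t_val * SECV / sqrt(n)
c(bias = bias, SEP = SEP)
bias_limits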
 
This is an example:
In a model for oleic acid in ground sunflower seed (ground with a Moulinex-type mill), with 109 samples, 5 terms, a 95% confidence level and a SECV of 2.21, the results of the ISO 12099:2010 test are:
 
 
As we can check, the SEP is within the expected limits and the model can continue to be used. Nevertheless, we have to draw conclusions on how to improve it, and the XY plot can help us:
 
It seems likely that with a better grinding of the sample, with replicates, or with other presentation methods the model can be improved, but as currently run it is operational.

30 oct. 2018

Confusion Matrix with Caret


This is a useful tool in R to evaluate a predictive model for classification. We know the expected value and the predicted one, and from those we can get the confusion matrix and the useful statistics derived from it.
I reproduce here the code from the post "How To Estimate Model Accuracy in R Using The Caret Package" from the blog "Machine Learning Mastery":

# load the libraries
library(caret)
library(klaR)
# load the iris dataset
data(iris)
# define an 80%/20% train/test split of the dataset
split=0.80
trainIndex <- createDataPartition(iris$Species,
                                  p=split,
                                  list=FALSE)
data_train <- iris[ trainIndex,]
data_test <- iris[-trainIndex,]
# train a naive bayes model
model <- NaiveBayes(Species~., data=data_train)
# make predictions
x_test <- data_test[,1:4]
y_test <- data_test[,5]
predictions <- predict(model, x_test)
# summarize results
confusionMatrix(predictions$class, y_test)


Try to understand the results: some samples are well classified and others are not, so we must look for the model that gives the best classification statistics. This is a simple example, but why not apply these machine learning algorithms to spectra for classification and use the confusion matrix to choose the best model?
The statistics we get running the last line of code are:
> confusionMatrix(predictions$class, y_test)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0          9         1
  virginica       0          1         9

Overall Statistics
                                          
               Accuracy : 0.9333          
                 95% CI : (0.7793, 0.9918)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 8.747e-12       
                                          
                  Kappa : 0.9             
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9000           0.9000
Specificity                 1.0000            0.9500           0.9500
Pos Pred Value              1.0000            0.9000           0.9000
Neg Pred Value              1.0000            0.9500           0.9500
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3000           0.3000
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.9250           0.9250

An easy example to understand the confusion matrix is this code:
library(caret)
expected <- factor(c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0))
predicted <- factor(c(1, 0, 0, 1, 0, 0, 1, 1, 1, 0))
results <- confusionMatrix(data=predicted, reference=expected)
print(results)

Where you get:
> print(results)
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 4 1
         1 2 3
                                          
               Accuracy : 0.7             
                 95% CI : (0.3475, 0.9333)
    No Information Rate : 0.6             
    P-Value [Acc > NIR] : 0.3823          
                                          
                  Kappa : 0.4             
 Mcnemar's Test P-Value : 1.0000          
                                          
            Sensitivity : 0.6667          
            Specificity : 0.7500          
         Pos Pred Value : 0.8000          
         Neg Pred Value : 0.6000          
             Prevalence : 0.6000          
         Detection Rate : 0.4000          
   Detection Prevalence : 0.5000          
      Balanced Accuracy : 0.7083   

The caret documentation gives the formulas for these statistics.
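As a quick check, some of them can be reproduced by hand from the 2x2 table above (caret takes the first factor level, "0", as the positive class):

TP <- 4; FN <- 2; FP <- 1; TN <- 3                    # cells of the 2x2 table
sensitivity <- TP / (TP + FN)                         # 0.6667
specificity <- TN / (TN + FP)                         # 0.7500
pos_pred_value <- TP / (TP + FP)                      # 0.8000
balanced_accuracy <- (sensitivity + specificity) / 2  # 0.7083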

25 oct. 2018

Building Predictive Models in R Using the caret Package

I recommend reading and practicing with the paper:

Building Predictive Models in R Using the caret Package

Following the tutorial with the Mutagen data in R is good practice.
The code is in the paper, but in some cases we have to work things out ourselves in R to complete certain steps, like in the code below.

library(caret)
set.seed(1)
in.Train<-createDataPartition(mutagen,p=3/4,list=FALSE)
trainDescr<-descr[in.Train,]              #used for model training
testDescr<-descr[-in.Train,]               #used to evaluate model performance
trainClass<-mutagen[in.Train]           #used for model training
testClass<-mutagen[-in.Train]           #used to evaluate model performance
prop.table(table(mutagen))                #distribution of mutagen (all data)
prop.table(table(trainClass))             #distribution of the training set
#There were three zero-variance predictors in the training data.
sum(apply(trainDescr, 2, var) == 0)     # 3
variance<-apply(trainDescr, 2, var)
zv<-variance==0
which(zv, arr.ind = TRUE, useNames = TRUE)
#T.F..Br. G.F..Br.    I.097
#155      708         1539
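# The same three columns can also be found with caret's own helper
# (a small sketch: nearZeroVar with saveMetrics = TRUE returns per-column
# diagnostics, including a zeroVar flag):
nzv<-nearZeroVar(trainDescr, saveMetrics = TRUE)
which(nzv$zeroVar)                          # 155  708  1539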
trainDescr<-trainDescr[,-c(155,708,1539 )]  #zero variance descriptors removed
testDescr<-testDescr[,-c(155,708,1539 )]    #zero variance descriptors removed

#We also remove predictors to make sure that there are no
#between-predictor (absolute) correlations greater than 90%:
ncol(trainDescr)                        #1576
descrCorr<-cor(trainDescr)              #Correlation Matrix (1576 x 1576)
highCorr<-findCorrelation(descrCorr,0.90)
#Remove the high correlated descriptors from the Training and Test sets
trainDescr<-trainDescr[,-highCorr]
testDescr<-testDescr[,-highCorr]
ncol(trainDescr)                        #650

14 oct. 2018

CARET:Splitting Based on the Predictors

I'm practicing with caret, and the best way is to follow the tutorials on its web page. This time the topic is how we can split the data with caret:

4.2:Splitting Based on the Predictors

Read it and try to understand the concept.
I tried to write the code to reproduce the plot in the figure, and more or less I managed it:

library(caret)
library(mlbench)        # provides the BostonHousing data
data(BostonHousing)
testing <- scale(BostonHousing[, c("age", "nox")])
set.seed(5)
## A random sample of 5 data points
startSet <- sample(1:dim(testing)[1], 5)
samplePool <- testing[-startSet,]
start <- testing[startSet,]
newSampIdx <- maxDissim(start, samplePool, n = 20)
newSamp <- samplePool[newSampIdx, ]
rownames(newSamp) <- c(1:20)
plot(samplePool[, 1], samplePool[, 2], pch = 20,
     col = "grey", xlim = c(-2.500, 1.400),
     ylim = c(-1.600, 2.900), xlab = "age", ylab = "nox")
par(new = TRUE)
plot(start[, 1], start[, 2], pch = "S",
     col = "red", xlim = c(-2.500, 1.400),
     ylim = c(-1.600, 2.900), xlab = "", ylab = "", cex = 1.3, font = 2)
par(new = TRUE)
plot(newSamp[, 1], newSamp[, 2], col = "blue", xlim = c(-2.500, 1.400),
     ylim = c(-1.600, 2.900), xlab = "", ylab = "")
text(newSamp[, 1], newSamp[, 2],
     labels = rownames(newSamp), cex = 1.3,
     font = 2)

The samples chosen may differ from those in the tutorial figure because of the random sampling. See how the distribution of the chosen samples covers the structure of the data.

10 oct. 2018

PCA with Caret

On this blog we have tested different types of Principal Component Analysis with different packages. This time I use caret.

I use the same Tecator meat data which comes with the package (the absorpTrainMSC and absorpTestMSC matrices are created in the post from 9 October below). The spectra are treated with MSC (Multiplicative Scatter Correction) and I plot the plane of the scores for the two principal components chosen by the PCA processing:

absorp_pca<-preProcess(absorpTrainMSC,
                       method = c("center", "scale","pca"),
                       thresh = 0.95)
PC_scores_train<-predict(absorp_pca,absorpTrainMSC)
plot(PC_scores_train[,1],PC_scores_train[,2],col="blue",
     xlim=c(-15,11),ylim = c(-20,11),
     xlab = "PC1",ylab = "PC2")
PC_scores_test<-predict(absorp_pca,absorpTestMSC)
par(new=TRUE)
plot(PC_scores_test[,1],PC_scores_test[,2],col="red",
     xlim=c(-15,11),ylim = c(-20,11),
     xlab = "",ylab="")
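To check how many principal components were needed to reach the 0.95 variance threshold we can inspect the preProcess object (printing it summarizes the transformation; if I remember correctly the count is also stored in numComp):

print(absorp_pca)
absorp_pca$numComp      # number of components kept (assumed field name)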


Now we get the plot of the scores for the training set in blue and for the test set in red.




9 oct. 2018

Playing with CARET and TECATOR MEAT DATA

###### LOADING TECATOR DATA ###################################
library(caret)
library(pls)
data(tecator)
#' Loading the tecator data we load two matrices:
#'   The spectra matrix "absorp" (raw spectra).
#'   We want to create another matrix with the MSC math treatment:
absorpMSC<-msc(absorp)
#'   The constituents matrix "endpoints" (Moisture, Fat & Protein).
set.seed(930)
#We can add names to the columns with the wavelengths values.
wavelengths<-as.matrix(seq(850,1048,by=2))
colnames(absorp)<-wavelengths
colnames(endpoints)<- c("Moisture","Fat","Protein")
#' We will model the protein content data and create a data partition
#' leaving 3/4 for the training set and 1/4 for the validation set.
#' With createDataPartition we generate a selection of sample positions
#' in a random order, in order to take these samples out from the absorp
#' and endpoints matrices afterwards.
###### SPLITTING THE DATA #####################################
trainMeats <- createDataPartition(endpoints[,3], p = 3/4)
#'Now we select the corresponding training and validation matrices
#'with the raw and MSC treated spectra
absorpTrain  <- absorp[trainMeats[[1]], ]
absorpTrainMSC<-as.matrix(absorpMSC[trainMeats[[1]], ])
absorpTest   <- absorp[-trainMeats[[1]], ]
absorpTestMSC   <- as.matrix(absorpMSC[-trainMeats[[1]], ])
#########  RAW SCAN SPECTRA  ##################################################
matplot(wavelengths,t(absorpTrain),type="l",
        xlab="wavelengths",ylab="Absorbance",col="blue")
par(new=TRUE)
matplot(wavelengths,t(absorpTest),type="l",
        xlab="",ylab="",col="green")
#########  MSC SCAN SPECTRA  ##################################################
matplot(wavelengths,t(absorpTrainMSC),type="l",xlab="wavelengths",
        ylab="Absorbance",ylim =c(min(absorpTrainMSC)-0.1,
                                  max(absorpTrainMSC)+0.1),
        col="blue")
par(new=TRUE)
matplot(wavelengths,t(absorpTestMSC),type="l",xlab="wavelengths",
        ylab="Absorbance",ylim =c(min(absorpTrainMSC)-0.1,
                                  max(absorpTrainMSC)+0.1),
        col="green")
#'and from the endpoint matrix for every constituent
moistureTrain <- endpoints[trainMeats[[1]], 1]
fatTrain <- endpoints[trainMeats[[1]], 2]
proteinTrain <- endpoints[trainMeats[[1]], 3]
# The rest of the samples go to the Validation Set
moistureTest <- endpoints[-trainMeats[[1]],1]
fatTest <- endpoints[-trainMeats[[1]], 2]
proteinTest  <- endpoints[-trainMeats[[1]], 3]
#We can combine these two matrices:
  # For Protein
trainDataProt<-cbind(proteinTrain,absorpTrain)         #Protein Raw Training
testDataProt<-cbind(proteinTest,absorpTest)            #Protein Raw Test
trainDataProtMSC<-cbind(proteinTrain,absorpTrainMSC)   #Protein MSC Training
testDataProtMSC<-cbind(proteinTest,absorpTestMSC)      #Protein MSC Test
  #For Fat
trainDataFat<-cbind(fatTrain,absorpTrain)               #Fat Raw Training
testDataFat<-cbind(fatTest,absorpTest)                  #Fat Raw Test
trainDataFatMSC<-cbind(fatTrain,absorpTrainMSC)         #Fat MSC Training
testDataFatMSC<-cbind(fatTest,absorpTestMSC)            #Fat MSC Test
  #For Moisture
trainDataMoi<-cbind(moistureTrain,absorpTrain)          #Moisture Raw Training
testDataMoi<-cbind(moistureTest,absorpTest)             #Moisture Raw Test
trainDataMoiMSC<-cbind(moistureTrain,absorpTrainMSC)    #Moisture MSC Training
testDataMoiMSC<-cbind(moistureTest,absorpTestMSC)       #Moisture MSC Test
#####  BUILDING THE MODELS ####################################
#####  MODELS FOR MOISTURE
model_moi_raw <- train(moistureTrain~.,data=trainDataMoi, method = "pls",
               scale = TRUE,
               trControl = trainControl("cv", number = 10),
               tuneLength = 20)
model_moi_msc <- train(moistureTrain~.,data=trainDataMoiMSC, method = "pls",
               scale = TRUE,
               trControl = trainControl("cv", number = 10),
               tuneLength = 20)
#####  MODELS FOR FAT
model_fat_raw <- train(fatTrain~.,data=trainDataFat, method = "pls",
                       scale = TRUE,
                       trControl = trainControl("cv", number = 10),
                       tuneLength = 20)
model_fat_msc <- train(fatTrain~.,data=trainDataFatMSC, method = "pls",
                       scale = TRUE,
                       trControl = trainControl("cv", number = 10),
                       tuneLength = 20)
#####  MODELS FOR PROTEIN
model_prot_raw <- train(proteinTrain~.,data=trainDataProt, method = "pls",
                   scale = TRUE,
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 20)
model_prot_msc <- train(proteinTrain~.,data=trainDataProtMSC, method = "pls",
                   scale = TRUE,
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 20)
######  PREDICTIONS ########################################
## PROTEIN PREDICTIONS
pred_prot_test_raw <- predict(model_prot_raw,testDataProt)
pred_prot_test_msc <- predict(model_prot_msc,testDataProtMSC)
## FAT PREDICTIONS
pred_fat_test_raw <- predict(model_fat_raw,testDataFat)
pred_fat_test_msc <- predict(model_fat_msc,testDataFatMSC)
## MOISTURE PREDICTIONS
pred_moi_test_raw <- predict(model_moi_raw,testDataMoi)
pred_moi_test_msc <- predict(model_moi_msc,testDataMoiMSC)
## PREPARING DATA FOR MONITOR FUNCTION
compare<-cbind(moistureTest,pred_moi_test_raw,pred_moi_test_msc,
               fatTest,pred_fat_test_raw,pred_fat_test_msc,
               proteinTest,pred_prot_test_raw,pred_prot_test_msc)
ID<-seq_len(nrow(compare))              # one ID per validation sample
compare<-cbind(ID,compare)
#### MONITORING AND STATISTICS  #########################
monitor10c26_003(compare[,c(1,2,3)])    # Moisture: raw model
monitor10c26_003(compare[,c(1,2,4)])    # Moisture: MSC model
monitor10c26_003(compare[,c(1,5,6)])    # Fat: raw model
monitor10c26_003(compare[,c(1,5,7)])    # Fat: MSC model
monitor10c26_003(compare[,c(1,8,9)])    # Protein: raw model
monitor10c26_003(compare[,c(1,8,10)])   # Protein: MSC model
#' For Moisture and Fat there is an improvement using the model
#' with the MSC math treatment.
#' For Protein the results are almost the same, but with the
#' raw spectra there is a certain slope and intercept problem,
#' and if corrected there is an improvement in the statistics.
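monitor10c26_003 is my own monitor function (see the validation posts below). For readers without it, a minimal hypothetical substitute reporting the basic statistics for an (ID, reference, predicted) matrix could be:

monitor_simple <- function(m) {
  ref  <- m[, 2]                     # reference values
  pred <- m[, 3]                     # NIR predictions
  res  <- ref - pred
  bias <- mean(res)
  SEP  <- sqrt(sum((res - bias)^2) / (length(res) - 1))
  RMSEP <- sqrt(mean(res^2))
  c(N = nrow(m), Bias = bias, SEP = SEP, RMSEP = RMSEP)
}
monitor_simple(compare[, c(1, 2, 3)])  # Moisture: raw model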


3 oct. 2018

PC Regressions with CARET

I have started to use caret and I will continue using it, so I have to try a lot of things in R to become familiar with it.
 
Caret includes a data set (data(tecator)) from a Tecator instrument for meat analysis, working in transmittance in the range from 850 to 1050 nm, with a total of 100 data points.
 
The parameters are Moisture, Fat and Protein. You can play around with this data to become familiar with caret, so I tried to create a quick regression with PCR.
 
Caret lets us prepare the training and test data in a random order and train the model with several kinds of cross validation. So I wrote some code, apart from the help I found in the available caret documentation.

library(caret)
data(tecator)
set.seed(930)
colnames(absorp) <- paste0("x", 1:ncol(absorp))
## We will model the protein content data
trainMeats <- createDataPartition(endpoints[,3], p = 3/4)
absorpTrain  <- absorp[trainMeats[[1]], ]
proteinTrain <- endpoints[trainMeats[[1]], 3]
absorpTest   <- absorp[-trainMeats[[1]], ]
proteinTest  <- endpoints[-trainMeats[[1]], 3]
trainData<-cbind(proteinTrain,absorpTrain)
testData<-cbind(proteinTest,absorpTest)

model <- train(proteinTrain~.,data=trainData, method = "pcr",
               scale = TRUE,
               trControl = trainControl("cv", number = 10),
               tuneLength = 10)

names(model)
model$bestTune
summary(model$finalModel)
predictions <- predict(model,testData)
plot(predictions,proteinTest)
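To get numeric test-set statistics as well, caret's postResample helper can be used (a quick sketch):

postResample(pred = predictions, obs = proteinTest)   # RMSE, Rsquared, MAE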


With this code we get plots and statistics. This is my first step into caret, which I would like to explore in depth, so I hope to write more posts about this subject.

14 sept. 2018

Monitoring Validations (Case 001)

I use R regularly to study the validations of different equations; in this case it is an equation for cereals (barley, wheat, rye, corn, oat, triticale, ...). The monitor function in this case compares the starch values from an instrument considered the Master (Y axis) against another considered the Host (X axis).
 
The idea is to check whether there are important differences, in order to take action to adjust the bias or the slope and intercept, or even to consider standardizing the instruments.
 
In this case the monitor function gives a warning to check whether there are groups or extreme samples, which can recommend an adjustment of the slope and intercept.
 
 
And indeed there is a gap between two groups of samples, so we have to consider what is happening in this case: we have a group of barley samples with lower starch values and a group of wheat and corn samples with higher starch values.
 
In order to evaluate this better we have to make subsets, check what is going on with the prediction statistics by group, and proceed in the best way.
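A hypothetical sketch of how such a check by subsets could look in R (the data frame starch_df and its column names are assumptions for illustration):

# starch_df is assumed to hold the Master value, the Host value and the
# cereal species for each monitored sample
by(starch_df, starch_df$species, function(g) {
  res <- g$master - g$host
  c(N = nrow(g), Bias = mean(res), SEP = sd(res))
})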
 


30 ago. 2018

Comparing Posteriors: Estimating Practical Differences Between Models


This is not the first time Max Kuhn appears on this blog; this time it is with a lecture (at the last New York R Conference) with advice on how to estimate which model is best based on statistics in R. We can surely pick up good advice for finding the best possible model for our data sets.