11 ene. 2019

Correcting skewness with Box-Cox

With Caret we can use the function BoxCoxTrans to correct skewness. The function estimates the lambda value to plug into the Box-Cox formula and applies the correction. When lambda = 0 the Box-Cox transformation is equal to log(x); when lambda = 1 there is no skewness, so no transformation is needed; when lambda = 2 a square transformation is applied; and other transformations result depending on the lambda value.

If we use the Caret function "BoxCoxTrans" on the data from the previous post (correcting skewness with logs), we get this result:

> VarIntenCh3_Trans
Box-Cox Transformation

1009 data points used to estimate Lambda
Input data summary:
  Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.8693  37.0600  68.1300 101.7000 125.0000 757.0000

Largest/Smallest: 871
Sample Skewness: 2.39

Estimated Lambda: 0.1
With fudge factor, Lambda = 0 will be used for transformations

So, if we apply this transformation, we get the same skewness value and the same histogram as when applying logs.
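As a sketch of what BoxCoxTrans does under the hood, we can estimate lambda ourselves in base R with a profile-likelihood search over a grid (hand-rolled code, not Caret's implementation; the function names boxcox_lambda and boxcox_apply are mine, and strictly positive data is assumed):

```r
# Estimate the Box-Cox lambda by maximizing the profile log-likelihood
# over a grid of candidate values (assumes x > 0 everywhere)
boxcox_lambda <- function(x, lambdas = seq(-2, 2, by = 0.1)) {
  n <- length(x)
  loglik <- sapply(lambdas, function(l) {
    y <- if (abs(l) < 1e-8) log(x) else (x^l - 1) / l
    # Normal log-likelihood of the transformed data plus the Jacobian term
    -n / 2 * log(mean((y - mean(y))^2)) + (l - 1) * sum(log(x))
  })
  lambdas[which.max(loglik)]
}

# Apply the Box-Cox transformation for a given lambda
boxcox_apply <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Lognormal data should give an estimated lambda close to 0,
# which is exactly the "use the log" case reported by BoxCoxTrans
set.seed(1)
x <- exp(rnorm(500))
boxcox_lambda(x)
```

For lambda = 0 the transformation reduces to log(x), matching the fudge-factor decision in the output above.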

9 ene. 2019

Correcting the skewness with logs

It is recommended to look at the histograms to check whether the distributions of the predictors (variables or constituents) are skewed in some way. In this case I use a predictor from the original segmentation data in the "Applied Predictive Modeling" library, where we can find many predictors to check whether the cells are well or poorly segmented.
If you want to check the paper for this work, you can see this link:
One of the predictors in this work is VarIntenCh3, and we can check its histogram and sample skewness:
              [1] 2.391624
As we can see, the histogram is skewed to the right, so we can apply a transformation to the data to remove the skewness. There are several transformations; this time we try applying logs, and check the skewness again:
               [1] -0.4037864
As we can see, the histogram now looks closer to a Normal distribution, but is slightly skewed to the left.
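For reference, the sample skewness values printed above can be computed with a few lines of base R (a moment-ratio estimator with n in the denominators; packages like e1071 offer slightly different conventions, so values can differ a little):

```r
# Moment-ratio sample skewness: third central moment over sd^3
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / (mean(z^2))^(3/2)
}

# A right-skewed (lognormal) example: the skewness is clearly positive
# before the transformation and drops towards 0 after taking logs
set.seed(42)
x <- exp(rnorm(1000))
skewness(x)       # clearly positive (right skew)
skewness(log(x))  # close to 0 after the log transformation
```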

6 ene. 2019

Correlation Plots (Segmentation Data)

First I would like to wish the readers of this blog all the best for 2019.
It was recently my birthday and I received as a present the book "Applied Predictive Modeling", written by Max Kuhn and Kjell Johnson. It is really a great book for those who like R for predictive modelling and want to get more knowledge about multivariate analysis. Surely a lot of posts inspired by this book will come along this year.
I remember that when I started with R on this blog I posted plots of the correlation matrix to show how the wavelengths in a near infrared spectrum are correlated, and why for that reason we have to use techniques like PCA to create uncorrelated predictors.
In R there is a package named like the book, "Applied Predictive Modeling", where we can find the "Cell Segmentation Data", which Max Kuhn uses quite often in his webinars (you can find them on YouTube).
This Cell Segmentation Data has 61 predictors, and we want to see the correlation between them, so with some code we isolate the training data and use only the numeric values of the predictors to calculate the correlation matrix:


library(caret)           # createDataPartition and the segmentation data
library(corrplot)        # correlation plots

data(segmentationData)   # Load the segmentation data set
trainIndex <- createDataPartition(segmentationData$Case, p = .5, list = FALSE)
trainData <- segmentationData[trainIndex, ]
testData  <- segmentationData[-trainIndex, ]
trainX <- trainData[, 4:61]      # only numeric values
M <- cor(trainX)                 # correlation matrix of the predictors

corrplot(M, tl.cex = 0.3)

This way we get a nice correlation plot:

 This plot is easier to check than the whole correlation matrix in numbers.

Now we can isolate areas of this matrix, like the one which shows higher correlation between the variables:

corrplot(M[14:20, 14:20], tl.cex = 0.8)
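To go beyond eyeballing the plot, a few lines of base R can list the most correlated pairs from the same matrix M (the helper name top_pairs and the 0.9 cutoff are my own choices; caret also offers findCorrelation for a related job):

```r
# List the predictor pairs whose absolute correlation exceeds a cutoff,
# scanning only the upper triangle so each pair appears once
top_pairs <- function(M, cutoff = 0.9) {
  idx <- which(abs(M) > cutoff & upper.tri(M), arr.ind = TRUE)
  data.frame(var1 = rownames(M)[idx[, 1]],
             var2 = colnames(M)[idx[, 2]],
             r    = M[idx])
}

# Small example: a and b are perfectly correlated, c is not
M <- cor(data.frame(a = 1:10,
                    b = 2 * (1:10),
                    c = c(5, 1, 4, 2, 8, 3, 9, 2, 7, 1)))
top_pairs(M)  # one row: the a-b pair with r = 1
```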

6 dic. 2018

Foss Calibrator (quick mPLS overview)

I am starting to use the new software Foss Calibrator, so I will publish some posts about how it works. In this case I use the software with some meat samples for a calibration feasibility study. The software simplifies the split of the sample set into a calibration set and a validation set, giving several options (random, time based, ...). We can also choose whether the validation set falls within the range of the calibration set, so that the model has all the validation samples inside the constituent range of the calibration; this way we quickly have the calibration and validation sets ready to develop the calibration.

For the calibration we have several options for the cross validation (leave one out, blocks, venetian blinds, ...).
For developing the calibration we can choose between the options mPLS, PLS, ANN or LOCAL. In this case I try the mPLS models.
We can select the wavelength range, so we have to look at the spectra to see how they look and remove the noisy parts, or remove the visible region, ...
The XY plot of Measured vs Predicted shows the calibration and validation samples overlapped and is quite useful to get a quick idea of the performance of the model.
We have also the plot of the GH distances with the calibration and validation values overlapped:

Looking at the statistics of the model, this time the RMSEP is the total error and the SEP is the error after bias correction, which makes it easier to compare the results with other software or with the literature.
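The relation between those statistics can be written down in a few lines of base R (formulas as commonly used in the NIR literature; I am assuming Foss Calibrator follows the same definitions):

```r
# Validation statistics: total error (RMSEP), systematic error (bias)
# and bias-corrected error (SEP, with the usual n - 1 denominator)
rmsep <- function(pred, ref) sqrt(mean((pred - ref)^2))
bias  <- function(pred, ref) mean(pred - ref)
sep   <- function(pred, ref) {
  e <- pred - ref
  sqrt(sum((e - mean(e))^2) / (length(e) - 1))
}

# A purely systematic error shows up in the bias, not in the SEP
ref  <- c(1, 2, 3, 4, 5)
pred <- ref + 0.5
rmsep(pred, ref)  # 0.5
bias(pred, ref)   # 0.5
sep(pred, ref)    # 0
```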
We can publish the model (calibration and outlier model together) to a folder on our PC and get the ".eqa", ".pca" and ".lib" files to use in Win ISI or to load in MOSAIC Network or Solo, and get a report of the calibration.
I will continue sharing my experience with Foss Calibrator under the label "Learning Foss Calibrator".

1 dic. 2018

"R" en la Jornadas Técnicas NIR de FOSS

Last Tuesday, November 27th, the Jornadas Técnicas NIR de FOSS were held in Madrid, an annual event in which FOSS presents in Spain the new trends in FOSS NIR instrumentation, aimed at digitalization with products such as "Foss Assure", "Mosaic" and "Foss Calibrator", among others. With them, instruments like the DS2500 or the DA1650 (among others) are becoming products that will be able to benefit from these digital tools and keep evolving with them.

Foss Calibrator will be the new calibration platform that will succeed Win ISI in the near future. Of course Win ISI was present at the event, and it was also a pleasure for me to have "R" present in my talk.

"R" aroused great interest, which I am glad about, since we users of this software want to promote it so that its enormous potential is seen and used.

11 nov. 2018

Variable Importance in NIR "PLS" Models (CARET)

"varImp" is a function of the R Caret package to check the importance of the variables in a regression. In the case of the model developed with the sunflower seeds to determine oleic acid (model_oleic), we can check which variables have more importance, and this is done in a simple step:
The best way to check it is plotting the results as a spectrum, setting the y-axis limits from just below the minimum to the maximum of varImp_pls$importance, to obtain this spectrum:
We can see that the zone from 1700 to 1800 nm has higher importance than the rest, due to the peaks linked to the oil around 1720 and 1760 nm.
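Caret's varImp for PLS models weighs the loadings by the variance explained, but as a crude, self-contained stand-in we can rank wavelengths by their absolute correlation with the constituent (the function name importance_spectrum and the toy matrix are mine, just to show the idea):

```r
# Crude importance proxy: |correlation| of each wavelength with the response
importance_spectrum <- function(X, y) {
  apply(X, 2, function(col) abs(cor(col, y)))
}

# Toy example: w1 tracks the response perfectly, w2 does not
X <- cbind(w1 = 1:20, w2 = rep(c(1, 2), 10))
y <- 3 * (1:20)
imp <- importance_spectrum(X, y)
# plot(imp, type = "l")   # against wavelength this reads as a spectrum
```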