25 Jun 2019

More about Mahalanobis distance in R

There are several Mahalanobis distance posts in this blog, and this post shows a new way to find outliers with an R library called "mvoutlier".
 
As we have seen, Mahalanobis ellipses can only be drawn in two dimensions with a cutoff value, so we show the score maps two by two for the different combinations of PCs, as in this case for PC1 and PC2, and we can mark the outliers in the plot with the identify function:
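For example, a minimal sketch in R (assuming "X" is a matrix of spectra or other multivariate data already loaded):

pca <- prcomp(X, center = TRUE, scale. = TRUE)   # PCA on the data matrix
scores <- pca$x
plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2")
# click on the suspect points and press Esc to finish; returns the row indices
outliers <- identify(scores[, 1], scores[, 2], labels = rownames(X))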


In this case I mark some of the samples outside the Mahalanobis distance cutoff. Still, the Mahalanobis distance is univariate (one value per sample), so when we have a certain number of PCs we should not just look at maps of two of them at a time, or at all of them at once: we need a single Mahalanobis distance value per sample, and to check whether that value is over or under the cutoff value that we assign.
 
For that reason we use the Moutlier function from the "chemometrics" package and show a real Mahalanobis outlier plot, which can be robust or classical:
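A minimal sketch of the call (reusing the PCA scores from before; the quantile argument sets the chi-square cutoff, and keeping 5 PCs is just an example):

library(chemometrics)
res <- Moutlier(scores[, 1:5], quantile = 0.975, plot = TRUE)
res$cutoff                    # the cutoff value
which(res$md > res$cutoff)    # samples over the classical distance cutoff
which(res$rd > res$cutoff)    # samples over the robust distance cutoff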
 
We can see the classical plot and identify the samples over the cutoff:
 
We can see all the distances in the list returned by the function. I will continue with more options to check Mahalanobis distances in the next post.

24 Jun 2019

Validation problem (extrapolation)

Sometimes when validating a product for a certain constituent (in this case dry matter) we can see this type of X-Y plot:


This is not a nice validation at all, but first we have to notice that we have two clusters of lab values, one for lower and one for higher dry matter. So the first question is:
What is the range of the calibration samples in the model I am validating?

I check and see that the range for dry matter in the model is from 78.700 to 86.800, so I am validating with samples drier than the ones in the calibration.

It looks like a bias effect for those samples. Let's remove the samples that are in range and check the statistics for the samples out of range:

We see that we have a bias effect, and some slope caused by one of the samples. So this is a new source of variation with which to expand the calibration: merge the validation samples into the database and recalibrate, trying to make the new model robust to extrapolation.
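As a minimal sketch of that bias and slope check (the vectors "lab" and "pred" with the reference and predicted dry matter values are assumptions; 86.800 is the calibration maximum mentioned above):

out <- lab > 86.800                  # samples drier than the calibration range
bias <- mean(pred[out] - lab[out])   # bias of the extrapolated samples
sep <- sd(pred[out] - lab[out])      # SEP (bias corrected)
fit <- lm(pred[out] ~ lab[out])      # slope and intercept check
coef(fit)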


18 May 2019

set.seed function in R and also in Win ISI

It is common to see, at the beginning of some code, the "set.seed" function fixed to a number. The idea is to get reproducible results when working with functions that require random number generation. This is the case, for example, for Artificial Neural Network models, where the weights are selected randomly at the beginning and then change during the learning process.

Let's see what happens if set.seed() is not used:
library(nnet)
data(airquality)

model = nnet(Ozone ~ Wind, data = airquality, size = 4, linout = TRUE)

The results for the weights are:

# weights:  13
initial  value 340386.755571
iter  10 value 125143.482617
iter  20 value 114677.827890
iter  30 value 64060.355881
iter  40 value 61662.633170
final  value 61662.630819
converged

 
If we repeat the same process again:

model = nnet(Ozone ~ Wind, data = airquality, size = 4, linout = TRUE)

The results for the weights are different:

# weights:  13
initial  value 326114.338213
iter  10 value 125356.496387
iter  20 value 68060.365524
iter  30 value 61671.200838
final  value 61662.628120
converged
 

 
But if we set the seed to a certain value (whichever you like):

set.seed(1)
model = nnet(Ozone ~ Wind, data = airquality, size = 4, linout = TRUE)

 # weights:  13
initial  value 336050.392093
iter  10 value 67199.164471
iter  20 value 61402.103611
iter  30 value 61357.192666
iter  40 value 61356.342240
final  value 61356.324337
converged

 
and repeat the code with the same seed:

set.seed(1)
model = nnet(Ozone ~ Wind, data = airquality, size = 4, linout = TRUE)

we obtain the same results:

# weights:  13
initial  value 336050.392093
iter  10 value 67199.164471
iter  20 value 61402.103611
iter  30 value 61357.192666
iter  40 value 61356.342240
final  value 61356.324337
converged


set.seed is used in chemometric programs such as Win ISI to select samples randomly:
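A minimal sketch of such a reproducible random split into calibration and validation sets (the sample count and the 25 % split are hypothetical):

set.seed(123)                           # same seed, same selection every time
n <- 100                                # hypothetical number of samples
val_idx <- sample(n, size = 0.25 * n)   # 25 % of the samples for validation
cal_idx <- setdiff(seq_len(n), val_idx) # the rest stay for calibration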

2 May 2019

Using "tecator" data with Caret (part 4)

I add one more type of regression for the "tecator" meat data, in this case "Ridge Regression".
Ridge Regression uses all the predictors, but penalizes their coefficients so that they cannot take high values.

We can see that it does not achieve as good a fit as PCR or PLS for spectroscopy data, but it is quite common with other kinds of data in Machine Learning applications. Ridge Regression is a type of regularization; of the two common penalty types, L1 and L2, ridge uses the L2 penalty.
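A minimal sketch of how this can be fitted with caret (modelling fat, the second endpoint, as an illustration; the tuning length and cross-validation settings are my assumptions):

library(caret)
data(tecator)                        # loads 'absorp' (spectra) and 'endpoints'
X <- as.data.frame(absorp)
y <- endpoints[, 2]                  # fat content
set.seed(1)                          # reproducible resampling
ridge_fit <- train(x = X, y = y, method = "ridge", tuneLength = 10,
                   trControl = trainControl(method = "cv", number = 10),
                   preProcess = c("center", "scale"))
ridge_fit$results                    # RMSE over the lambda grid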

In the plot you can also see the RMSE for the validation set:

Of course PLS works better, but we must try other models and see how they affect the predicted values.

30 Apr 2019

What are the benefits of adding more data to the models?

One of the frequent questions before developing a calibration is: how many samples are necessary to develop a calibration? The quick answer is: as many as possible! Of course, it is obvious that they should contain variability and represent, as much as possible, the new data that can appear in the future.
 
The main sources of error are the "irreducible error" (error from the noise of the instrument itself), the unexplained error (variance), and the bias, and they follow certain rules depending on the number of samples we have; the total expected error can be seen as bias² + variance + irreducible error. Another thing to take into account is the complexity of the model (the number of coefficients, parameters, or terms we add to the regression).
 
Let's look at this plot:
Now, if we add more samples, the previous curves are kept as dashed lines and the bias, variance, and total error improve, while the optimal complexity (vertical black line) increases, and this is normal.
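To see these curves without real instrument data, here is a toy simulation in R (all data and settings are made up) that computes the total expected error as bias² + variance + irreducible error for polynomial models of increasing complexity:

set.seed(1)
true_f <- function(x) sin(2 * pi * x)        # the "true" relationship
x_test <- 0.5                                # point where the error is measured
degrees <- 1:10                              # model complexity
preds <- sapply(degrees, function(d) {
  replicate(200, {                           # 200 training sets per complexity
    x <- runif(50)
    y <- true_f(x) + rnorm(50, sd = 0.3)     # sd = 0.3 is the irreducible noise
    predict(lm(y ~ poly(x, d)), newdata = data.frame(x = x_test))
  })
})
bias2 <- (colMeans(preds) - true_f(x_test))^2
variance <- apply(preds, 2, var)
plot(degrees, bias2 + variance + 0.3^2, type = "b",
     xlab = "Model complexity (polynomial degree)",
     ylab = "Total expected error")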