6 sept 2022

NIT Spectroscopic Tutorial with Caret (part 2)

The first post of this tutorial sparked a nice conversation on Twitter. The idea now is to use Caret together with other packages (necessary for some of the preprocessing of the data), while I work in parallel with the "tidymodels" packages (a good way to learn how to use them).

Thanks to Max Kuhn for sending me a link to some interesting code from James Wade that works with the Meat data in the tidyverse and tidymodels environment. By the way, the great book "Applied Predictive Modeling" (Max Kuhn & Kjell Johnson) is very useful, and this tutorial started as a possible solution to exercise 6.1 in the book.

One of the problems when working with NIR or NIT spectra is the high collinearity of the predictors. This is because the NIR (Near Infrared Reflectance) and NIT (Near Infrared Transmittance) regions are formed by overtones (first, second, third) and combinations of the fundamental bands which appear in the MIR (Mid Infrared). Another problem with this type of spectra is the high level of overlapping: NIR or NIT spectra contain a lot of hidden information which needs statistical or mathematical treatments to make it visible in a way that is useful for developing models. This requires the use of preprocessing methods (centering, scaling, resolution improvement, ...), which will be treated in the coming posts.

For the moment we will work with the spectra without any treatment ("raw spectra"), and we create a correlation matrix to check the collinearity.

library(caret)
library(corrplot)

data(tecator)                # loads the 'absorp' spectra and the 'endpoints' matrix
correlation <- cor(absorp)   # correlation between the wavelength variables
corrplot(correlation)




As we can see, all the wavelengths are highly correlated, so we have to reduce the number of variables (predictors) in some way, looking for new variables that are uncorrelated and capture most of the variability contained in the spectra matrix without introducing noise. This can be done with principal component analysis (PCA). Depending on the number of samples, we could keep as many new variables (we will call them "terms") as predictors, but that will never be the case: we have to decide how many terms to keep so that we neither lose information (underfitting) nor introduce noise (overfitting).


Before running the principal component analysis, we divide the whole data set into a training set and a test set.

Creating a training/test split: normally we use a random 75% of the samples for the training set and the remaining 25% for the test set.

set.seed(1234)

# Create the partition index on the first response column
data_split <- createDataPartition(endpoints[, 1], p = .75)
data_split <- data_split$Resample1

# Split the spectra matrix
absorp_train <- absorp[data_split, ]
absorp_test  <- absorp[-data_split, ]
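
Just to check (a small addition, not in the original code), we can confirm the approximate 75/25 split:

nrow(absorp_train)   # about 75% of the 215 samples
nrow(absorp_test)    # the remaining samples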



The test set is set apart and won't be used until necessary (at the end of the model development). Avoid using it to make decisions; it must remain totally independent.

When running the PCA we have the option to apply some pretreatments:

  • Center: the mean spectrum (the mean of every wavelength over all the training samples) is subtracted from every spectrum.
  • Scale: every data point is divided by the standard deviation of its wavelength over all the training samples.

We can see how centering and scaling affect the spectra:

train_scaled <- scale(absorp_train, center = TRUE, 
                      scale = TRUE)
matplot(seq(850, 1048, by = 2), t(train_scaled),
        xlab = "Wavelengths", ylab = "Absorbance (centered & scaled)",
        main = "Meat spectra", type = "l")




These pretreatments are applied wavelength by wavelength (column-wise), so at the end of the preprocessing (before developing the PCA) we have a transformed data matrix with the same dimensions as the training matrix.
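
As a quick check (a minimal sketch using the train_scaled matrix created above), the dimensions are unchanged and every wavelength column now has mean 0 and standard deviation 1:

dim(train_scaled)                         # same dimensions as absorp_train
round(colMeans(train_scaled)[1:3], 12)    # column means are ~0 after centering
apply(train_scaled[, 1:3], 2, sd)         # column standard deviations are 1 after scaling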

pca_object <- prcomp(absorp_train, center = TRUE, 
                     scale. = TRUE)

# percentage of the total variance explained by each component
percent_variance <- pca_object$sdev^2 / sum(pca_object$sdev^2) * 100
plot(percent_variance[1:20])
head(percent_variance)

pca_object$x[1:5, 1:2]   # scores of the first five samples with just 2 terms
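
As an extra visual check (not in the original code), we can plot the scores of the first two principal components against each other:

plot(pca_object$x[, 1], pca_object$x[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "Scores of the first two principal components")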



We can see that just a few terms (two or three) are enough to retain almost all the variance, but this time let's calculate the principal components with Caret. First we have to give column names to the absorp_train matrix:

wavelengths <- seq(850, 1048, by = 2)
colnames(absorp_train) <- wavelengths
colnames(absorp_train)

trans <- preProcess(absorp_train, 
                    method = c("center", "scale", "pca"))
trans #info about the PCA calculation
transformed <- predict(trans, absorp_train)
transformed[1:2, ] 
# scores with the 2 recommended terms (95% of the variance explained)

By default the threshold for the cumulative explained variance is 95%, but we can increase it if necessary. (In the coming posts we will apply other methods to see if correcting the skewness of the predictors improves the results.)
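
For example (a minimal sketch; thresh is the preProcess argument that sets the cumulative variance cutoff), we can raise the threshold to 99%:

trans_99 <- preProcess(absorp_train, 
                       method = c("center", "scale", "pca"),
                       thresh = 0.99)
trans_99   # more components are kept to reach 99% of the variance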

Both methods give the same score and loading matrices, which we will interpret in a coming post.
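
As a parallel in the tidymodels environment (just a sketch, assuming the recipes package; step_pca() takes a threshold argument analogous to thresh in preProcess), the same preprocessing could be written as a recipe:

library(recipes)

meat_recipe <- recipe(~ ., data = as.data.frame(absorp_train)) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), threshold = 0.95)

meat_prep   <- prep(meat_recipe)
meat_scores <- bake(meat_prep, new_data = NULL)   # PCA scores, as with predict(trans, absorp_train)
head(meat_scores)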
