27 sept 2022

GLOSOLAN Soil Spectroscopy Webinar #11 | Soil sample preparation for MIR...

New Quarto blog for NIR and Chemometrics

 

I have created a new blog with Quarto to work with Near Infrared spectroscopy data in R and RStudio. For now, a series of posts is coming about Near Infrared Transmittance with meat data. It is a good occasion to start learning, from the beginning, how to work with chemometrics in R with this type of spectra.


https://jrcuesta.quarto.pub/nir-chemometrics/

20 sept 2022

NIT Spectroscopic Tutorial with Caret (part 5)

Working these days with the tecator data set, trying to find the best possible statistics for the cross validation, and checking them afterwards with the test set. I use algorithms to remove the scatter, but for the moment no derivatives to enhance the resolution. I show the statistics for the CV and the validation results for the test set.

Multiplicative Scatter Correction (MSC) gives the best statistics for protein at the moment, but some combinations of math treatments could still be tried.


This is the XY plot for the test set with the PLS model, using the train set math-treated with MSC.
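A minimal base-R sketch of the MSC idea, with simulated spectra and function names of my own (not the tecator workflow; packages such as prospectr offer ready-made implementations): each spectrum is regressed against the mean spectrum, and the fitted intercept and slope are used to correct it.

```r
# Multiplicative Scatter Correction (MSC), minimal base-R sketch:
# regress each spectrum on the mean spectrum and correct with the
# fitted intercept (additive scatter) and slope (multiplicative scatter).
set.seed(1)
wl <- seq(850, 1048, by = 2)                   # 100 wavelengths
base_curve <- sin(wl / 40)                     # common "chemical" signal
# 10 simulated spectra: signal plus a random offset and multiplicative scatter
spectra <- t(sapply(1:10, function(i) 0.1 * i + (1 + 0.05 * i) * base_curve))

msc <- function(X) {
  ref <- colMeans(X)                           # reference (mean) spectrum
  t(apply(X, 1, function(s) {
    fit <- lm(s ~ ref)                         # s = a + b * ref + e
    (s - coef(fit)[1]) / coef(fit)[2]          # remove offset, rescale slope
  }))
}

spectra_msc <- msc(spectra)
```

In this toy example the scatter is exactly affine, so all corrected spectra collapse onto the mean spectrum; on real spectra the chemical differences remain after correction.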



16 sept 2022

NIT Spectroscopic Tutorial with Caret (part 4)

Looking at the spectra of the previous post, it was clear that we must apply a pre-treatment to remove the baseline shifts. One way to do it is to treat every spectrum individually: fit a linear model (absorbance values vs. wavelengths), get the fitted values at every wavelength, and subtract them from the absorbance values of the spectrum, obtaining the baseline-corrected spectrum. We repeat the process for every spectrum.
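As a sketch of this baseline removal in base R (simulated spectrum and names of my own, for illustration only):

```r
# Detrend one spectrum: fit absorbance ~ wavelength and subtract the
# fitted straight line, leaving the baseline-corrected spectrum.
wl <- seq(850, 1048, by = 2)
set.seed(2)
spectrum <- 0.002 * wl + sin(wl / 30) + rnorm(length(wl), sd = 0.01)

detrend_one <- function(s, x) {
  fit <- lm(s ~ x)          # linear baseline model for this spectrum
  s - fitted(fit)           # subtract the fitted baseline values
}

corrected <- detrend_one(spectrum, wl)
# For a whole spectra matrix (rows = spectra), apply it row by row:
# corrected_mat <- t(apply(spectra, 1, detrend_one, x = wl))
```

Because the subtracted values are the fitted values of a least-squares line, the corrected spectrum has zero mean and no remaining linear trend against wavelength.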

This function is available in the pracma package under the name detrend. If we use it for the spectra pre-processing in Caret together with “center” and “scale”, the view changes from this:

 


To this:

 


This way we can see more clearly the variation in the spectra, which we can try to interpret in some way. Also, with this new pre-treatment we can check whether the PCA changes, and whether it becomes easier to find outliers.

One of the options for interpretation is the correlation spectrum (the correlation of every wavelength with the parameter of interest). In this case we can see how fat is inversely correlated with protein and moisture, and there is some positive correlation between protein and moisture.


We can see this if we run the correlation plot for the parameters:
correlation_par <- cor(endpoints)
library(corrplot)
corrplot(correlation_par, method = "ellipse")




See the base R code for the correlation spectra:

cor_moi <- cor(endpoints_train[ , 1], 
               train_scaled_fit)
cor_fat <- cor(endpoints_train[ , 2], 
               train_scaled_fit)
cor_prot <- cor(endpoints_train[ , 3], 
                train_scaled_fit)

matplot(seq(850, 1048, by = 2), t(cor_moi),
        xlab = "Wavelengths", ylab = "Correlation", 
        main = "Correlation Spectra", type = "l", 
        ylim = c(-1, 1))

par(new = TRUE)

matplot(seq(850, 1048, by = 2), t(cor_fat),
        xlab = " ", ylab = " ", 
        main = " ", type = "l", col = "red", 
        ylim = c(-1, 1))

par(new = TRUE)

matplot(seq(850, 1048, by = 2), t(cor_prot),
        xlab = " ", ylab = " ", 
        main = " ", type = "l", col = "blue", 
        ylim = c(-1, 1))


legend("topright",                    
       legend = c("Moisture", "Fat", "Protein"),
       col = c("black", "red", "blue"),
       lty = 1)

12 sept 2022

NIT Spectroscopic Tutorial with Caret (part 3)

To follow the tutorial, you can read the previous posts first.
We used two pretreatments in the previous posts to develop the Principal Component Analysis, but we can add another one to remove the skewness of the predictor variables. We could see the skewness by plotting a histogram of the absorbances at every wavelength, but that would be 100 histograms to look at, so a better way to check it is a boxplot of the spectra (centred and scaled):

boxplot(train_scaled, main = "preProcess(Center + Scale)")

In the plot we can see that there is skewness to the right, and some outliers.
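Besides the boxplot, the skewness can be quantified per wavelength. A base-R sketch of the third standardized moment (simulated right-skewed data and names of my own; packages such as e1071 provide a skewness function):

```r
# Sample skewness of a vector: third standardized moment.
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(3)
# 500 "samples" at 10 "wavelengths", exponentially distributed
# so they are skewed to the right, like the boxplot suggests.
right_skewed <- matrix(rexp(5000), nrow = 500)
col_skew <- apply(right_skewed, 2, skew)
# Positive values confirm the right skewness seen in the boxplot.
```

A symmetric variable gives a skewness near zero; values clearly above zero match the long right whiskers of the boxplot.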

Now we can apply the BoxCox algorithm together with Center and Scale and look at the boxplot of the spectra:

train_prep_2 <- preProcess(absorp_train,
                           method = c("BoxCox", "center", "scale"))
train_scaled_2 <- predict(train_prep_2, absorp_train)

boxplot(train_scaled_2, 
        main = "preProcess(BoxCox + Center + Scale)")

The result shows how the skewness is removed. Of course, the PCA calculation will give different score values, but two terms will still explain almost all the variance.
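What preProcess applies per wavelength is the Box-Cox power transform (caret estimates the lambda from the data; in the sketch below lambda is given by hand, for illustration only):

```r
# Box-Cox transform of a positive variable for a given lambda:
# (x^lambda - 1) / lambda, and log(x) in the limit lambda -> 0.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

set.seed(4)
x <- rexp(1000) + 0.1          # right-skewed positive data
x_log <- box_cox(x, 0)         # equals log(x), a strong skewness correction
x_id  <- box_cox(x, 1)         # equals x - 1, only a shift (no correction)
```

Lambdas below 1 compress the long right tail, which is why the transformed boxplots look much more symmetric.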

Looking at the spectra (treated either with “center” and “scale”, or with “BoxCox”, “center” and “scale”) we see that there is a baseline shift which it would be convenient to remove with a pretreatment, if we want to see clearly the variability due to the fat, protein or moisture content.

6 sept 2022

NIT Spectroscopic Tutorial with Caret (part 2)

The first post of this tutorial brought a nice conversation on Twitter. The idea now is to use Caret together with other packages (necessary for some of the preprocessing of the data), but I will also work in parallel with the "tidymodels" packages (a good way to learn how to use them).

Thanks to Max Kuhn for sending me a link to some interesting code from James Wade that uses the meat data in the tidyverse and tidymodels environment. By the way, the great book "Applied Predictive Modeling" (Max Kuhn & Kjell Johnson) is very useful, and this tutorial started as a possible solution to exercise 6.1 in the book.

One of the problems when working with NIR or NIT spectra is the high collinearity of the predictors: NIR (Near Infrared Reflectance) and NIT (Near Infrared Transmittance) spectra are formed by overtones (first, second, third) and combinations of the fundamental bands which appear in the MIR (Mid Infrared). Another problem with this type of spectra is the high level of overlapping; NIR or NIT spectra contain a lot of hidden information which needs statistical or mathematical treatments to make it visible in a way that is useful to develop models. This requires the use of preprocessing methods (centering, scaling, resolution improvement, ...), which will be treated in the coming posts.

For the moment we will deal with the spectra without any treatment ("raw spectra"), and create a correlation matrix to check the collinearity.

correlation <- cor(absorp)
library(corrplot)
corrplot(correlation)




As we can see, all the wavelengths are highly correlated, so we have to reduce the number of variables (predictors) somehow, looking for new variables that are not correlated and capture most of the variability contained in the spectral matrix without introducing noise. This can be done with principal component analysis (PCA). Depending on the number of samples we have, we could use as many new variables (we will call them "terms") as predictors, but this will never be the case: we have to decide how many terms to keep so that we do not lose information (underfitting) and do not introduce noise (overfitting).


Before running the principal component analysis we divide the whole data set into a train set and a test set.

Creating a training/test split: normally we use a random 75% of the samples for the training set and the remaining 25% for the test set.

set.seed(1234)
# Create partition index
data_split <- createDataPartition(endpoints[ , 1], p = .75)
data_split <- data_split$Resample1

# Split spectra and reference values
absorp_train <- absorp[data_split, ]
absorp_test <- absorp[-data_split, ]
endpoints_train <- endpoints[data_split, ]
endpoints_test <- endpoints[-data_split, ]



The test set is set apart and won't be used until necessary (at the end of the model development). Refuse to use it to take decisions; it must be totally independent.

When running the PCA we have the option to use some pretreatments:

  • Center: the mean spectrum is subtracted from every spectrum.
  • Scale: every data point is divided by the standard deviation of its wavelength across all the spectra.

We can see how center and scale affect the spectra:

train_scaled <- scale(absorp_train, center = TRUE, 
                      scale = TRUE)
matplot(seq(850, 1048, by = 2), t(train_scaled),
        xlab = "Wavelengths", ylab = "Absorbance",
        main = "Meat spectra", type = "l")




These pretreatments are applied to every wavelength (column) individually, so at the end of the preprocessing (before developing the PCA) we have a transformed data matrix with the same dimensions (215 x 100).
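We can verify that scale() is doing exactly the column-wise operations described above, subtracting the mean spectrum and dividing by the per-wavelength standard deviation (a simulated matrix stands in for absorp_train here):

```r
set.seed(5)
X <- matrix(rnorm(215 * 100, mean = 3), nrow = 215)  # stands in for absorp_train

# Built-in centering and scaling
auto <- scale(X, center = TRUE, scale = TRUE)

# Manual version: sweep out the column means, then the column standard deviations
manual <- sweep(X, 2, colMeans(X), "-")
manual <- sweep(manual, 2, apply(X, 2, sd), "/")
```

Both matrices agree element by element, which confirms the per-wavelength interpretation of "center" and "scale".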

pca_object <- prcomp(absorp_train, center = TRUE, 
                     scale. = TRUE)
percent_variance <- pca_object$sdev^2 / sum(pca_object$sdev^2) * 100
plot(percent_variance[1:20])
head(percent_variance)
pca_object$x[1:5, 1:2]
# scores with just 2 terms



We can see that just a few terms (two or three) are enough to retain almost all the variance, but this time let's calculate the principal components with Caret. First we have to give column names to the absorp_train matrix:

wavelengths <- seq(850, 1048, by = 2)
colnames(absorp_train) <- wavelengths
colnames(absorp_train)

trans <- preProcess(absorp_train, 
                    method = c("center", "scale", "pca"))
trans # info about the PCA calculation
transformed <- predict(trans, absorp_train)
transformed[1:2, ]
# scores with the 2 recommended terms (95% variance explained)

By default the threshold for the explained variance is 95%, but we can increase it if necessary. (In the coming post we will check whether applying a correction to the skewness of the predictors improves the results.)

Both methods give the same score and loading matrices, which we will interpret in a coming post.
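The scores are nothing more than the pre-treated spectra projected on the loadings; we can check this relation with prcomp on a simulated matrix (illustrative data, not tecator):

```r
set.seed(6)
X <- matrix(rnorm(50 * 10), nrow = 50)       # stands in for the spectra matrix

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Scores = (centered + scaled data) %*% loadings
X_scaled <- scale(X, center = TRUE, scale = TRUE)
scores_manual <- X_scaled %*% pca$rotation
```

This is why any method that centers and scales the same way and extracts the same loadings must reproduce the same scores (up to the usual sign ambiguity of the components).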

5 sept 2022

NIT Spectroscopic Tutorial with Caret (part 1)

Over several posts we will practice the use of Caret with spectroscopic data from an Infratec instrument. The first thing we will do is load the library and the data, which contain the spectra and the reference values.

# Loading the data
library(caret)
data(tecator)
?caret::tecator # details about the tecator data

Two tables are loaded, one with the spectroscopic data (absorp), and another with the reference values (endpoints).

dim(endpoints)
dim(absorp)

We can see that we have 215 samples measured at 100 data points in the spectroscopic data, and three parameters with reference values.

The data points go from 850 to 1048 nm, every two nm, so there are 100 data points in total, and the reference values are for Moisture, Fat and Protein. As we can read in the help info, the samples are chopped pure meat measured in transmittance.
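A quick check of that wavelength grid:

```r
# 850 to 1048 nm in steps of 2 nm gives 100 data points
wavelengths <- seq(850, 1048, by = 2)
length(wavelengths)
range(wavelengths)
```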

We can see first that the structure of the data is numeric:

str(absorp)
str(endpoints)

and their class (matrix):

class(absorp)
class(endpoints)

We have a matrix for the absorption values and a matrix for the reference values. Let's see the raw spectra:

matplot(seq(850, 1048, by = 2), t(absorp),
        xlab = "Wavelengths", ylab = "Absorbance",
        main = "Meat spectra", type = "l")

As we can see, due to the particle size, differences in the pathlength and other physical circumstances, there is quite a lot of scatter in the spectra; these physical effects are mixed with the chemical information, and we will have to treat and manage both to develop models.