NIR-Chemometrics - Near Infrared Transmitance Tutorial (part 2) (quarto.pub)
28 sept 2022
27 sept 2022
New Quarto blog for NIR and Chemometrics
I have created a new blog with Quarto, to work in R and R-Studio with spectroscopy
data available for Near Infrared . By the moment, a serial of posts are coming
about Near Infrared Transmitance with Meat data. It is a good ocasion to start
learning from the beginning to work with chemometrics in R with this type of
spectra
20 sept 2022
NIT Spectroscopic Tutorial with Caret (part 5)
Working these days with the tecator data set traying to find the better statistics as possible for the cross validation, and checking after with the test set. I use algorithms to remove the scatter but at the moment not derivatives to enhance the resolution. I show the statistics results for the CV and the validation results for the test set.
Multiple Scatter Correction has at the moment the better statistics for protein, but still some combinations of Math treatments could be tryed.
This is the XY plot for the test set with the pls model using the train set math treated with MSC.
16 sept 2022
NIT Spectroscopic Tutorial with Caret (part 4)
Looking to the spectra of the previous post was clear
that we must apply a pre-treatment to remove the baselines shifts. One way to
do it is to treat every spectrum individually, calculating a linear model
(absorbance values vs. wavelengths) and getting the values of the slope at
every wavelength, after these values are subtracted from the absorption values
of the spectrum getting the baseline corrected spectrum. We repeat the process
for every spectrum.
This function is available in the package pracma, and
its name is detrend. In case we use it for the spectra pre-process in Caret
with “center” and “scale”, we change the view from this:
To this:
This way we can see more clearly the variation in the
spectra that we can try to interpret some way. Also, with this new pre-treatment
we can check if the PCA change some way, and better to find outliers.
One of the options for interpretation are the
correlation spectrum (see the correlation of every wavelength with the
parameter of interest). In this case we can see how fat is inverse
correlated with protein and moisture, and there are some positive correlations
between protein and moisture.
We can see this if we run the correlation plot for the parameters;
12 sept 2022
NIT Spectroscopic Tutorial with Caret (part 3)
boxplot(train_scaled,
main = "preProcess(Center + Scale)")
In the plot we can see that are skewness to the right, and some outliers.
Now we
can apply the BoxCox algorithm together with Center and Scale and to look at
the boxplot spectra:
method = c("BoxCox", "center", "scale"))
boxplot(train_scaled_2,
The result shows how the skewness is removed. Of course, the PCA calculation will give different scores values, but still two terms will explain almost all the variance.
Looking to the spectra (treated either with “Center” and “Scale”, or “BoxCox”, “Center” and “Scale”) we see that there is a baseline shift that would be convenient that a pretreatment will remove in case we have to see clearly the variability of the fat, protein or moisture content.
6 sept 2022
NIT Spectroscopic Tutorial with Caret (part 2)
The first post of this tutorial brings a nice
conversation on tweeter. The idea now, is to use Caret with other packages (necessary for some of the prepocessing of the data), but I will work
in parallel at the same time with "tidymodels" packages (a good way
to learn how to use them).
Thanks to Max Kuhn for send me a link to some
interesting code from James Wade to use the Meat data in the tidyverse and
tidymodels environment. By the way the great book "Applied PredictiveModelling" (Max Kuhn & Kjell Johnson) is very useful and this tutorial
start as a possible solution for the exercise 6.1 in the book.
One of the problems to work with NIR or NIT spectra is the high collinearity of the predictors, this is due to NIR (Near Infrared Reflectance) and NIT (Near Infrared Transmitance) is formed by overtones (first, second, third) and combination fundamental bands which appear in the MIR (Middle Infrared). Other problem in this type of spectra is the high level of overlaping, and the NIR or NIT contains a lot of hidden information which needs from statistic or mathematical treatments to make them visible in a way that is usefull to develop models. This requires the use of preprocessing methods (centering, scaling, resolution improvement,...), that will be treated along the coming posts.
By the moment we will deal with the spectra without any treatment ("raw spectra"), and we create a correlation matrix to check the collinearity.
correlation <- cor(absorp)library(corrplot)
corrplot(correlation)
data_split <- createDataPartition(endpoints[ ,1], p = .75)
data_split <- data_split$Resample1
# split data
absorp_train <- absorp[data_split, ]
absorp_test <- absorp[-data_split, ]
- Center: The mean spectrum is sustracted from every spectrum.
- Scale : Every spectrum data point is divided by the standard deviation of all the data points in the spectrum.
We can see how center and scale affect to the spectra:
train_scaled <- scale(absorp_train, center = TRUE,
matplot(seq(850, 1048, by = 2), t(train_scaled),
xlab = "Wavelengths", ylab = "Absorbance",
main = "Meat spectra", type = "l")
These pretreatments are aplyed to every spectrum individually, so at the end of the preprocess (before developing the PCA) we have a transformed data matrix with the same dimensions (215 . 100).
pca_object <- prcomp(absorp_train, center = TRUE,
percent_variance <- pca_object$sdev^2/sum(pca_object$sd^2)*100
plot(percent_variance[1:20])
head(percent_variance)
head(pca_object$x[1:5 , 1:2 ]) #scores with just 2 terms
wavelengths <- seq(850, 1048, by = 2)
colnames(absorp_train) <- wavelengths
colnames(absorp_train)
trans <- preProcess(absorp_train,
trans #info about the PCA calculation
transformed <- predict(trans, absorp_train)
head(transformed[1:2 , ])
5 sept 2022
NIT Spectroscopic Tutorial with Caret (part 1)
Along several posts we will practice the use of Caret with spectroscopic data from an Infratec Instrument. The first thing we will do, is to load the library and the data which contain the spectroscopic and reference values.
#Loading the datadata(tecator)
?caret::tecator #Details about the tecator data
Two tables are loaded, one with the spectroscopic data (absorp), and another with the reference values (endpoints).
dim(endpoints)dim(absorp)
We can see that we have 215 samples measured at 100 data points in the spectroscopic data, and three parameters with reference values.
The data points goes from 850 to 1048 every two nm, so they are in total 100 data points, and the reference values are for Moisture, Fat and Protein. As we can read in the Help info, the samples are chopped pure meat measured in transmittance.
We can see first the structure of the data is numeric
str(absorp)str(endpoints)
and their class (matrix)
class(absorp)class(endpoints)
We have a matrix for the absorption values and a matrix for the reference values. Let´s see the raw spectra
matplot(seq(850, 1048, by = 2), t(absorp),
xlab = "Wavelengths", ylab = "Absorbance",
main = "Meat spectra", type = "l")
As we can see, due to the particle size, differences in the pathlength and other physical circumstances, there is quite a lot scatter in the spectra, so this physical variables are with others (chemical) data that we will have to treat and manage to develop models.