The first post of this tutorial sparked a nice
conversation on Twitter. The idea now is to use caret together with other packages (needed for some of the data preprocessing), while working
in parallel with the "tidymodels" packages (a good way
to learn how to use them).
Thanks to Max Kuhn for sending me a link to some
interesting code from James Wade that uses the Meat data in the tidyverse and
tidymodels environment. By the way, the great book "Applied Predictive Modeling" (Max Kuhn & Kjell Johnson) is very useful, and this tutorial
started as a possible solution to exercise 6.1 in the book.
One of the problems when working with NIR or NIT spectra is the high collinearity of the predictors. This is because NIR (Near Infrared Reflectance) and NIT (Near Infrared Transmittance) spectra are formed by overtones (first, second, third) and combinations of the fundamental bands that appear in the MIR (Mid Infrared). Another problem with this type of spectra is the high level of overlapping: NIR and NIT spectra contain a lot of hidden information that needs statistical or mathematical treatment to become visible in a way that is useful for developing models. This requires preprocessing methods (centering, scaling, resolution improvement, ...), which will be covered in the coming posts.
For the moment we will work with the spectra without any treatment ("raw spectra"), and we create a correlation matrix to check the collinearity.
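library(caret)     # provides the Meat (tecator) data and createDataPartition()
library(corrplot)
data(tecator)      # loads the absorp and endpoints matrices
correlation <- cor(absorp)
corrplot(correlation)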
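To put a number on the collinearity that the plot shows, the off-diagonal correlations can be summarized directly. The following is only a minimal sketch on simulated data (not the Meat spectra): `sim_absorp`, `base_curve` and `intensity` are made-up names, and the smooth curve is just a stand-in for a broad absorption band.

```r
# Simulated stand-in for a spectral matrix (NOT the Meat data):
# 50 "spectra" x 20 "wavelengths", each a scaled copy of one smooth curve plus noise
set.seed(1)
wl <- seq(850, 888, by = 2)                  # 20 hypothetical wavelengths
base_curve <- sin((wl - 850) / 40) + 2       # one smooth underlying band
intensity <- runif(50, 0.5, 1.5)             # sample-to-sample intensity
sim_absorp <- outer(intensity, base_curve) +
  matrix(rnorm(50 * 20, sd = 0.01), nrow = 50)

sim_cor <- cor(sim_absorp)

# Keep only the off-diagonal correlations (upper triangle)
off_diag <- sim_cor[upper.tri(sim_cor)]
summary(off_diag)

# Share of wavelength pairs that are nearly redundant
mean(abs(off_diag) > 0.95)
```

With real spectra the same two last lines could be applied to `correlation`, assuming it was computed as above.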
# row indices for the training set: 75% of the samples, balanced on the first endpoint
data_split <- createDataPartition(endpoints[, 1], p = .75)
data_split <- data_split$Resample1
# split data
absorp_train <- absorp[data_split, ]
absorp_test <- absorp[-data_split, ]
- Center: The mean spectrum is subtracted from every spectrum.
- Scale: Every data point is divided by the standard deviation of its wavelength (column), computed across all the spectra.
We can see how centering and scaling affect the spectra:
train_scaled <- scale(absorp_train, center = TRUE,
                      scale = TRUE)
matplot(seq(850, 1048, by = 2), t(train_scaled),
        xlab = "Wavelengths", ylab = "Absorbance",
        main = "Meat spectra", type = "l")
These pretreatments are applied to every spectrum, so at the end of the preprocessing (before computing the PCA) we have a transformed data matrix with the same dimensions as the input (215 x 100 for the complete Meat data set, fewer rows for the training subset).
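A minimal sketch of what `scale()` does, using a small made-up matrix rather than the Meat spectra: it works column by column, so each wavelength is centered and scaled with its own mean and standard deviation, and the result keeps the dimensions of the input. The names `toy`, `toy_scaled` and `by_hand` are only illustrative.

```r
# Small made-up matrix: 4 "spectra" (rows) x 3 "wavelengths" (columns)
toy <- matrix(c(1.0, 2.0, 3.0,
                1.2, 2.5, 3.1,
                0.9, 1.8, 2.7,
                1.1, 2.3, 3.4),
              nrow = 4, byrow = TRUE)

toy_scaled <- scale(toy, center = TRUE, scale = TRUE)

# The same thing done by hand, column by column
col_means <- colMeans(toy)
col_sds   <- apply(toy, 2, sd)
by_hand   <- sweep(sweep(toy, 2, col_means, "-"), 2, col_sds, "/")

dim(toy_scaled)                    # same dimensions as toy
all.equal(by_hand, toy_scaled, check.attributes = FALSE)
round(colMeans(toy_scaled), 10)    # column means after centering
apply(toy_scaled, 2, sd)           # column standard deviations after scaling
```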
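# PCA on the training spectra; scale. = TRUE is assumed here,
# matching the center-and-scale step above
pca_object <- prcomp(absorp_train, center = TRUE,
                     scale. = TRUE)
# percentage of variance explained by each component
percent_variance <- pca_object$sdev^2 / sum(pca_object$sdev^2) * 100
plot(percent_variance[1:20])
head(percent_variance)
head(pca_object$x[1:5, 1:2])  # scores with just 2 terms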
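The percentages above are often easier to read as a running total. Below is a minimal sketch on simulated data (not the Meat spectra) that accumulates the explained variance and picks the smallest number of components reaching a cutoff. The 95% cutoff and the names `sim_x`, `sim_pca` and `n_comp` are assumptions chosen for illustration.

```r
# Simulated stand-in (NOT the Meat data): 40 samples x 10 correlated variables
set.seed(42)
latent   <- matrix(rnorm(40 * 2), nrow = 40)          # two hidden sources of variation
loadings <- matrix(runif(2 * 10, -1, 1), nrow = 2)
sim_x    <- latent %*% loadings +
  matrix(rnorm(40 * 10, sd = 0.05), nrow = 40)

sim_pca <- prcomp(sim_x, center = TRUE, scale. = TRUE)

# Percentage of variance per component, then the running total
pct_var <- sim_pca$sdev^2 / sum(sim_pca$sdev^2) * 100
cum_var <- cumsum(pct_var)
round(cum_var, 2)

# Smallest number of components explaining at least 95% of the variance
n_comp <- which(cum_var >= 95)[1]
n_comp
```

The same `cumsum()` call could be applied to `percent_variance` from the PCA on the training spectra.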
wavelengths <- seq(850, 1048, by = 2)
colnames(absorp_train) <- wavelengths
colnames(absorp_train)
# method assumed here: center, scale and PCA
trans <- preProcess(absorp_train,
                    method = c("center", "scale", "pca"))
trans  # info about the PCA calculation
transformed <- predict(trans, absorp_train)
head(transformed[1:2, ])
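The held-out spectra should be projected with the centering, scaling and loadings learned from the training set, not with their own. The following is a minimal base R sketch on simulated data (not the Meat spectra) showing what that projection amounts to. The 22/8 split and the names `sim_train`, `sim_test` and `pca_train` are hypothetical.

```r
# Simulated stand-in (NOT the Meat data): 30 samples x 6 variables
set.seed(7)
sim_x <- matrix(rnorm(30 * 6), nrow = 30)
sim_x[, 2] <- sim_x[, 1] + rnorm(30, sd = 0.1)   # make two columns collinear

train_idx <- 1:22                                # hypothetical 22/8 split
sim_train <- sim_x[train_idx, ]
sim_test  <- sim_x[-train_idx, ]

# Fit centering, scaling and PCA on the training rows only
pca_train <- prcomp(sim_train, center = TRUE, scale. = TRUE)

# Project the held-out rows with the training means, sds and loadings
test_scores <- predict(pca_train, newdata = sim_test)

# Same projection by hand, to show what predict() is doing
test_std <- scale(sim_test, center = pca_train$center, scale = pca_train$scale)
by_hand  <- test_std %*% pca_train$rotation

all.equal(unname(test_scores), unname(by_hand))
dim(test_scores)                                 # one row per held-out sample
```

With caret, the analogous step would presumably be `predict(trans, absorp_test)`, reusing the `trans` object fitted on the training spectra.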