31 oct 2021
3 GSIF course: Structures and functions for soil point data
Modelling complex spectral data (soil) with the resemble package (VIII)
This is post number 8 about the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux), where we try to understand how the resemble package works and to use its functions to analyze complex products such as soil.
We have seen how to calculate the dissimilarity matrix for a sample set in the orthogonal space using the Mahalanobis distance, but there are other methods for calculating dissimilarities.
The simplest one is probably the correlation method, where we calculate the correlation between every sample (spectrum) of a set and all the other spectra in it, for example in a training set. We can also calculate the correlation of every sample of the test set vs. the samples of the training set, or even of a new unknown spectrum vs. all the spectra of the training set. This way we can define a threshold and select the samples above a certain correlation value to do something special with them (for example, develop a quantitative model).
The vignette shows the code to calculate the dissimilarity matrix for the training set:
cd_tr <- dissimilarity(Xr = training$spc_p, diss_method = "cor")
dim(cd_tr$dissimilarity)
cd_tr$dissimilarity
cd_mw <- dissimilarity(Xr = training$spc_p,
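To make the idea concrete, here is a minimal base R sketch of a correlation-based dissimilarity. It does not call resemble: the toy matrices X_train and X_new, the helper cor_diss() and the 0.05 threshold are made up for illustration, and the scaling d = (1 - r) / 2 is an assumption about how correlation is turned into a dissimilarity, so check the package documentation before relying on it.

```r
set.seed(1)
# Toy "spectra": 6 training samples and 2 new samples, 50 wavelengths each
base_curve <- sin(seq(0, pi, length.out = 50))
X_train <- t(replicate(6, base_curve + rnorm(50, sd = 0.05)))
X_new   <- t(replicate(2, base_curve + rnorm(50, sd = 0.05)))

# cor() works column-wise, so we transpose to correlate samples (rows).
# Assumed scaling: r = 1 gives d = 0, r = -1 gives d = 1.
cor_diss <- function(A, B = A) (1 - cor(t(A), t(B))) / 2

d_tr  <- cor_diss(X_train)          # 6 x 6, training vs. training
d_new <- cor_diss(X_train, X_new)   # 6 x 2, training vs. new samples

dim(d_tr)

# Training samples whose dissimilarity to the first new sample is below
# a hypothetical threshold
selected <- which(d_new[, 1] < 0.05)
```

The same helper covers the three cases mentioned in the text: training vs. training, test vs. training, and one unknown spectrum vs. training.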
29 oct 2021
Webinar - A Comparison of VNIR and MIR Spectroscopy
24 oct 2021
Modelling complex spectral data (soil) with the resemble package (VII)
This is post number 7 about the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux), where we try to understand how the resemble package works and to use its functions to analyze complex products such as soil.
We continue from post 6, where we saw how the dissimilarity matrix is calculated in the principal component space (orthogonal space), how we can project new samples in that space to get their scores, and how to calculate the dissimilarity matrix between the training sample set and the test sample set.
Now the idea is to select, for every sample of the test set, a certain number of training samples that are its nearest neighbors; the number of neighbors is defined by "knn". These selected training samples are set apart and a new principal component space is calculated from them, which gives a new dissimilarity matrix with new distance values. This new dissimilarity matrix has NA values for the training samples that are not selected.
This is the part of the vignette called "Combine k-nearest neighbors and dissimilarity measures in the orthogonal space", where several examples with a "knn" value of 200 are developed using different PCA methods, so you can practice.
For the first test set sample, the histogram of the distances between the sample itself and the training samples is:
We have chosen a test sample that is quite far from the majority of the training samples, but you can try other test set samples and you will get different distributions.
Setting apart the 200 closest samples and developing a new PCA, the NH distances from this sample to them are:
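The two-step procedure (global PCA, neighbor selection, local PCA) can be sketched in base R as follows. This is only an illustration with random toy data and a hypothetical helper pc_dist(); it uses prcomp() instead of the resemble functions, a fixed number of components, and k = 20 instead of 200.

```r
set.seed(42)
n_train <- 100; n_wav <- 30; k <- 20; n_pc <- 3
X_train <- matrix(rnorm(n_train * n_wav), n_train, n_wav)
x_test  <- matrix(rnorm(n_wav), 1, n_wav)

# Distance from one sample to every row of X in a PCA score space.
# Each score is divided by its standard deviation (Mahalanobis-type distance).
pc_dist <- function(X, x, n_pc) {
  pca <- prcomp(X, center = TRUE, scale. = FALSE)
  s_train <- pca$x[, 1:n_pc, drop = FALSE]
  s_test  <- predict(pca, newdata = x)[, 1:n_pc, drop = FALSE]
  diffs <- sweep(s_train, 2, s_test[1, ])          # subtract the test scores
  sqrt(rowSums(sweep(diffs, 2, pca$sdev[1:n_pc], "/")^2))
}

# Step 1: distances in the PCA space of the whole training set
d_global <- pc_dist(X_train, x_test, n_pc)
nn <- order(d_global)[1:k]                         # the k nearest neighbors

# Step 2: new PCA with the neighbors only; the other samples keep NA
d_local <- rep(NA_real_, n_train)
d_local[nn] <- pc_dist(X_train[nn, , drop = FALSE], x_test, n_pc)

sum(!is.na(d_local))
```

Calling hist(d_global) and hist(d_local[nn]) would give histograms of the kind described above, before and after the neighbor selection.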
20 oct 2021
GLOSOLAN Soil Spectroscopy Webinar #1
19 oct 2021
Modelling complex spectral data (soil) with the resemble package (VI)
Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)
Imagine that just two principal components were enough to explain 99% of the variance, so that we could see all the samples in a single plane. It is then easy to calculate the distance between a sample and all the rest, just by drawing lines and measuring their length. After that we can write the values in a matrix whose diagonal is zero (the distance between a sample and itself). In this case it is the training set, so we have 618 samples (618 dots) and the matrix has 618 rows and 618 columns (618x618).
We can see cases where the samples are very close (blue circles), so their neighbor distance is very small, and we can expect (as we saw in a previous post) that their constituent values are very similar.
If we need more components to explain the variance (11, as we saw in the previous post), the dimension of the matrix is the same (618x618), but the distances are no longer measured in a plane but in a multidimensional space.
This matrix is called the "dissimilarity matrix" in the vignette, and it is of great importance in the development of calculations and algorithms.
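A small base R sketch shows the properties described above (square matrix, zero diagonal). The score matrix is random toy data with 5 samples instead of 618, and plain Euclidean distance is used for simplicity, which is an assumption and not necessarily the distance the vignette uses.

```r
set.seed(7)
# Toy score matrix: 5 samples described by 2 principal components
scores <- matrix(rnorm(10), nrow = 5, ncol = 2,
                 dimnames = list(paste0("s", 1:5), c("PC1", "PC2")))

D <- as.matrix(dist(scores))   # Euclidean distances between all pairs

dim(D)          # as many rows and columns as samples
diag(D)         # distance of every sample to itself
isSymmetric(D)  # d(i, j) equals d(j, i)

# Closest neighbor of each sample, ignoring the sample itself
D_no_self <- D
diag(D_no_self) <- Inf
nearest <- apply(D_no_self, 1, which.min)
```

With more components only the number of columns of the score matrix changes; the dissimilarity matrix keeps the same dimensions.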
18 oct 2021
17 oct 2021
Modelling complex spectral data (soil) with the resemble package (V)
Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)
Continuing from the last post: once we have calculated the PC terms or components (11 in the case of the last PCA analysis using the method OPC), we define planes from the combinations of two of those terms (for example: PC1-PC2, PC2-PC3, PC1-PC3,…), and the training spectra are projected onto them to get the scores of every spectrum for each PC component. All those scores are kept in a score matrix "T". All the projections form a cloud that, in the case of just two terms, would be a 2D cloud, which makes it easy to interpret the distances between every sample and the mean or its neighbors. With more dimensions it is a multivariate cloud, which makes visual inspection more difficult, so we have to check the projections individually in 2D planes or 3D spaces.
Algorithms like the Mahalanobis distance to the mean or to the neighbors help us to check whether a sample may be an outlier, whether it has very close neighbors (so it is already represented by samples that are, in theory, similar), or whether it has no close neighbors and is a good sample to improve the structure of the database and make it more robust.
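As an illustration of the distance to the mean, here is a minimal base R sketch with a random toy score matrix in which we plant one obvious outlier. The cutoff based on the chi-square distribution is a common heuristic that we assume here; it is not taken from the vignette.

```r
set.seed(3)
scores <- matrix(rnorm(200 * 3), ncol = 3)   # toy score matrix T: 200 samples, 3 PCs
scores[1, ] <- c(6, -6, 6)                   # plant an obvious outlier

# mahalanobis() returns SQUARED distances to the given center
d2 <- mahalanobis(scores, center = colMeans(scores), cov = cov(scores))
d  <- sqrt(d2)

# Hypothetical cutoff: 99th percentile of a chi-square with as many
# degrees of freedom as components (an assumption, adjust to your case)
cutoff <- sqrt(qchisq(0.99, df = ncol(scores)))
outliers <- which(d > cutoff)
```

The same distances, computed between a sample and its neighbors instead of the mean, are the ones stored in the dissimilarity matrix.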
Let's look at one of those score planes for the PCA from the previous post, the one formed by the PC1 and PC2 terms:
plot(pca_tr_opc$scores[,1],pca_tr_opc$scores[,2],ylim = c(min(pca_tr_opc$scores[,2]),
We can project the testing data on the same plane, getting the scores of the samples:
pca_projected <- predict(pca_tr_opc, newdata = testing$spc_p)
par(new=TRUE)
plot(pca_projected[,1],pca_projected[,2], col = "red",ylim = c(min(pca_tr_opc$scores[,2]),
xlab=" ", ylab=" ")
plot_ly(T_training, x=~T_training[,1], y=~T_training[,2],
z=~T_training[,3], alpha = 0.7)
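What the projection of new samples does can be checked by hand in base R. The sketch below uses prcomp() and random toy matrices (X_train and X_test are hypothetical names), so it is not the resemble implementation, only the same idea: center the new spectra with the training means and multiply by the loadings.

```r
set.seed(11)
X_train <- matrix(rnorm(50 * 8), 50, 8)   # toy training "spectra"
X_test  <- matrix(rnorm(10 * 8), 10, 8)   # toy test "spectra"

pca <- prcomp(X_train, center = TRUE, scale. = FALSE)

# Scores of the test samples in the PCA space of the training set
scores_test <- predict(pca, newdata = X_test)

# Same projection done by hand: center with the TRAINING means, then rotate
manual <- sweep(X_test, 2, pca$center) %*% pca$rotation

all.equal(unname(scores_test), unname(manual))
```

To overlay both sets in a score plane, adding the test scores with points(scores_test[, 1], scores_test[, 2], col = "red") to an existing plot keeps a single set of axes, which avoids having to repeat the axis limits in two plot() calls.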
14 oct 2021
Modelling complex spectral data (soil) with the resemble package (IV)
Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)
Now we will use the PCA with the method "opc" in order to find the optimal number of components. Its rationale is that if two spectra are close in the X space (near neighbors), their constituent values should be close as well, so the optimal number of components is the one that minimizes the RMSD (root mean square difference) between the constituent value of every sample and that of its nearest neighbor.
For more details, see the paper by the developers of this algorithm: L. Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. (2013).
# "opc" selection with a maximum of 40 components (see the text below)
optimal_sel <- list(method = "opc", value = 40)
pca_tr_opc <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pca",
pc_selection = optimal_sel)
pca_tr_opc # to obtain details of the PCA calculations.
We specify a maximum value of 40, and the "opc" method estimates that 11 is the best option. If we plot it, we can see the reason graphically:
The vignette shows some interesting code that, if you run it, gives the XY plot of the reference Ciso value of every spectrum vs. the reference Ciso value of its closest neighbor. We can see a high correlation, which is really the idea behind the "opc" method.
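The rationale of "opc" can be sketched in a few lines of base R. This is a simplified illustration with simulated data, not the resemble implementation: for each candidate number of components we find the nearest neighbor of every sample in the score space and compute the RMSD between the response of the sample and the response of its neighbor. All names (X, y, max_pc, rmsd_by_pc) are made up for the example.

```r
set.seed(5)
n <- 120; p <- 20
# Toy data: the response depends on two latent directions of X
latent <- matrix(rnorm(n * 2), n, 2)
X <- latent %*% matrix(rnorm(2 * p), 2, p) + matrix(rnorm(n * p, sd = 0.3), n, p)
y <- latent[, 1] + 0.5 * latent[, 2] + rnorm(n, sd = 0.1)

pca <- prcomp(X, center = TRUE, scale. = FALSE)
max_pc <- 10   # maximum number of components to test

rmsd_by_pc <- sapply(seq_len(max_pc), function(k) {
  # Standardize the first k scores so that Euclidean distance behaves
  # like a Mahalanobis distance in the score space
  s <- scale(pca$x[, 1:k, drop = FALSE])
  D <- as.matrix(dist(s))
  diag(D) <- Inf                  # a sample cannot be its own neighbor
  nn <- apply(D, 1, which.min)    # nearest neighbor of each sample
  sqrt(mean((y - y[nn])^2))       # RMSD between y and the neighbor's y
})

best_k <- which.min(rmsd_by_pc)   # number of components with the lowest RMSD
```

Plotting rmsd_by_pc against the number of components gives a curve of the kind used to justify the selection, and plotting y against the y of the nearest neighbor for best_k gives an XY plot like the one described above.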
13 oct 2021
Modelling complex spectral data (soil) with the resemble package (III)
Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)
(from previous post) These 825 samples are divided into two sets, one for training (value 1 in the variable train) and another for testing (value 0).
NIRsoil %>% count(train)
train n
0 207
1 618
Let's create these two data frames:
training <- NIRsoil[NIRsoil$train == 1, ]
testing <- NIRsoil[NIRsoil$train == 0, ]
11 oct 2021
Modelling complex spectral data (soil) with the resemble package (II)
Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux), now it is time to look at the predictor variables, which are the spectra of the soil samples acquired with a NIR (Near Infrared Reflectance) instrument in the range from 1100 to 2498 nm in 2 nm steps, so we have 700 data points. We prepare a vector with the wavelengths and we call it "wavs" (same as the vignette).
wavs<-NIRsoil$spc %>% colnames() %>% as.numeric()
Now we can look at the raw spectra (spectra without any treatment):
matplot(x = wavs, y = t(NIRsoil$spc),ylab = "Absorbance", type = "l",
Let's create a new vector with the wavelengths that remain after the wavelength reduction:
new_wavs <- as.matrix(as.numeric(colnames(NIRsoil$spc_p)))
and plot the spectra to see their appearance:
matplot(x = new_wavs, y = t(NIRsoil$spc_p),xlab = "Wavelengths, nm",
ylab = "1st derivative",
type = "l", lty = 1, col = "#5177A133")
Now the data frame "NIRsoil" contains two spectra matrices: the raw spectra (spc) and the spectra after wavelength reduction and math treatment with the SG first derivative (spc_p).
We can check the names of the variables and the dimensions of these matrices:
names(NIRsoil)
"Nt" "Ciso" "CEC" "train" "spc" "spc_p"
dim(NIRsoil$spc)
825 700
dim(NIRsoil$spc_p)
825 276
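The change from 700 to 276 data points can be reproduced with a base R sketch on one toy spectrum. The 5 nm grid and the 5-point window are assumptions on our side, chosen because they give the 276 columns reported above; the post does not show the actual preprocessing call, and the helper sg_d1() and the linear interpolation are only stand-ins for the proper resampling and Savitzky-Golay functions.

```r
wavs <- seq(1100, 2498, by = 2)              # 700 original wavelengths
new_wavs <- seq(1100, 2498, by = 5)          # assumed 5 nm grid: 1100, 1105, ..., 2495
spectrum <- exp(-((wavs - 1900) / 150)^2)    # toy absorbance band

# Resample by linear interpolation (stand-in for a proper resampling function)
resampled <- approx(x = wavs, y = spectrum, xout = new_wavs)$y

# Savitzky-Golay first derivative, 5-point window, first-order polynomial:
# the least-squares slope over the window, expressed per index step
sg_d1 <- function(x) {
  k <- -2:2
  # stats::filter() applies the coefficients in reverse order, hence rev()
  as.numeric(stats::filter(x, filter = rev(k) / sum(k^2), sides = 2))
}

d1 <- sg_d1(resampled)
d1 <- d1[!is.na(d1)]                         # the filter loses 2 points at each end

c(length(wavs), length(new_wavs), length(d1))
```

So the reduction would come from two sources: a coarser wavelength grid and the points lost at both ends of the derivative window.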
In the next post we will continue with the preprocessing and preparation of the data as the vignette suggests, trying to understand the different procedures to model the soil spectral data as well as possible.
7 oct 2021
Modelling complex spectral data (soil) with the resemble package (I)
Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)
For some time I have been interested in the use of NIR spectra to develop models, and following this tutorial can help me to better understand how to apply several types of regression and to see their performance on a complex matrix like soil.
Let's create a new R Markdown file (.Rmd), and load the libraries that we will use:
library(tidyverse)
library(resemble)
library(prospectr)
library(magrittr)
data("NIRsoil")
NIRsoil$Nt %>%
summary()
NIRsoil%>%
ggplot(aes(Nt)) +
geom_histogram()
It is important to check how the response variables correlate with each other:
response<- NIRsoil[ , 1:3] %>%
drop_na()
library(corrplot) # provides corrplot()
corrplot(cor(response), method = "number")
response %>%
ggplot(aes(x = Ciso , y = Nt)) +
geom_point()