R & Chemometrics: Modelling complex spectral data (soil) with the resemble package (VIII)

31 oct 2021

Modelling complex spectral data (soil) with the resemble package (VIII)

This is post number 8 on the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux), in which we try to understand how the resemble package works and how to use its functions to analyze complex products such as soil.

We have seen how to calculate the dissimilarity matrix for a sample set in the orthogonal space using the Mahalanobis distance, but there are other calculation methods for dissimilarities.

The simplest one is probably the correlation method, where we calculate the correlation between every sample (spectrum) of a sample set, for example a training set, and all the other spectra in that set. We can also calculate the correlation of every sample of the test set against the samples of the training set, or even of a new unknown spectrum against all the spectra of the training set. This way we can define a threshold and select the samples above a certain correlation value to do something specific with them (for example, build a quantitative model).
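As a rough illustration of the threshold idea, here is a minimal base R sketch. It does not use resemble or the vignette's soil data: the spectra are simulated, and the names `training_spc`, `unknown_spc` and the threshold of 0.7 are hypothetical choices made only for this example.

```r
# Simulated spectra (hypothetical data, not the vignette's soil set)
set.seed(1)
n_train  <- 20   # number of training spectra
n_points <- 50   # data points per spectrum
base_curve   <- sin(seq(0, pi, length.out = n_points))
training_spc <- t(replicate(n_train, base_curve + rnorm(n_points, sd = 0.3)))
unknown_spc  <- base_curve + rnorm(n_points, sd = 0.3)

# Pearson correlation of the unknown spectrum with every training spectrum
r <- as.vector(cor(unknown_spc, t(training_spc)))

# Keep the training samples above the (hypothetical) threshold
threshold <- 0.7
selected  <- which(r > threshold)
selected
```

The rows of `training_spc` indexed by `selected` would then be the candidates for a local model around the unknown sample.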

The vignette shows the code to calculate the dissimilarity matrix for the training set:

cd_tr <- dissimilarity(Xr = training$spc_p, diss_method = "cor")
dim(cd_tr$dissimilarity)
cd_tr$dissimilarity

As in the case of the Mahalanobis distance, the matrix is square (one row and one column per training sample) and symmetric, with zeros on the diagonal. We can check the distribution of the correlation dissimilarities between any sample (in this case the first one) and the rest in a histogram:

hist(cd_tr$dissimilarity[,1], breaks=50)
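The properties of this matrix can be checked on a small simulated example. The sketch below uses base R only and assumes that the correlation dissimilarity is defined as (1 - r) / 2, where r is the Pearson correlation; that definition is my reading of the package documentation, so treat it as an assumption. The matrix `spc` is hypothetical random data.

```r
set.seed(42)
spc <- matrix(rnorm(10 * 30), nrow = 10)   # 10 hypothetical spectra, 30 points each

# Assumed definition of the correlation dissimilarity: (1 - r) / 2
cd <- (1 - cor(t(spc))) / 2

dim(cd)                      # one row and one column per sample
isSymmetric(cd)              # d(i, j) equals d(j, i)
all(abs(diag(cd)) < 1e-12)   # each sample has zero dissimilarity to itself
range(cd)                    # values stay between 0 and 1
```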

We can compute the same dissimilarities, and the same histogram, for the test set against the training set:

cd_tr_ts <- dissimilarity(Xr = training$spc_p,
                          Xu = testing$spc_p,
                          diss_method = "cor")
dim(cd_tr_ts$dissimilarity)
hist(cd_tr_ts$dissimilarity[,1], breaks=50)

Another correlation method uses a moving window of a certain size (a certain number of data points): for every pair of samples we get several local correlations, one per window (the total number of data points divided by the size of the window), and we take their average.
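To make the moving-window idea concrete, here is a minimal base R sketch. The function `mw_cor_diss()` and its `step` argument are hypothetical helpers written for this post, not part of resemble. With `step = ws` the windows do not overlap, which matches the count described above; with `step = 1` the window slides one data point at a time. Which scheme the package uses internally is not verified here, and the (1 - r) / 2 scaling is again an assumption.

```r
# Hypothetical helper: averaged local correlation dissimilarity of two spectra
mw_cor_diss <- function(x, y, ws, step = ws) {
  p <- length(x)
  starts <- seq(1, p - ws + 1, by = step)   # first index of each window
  local_r <- vapply(starts, function(k) {
    idx <- k:(k + ws - 1)
    cor(x[idx], y[idx])                     # correlation inside one window
  }, numeric(1))
  mean((1 - local_r) / 2)                   # average over all windows
}

# Two simulated spectra (hypothetical data)
set.seed(7)
x <- cumsum(rnorm(100))
y <- x + rnorm(100, sd = 0.5)

mw_cor_diss(x, y, ws = 19)             # non-overlapping windows
mw_cor_diss(x, y, ws = 19, step = 1)   # sliding window
(1 - cor(x, y)) / 2                    # full-spectrum value for comparison
```

Local correlations are sensitive to differences in shape within narrow regions, so the windowed value can differ noticeably from the full-spectrum one.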

For the first sample of the test set, a scatter plot shows which method finds more highly correlated samples in the training set:

# Moving-window correlation dissimilarity (window of 19 data points)
cd_mw <- dissimilarity(Xr = training$spc_p,
                       Xu = testing$spc_p,
                       diss_method = "cor",
                       ws = 19)
#cd_mw$dissimilarity
hist(cd_mw$dissimilarity[,1], breaks=50)
# First test sample vs. all training samples:
# moving window in blue, full-spectrum correlation in red
plot(cd_mw$dissimilarity[,1], ylim = c(0,1),
     ylab = "corr. diss.", col = "blue")
points(cd_tr_ts$dissimilarity[,1], col = "red")
legend(x = "right", col = c("blue", "red"),
       pch = 1, legend = c("window corr.", "corr."))

As we can see, the moving-window method finds higher correlations between the first sample of the test set and the training samples, but this is not the case for every test sample. See the scatter plot for the 70th test sample:

There are fewer samples with high correlation and quite a lot with almost no correlation.
