R & Chemometrics: Resamble

Mostrando entradas con la etiqueta Resamble. Mostrar todas las entradas

31 oct 2021

Modelling complex spectral data (soil) with the resemble package (VIII)

This is the post number 8 about the vignette " Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux) ", where we try to understand the way Resemble package works, and to use their functions to analyze complex products as the case of soil.

We have seen how to calculate the dissimilarity matrix for a sample set in the orthogonal space using the Mahalanobis distance, but there are other calculation methods for dissimilarities.

The simplest one can be the correlation method, where we calculate the correlation between every sample (spectrum) of a sample set and all the rest (spectra), for example of a training set, but we can calculate, as well, the correlation of every sample of the test set vs. the samples of the training set, or even of a new unknown sample spectrum vs all the spectra of the training set. This way we can define a threshold and select samples over a certain correlation value to do something special with them (as for example a quantitative model).

The vignette show the code to calculate the dissimilarity matrix for the training set:

cd_tr <- dissimilarity(Xr = training$spc_p, diss_method = "cor")
dim(cd_tr$dissimilarity)
cd_tr$dissimilarity

As in the case oh the Mahalanobis distance, the matrix has the same size, so it is square and diagonal. We can check the distribution of correlations between any sample (in this case the first one) and the rest in a histogram:

hist(cd_tr$dissimilarity[,1], breaks=50)

We can do the same for the test set:

cd_tr_ts <- dissimilarity(Xr = training$spc_p,

Xu = testing$spc_p,

diss_method = "cor")

dim(cd_tr_ts$dissimilarity)

hist(cd_tr_ts$dissimilarity[,1], breaks=50)

Other correlation method to calculate is with a moving window of a certain size (a certain number of data points), so for every sample we have several correlations (total number of data points divided by the size of the window) and calculate the average.

In the case of the first sample of the test set, we can see in a scatter plot which method find higher correlated samples from the training set:

cd_mw <- dissimilarity(Xr = training$spc_p,

Xu = testing$spc_p,

diss_method = "cor",

ws = 19)

#cd_mw$dissimilarity

hist(cd_mw$dissimilarity[,1], breaks=50)

plot(cd_mw$dissimilarity[,1], ylim = c(0,1),

ylab = "corr. diss.", col = "blue")

par(new=TRUE)

plot(cd_tr_ts$dissimilarity[,1],ylim = c(0,1),

ylab = " ", col = "red")

legend(x = "right", col=c("blue", "red"),

pch =20, legend = c("corr.", "window corr."))

As we can see the window size find higher correlations between the first sample of the test set and the training samples, but it is different for the rest. See the scatter plot for the 70th test sample:

There are less samples with high correlation and quite a lot with almost no correlation.

24 oct 2021

Modelling complex spectral data (soil) with the resemble package (VII)

This is the number 7 of the posts about the vignette " Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux) ", where we try to understand the way Resemble package works, and to use their functions to analyze complex products as the case of soil.

We continue from post 6, were we saw how the dissimilarity matrix is calculated in the principal component space (orthogonal space), and how in that space we can project new samples to get their scores, and calculate the dissimilarity matrix between the training sample set and the test sample set.

Now the idea is to select for every sample, from the test set, a certain number of training samples which are neighbors of that sample defining how many neighbors to choose (by "knn"). These selected training samples are taken apart and a new principal component space is calculated, calculating a new dissimilarity matrix with new values for the distances . In this new dissimilarity matrix will have NA values for the training samples which are not selected.

This is the part of the vignette called: "Combine k-nearest neighbors and dissimilarity measures in the orthogonal space" where, different examples choosing a "knn" value of 200, are developed using different PCA methods, so you can practice.

In the case of the first test set sample the histogram of the neighbors distances between of the sample itself and the rest of the training samples is:

we have chosen one test sample which is quite apart from the majority of the training samples , but you can try with other test set samples and you get different distributions.

Taking apart the 200 most closer samples and developing a new PCA, the NH distances of this sample to the rest is:

so we have developed a local PCA space with the closer 200 training samples to project the first sample of the test set getting new NH values. We can do this as well with PLS choosing the constituent of interest.

19 oct 2021

Modelling complex spectral data (soil) with the resemble package (VI)

Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)

Imagine that just two Principal Components would be enough to explain 99% of the variance, so we can see the samples in a unique plane. It is easy to calculate the distance between a sample and all the rest, just drawing lines and calculating their distance. After that we can write their value in a matrix where the diagonal would be cero (distance between the sample and itself). In this case, it is the training set so we have 618 samples (618 dots) and the matrix would be a matrix with 618 rows and 618 columns (618x618).

We can see cases where the samples are very close (blue circles), so their neighbor distance is very small (very low values), and we can consider (we saw as well in previous post) that their constituents’ values would be very similar.

In the case that we have more components to explain the variance (11 as we saw in previous post), the dimension of the matrix would be the same (618x618), but the distances would be not in a plane, if not in a multidimensional space.

This matrix is called "dissimilarity matrix" in the vignette, and has a great importance in the development of calculations and algorithms.

If we project a test set in the PC space, we get the distance between every sample of the test set vs. every sample of the training set. In the case of our test set we have 207 samples, so we would get a 618 x 207 matrix with the distances and we can see if the testing samples are well represented or if they can expand and make more robust the database.

17 oct 2021

Modelling complex spectral data (soil) with the resemble package (V)

Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)

Continuing from last post: Once we have calculated the PC terms or components (11 in the case of the last PCA analysis using the method OPC), we define planes defined by the combinations of two of those terms (for example: PC1-PC2, PC2-PC3, PC1-PC3,…), and the training spectra is projected on the plane to get the scores of every spectrum vs. each PC component. All those scores are kept in a score matrix “T”. All the projections form a cloud that in the case of just two terms would be a 2D cloud, making easy the interpretations of the distances between every sample and the mean or their neighbors. But in the case of more dimensions it is a multivariate cloud, making the visual inspection more difficult, so we have to check the projections individually in 2D planes or 3D planes.

Algorithms like the Mahalanobis distance to the mean or to the neighbors will help us to check if the sample can be an outlier, it has very close neighbors (so it is represented by samples in theory similar), or if the sample has not closer neighbors and is a good sample to improve the structure of the database and make it more robust.

Let´s see in the case of the previous code one of those score planes, the one formed by the PC1 and PC2 terms:

plot(pca_tr_opc$scores[,1],pca_tr_opc$scores[,2],

xlim = c(min(pca_tr_opc$scores[,1]),

max(pca_tr_opc$scores[,1])),
ylim = c(min(pca_tr_opc$scores[,2]),

max(pca_tr_opc$scores[,2])),

xlab="PC1 ", ylab="PC2 ")

We can project the testing data on the same plane, getting the scores of the samples:

pca_projected <- predict(pca_tr_opc, newdata = testing$spc_p)

par(new=TRUE)

plot(pca_projected[,1],pca_projected[,2], col = "red",

xlim = c(min(pca_tr_opc$scores[,1]),

max(pca_tr_opc$scores[,1])),
ylim = c(min(pca_tr_opc$scores[,2]),

max(pca_tr_opc$scores[,2])),
xlab=" ", ylab=" ")

If we had only two PCs, this plane would be enough to show us the cloud, but we have 11 PCs. We can add graphically one more dimension (3D) and we see the cloud more clearly.

library(plotly)

T_training <- as.data.frame(pca_tr_opc$scores)
plot_ly(T_training, x=~T_training[,1], y=~T_training[,2],
z=~T_training[,3], alpha = 0.7)

Practice yourself and rotate the plot to different angles.

13 oct 2021

Modelling complex spectral data (soil) with the resemble package (III)

Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)

(from previous post) - These 825 samples are divided in two sets, one for training (value equal 1 in the variable train) and another for testing (value = 0).

NIRsoil %>%
   count(train)
train n
    0        207
    1        618

Let´s create these two dataframes:

training<- NIRsoil[NIRsoil$train == 1, ]
testing<- NIRsoil[NIRsoil$train == 0, ]

Now the vignette explain how to proceed with the dimensionality reduction and the methods we can use for that purpose: pca, pca_nipals and pls.

You can practice each one (as in the vignette code) using the default values for the explained variance for every component (by default > 1%) or you can fit the cumulative variance you want.

PCA method use the Singular Value Decomposition algorithm, PCA_NIPALS, the NIPALS algorithm and PLS use (apart of the predictor variables (absorbances)) the response variables , maximizing the covariance of latent variables from predictors with the response (we have to specify which response variable to use).

Just follow the vignette examples for these three methods, and check how many components are recommended on each one.

For a more advanced use of the methods we can configure the "pc_selection" method, generating a list with the method ("var", "cumvar", "manual" or "opc"), and the value of the variance each component must explain ("var" case). In the case of "cumvar" we set the value to the total cumulative variance we want to explain, and we will see in more detail the case of the "opc" in another post.

Take into account that we will create the ortho-projections with the training set, and after, we can project new data (testing set) on the planes created with the training set, to get the scores of this new testing data.

11 oct 2021

Modelling complex spectral data (soil) with the resemble package (II)

Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux), now it is time to see the predictor variables which are reflectance values of the soil samples acquired in a NIR (Near Infrared Reflectance) instrument in the range from 1100 to 2498 nm in two nm steps, so we have 700 data points. We prepare a vector with the wavelengths and we call it "wav" (same as the vignette).

wavs<-NIRsoil$spc %>% colnames() %>% as.numeric()

Now we can se to the raw spectra (spectra without any treatment):

matplot(x = wavs, y = t(NIRsoil$spc),

xlab = "Wavelengths, nm",
ylab = "Absorbance", type = "l",

lty = 1, col = "#5177A133")

Now it is time to treat the spectra. In the vignette, the spectra is reduced in the number of data points and treated with a first derivative (Savitzky Gollay) using a first polynomial order and a window of 5. This reduction and signal improvement techniques are very useful to prepare the spectra "X" matrix for new reduction techniques after, saving this way computation time.

We just have to use the code of the vignette, but of course we are free to use other mathematical treatments to reduce the scatter effects or improve the resolution.

NIRsoil$spc_p <- NIRsoil$spc %>%

#we make a reduction of the number of data points

resample(wav = wavs, new.wav = seq(min(wavs),

max(wavs), by = 5)) %>%

#and apply to the spectra the Savitzky Golay function

#polynomial order =1

#window = 5

#first derivative

savitzkyGolay(p = 1, w = 5, m = 1)

Let´s create a new vector considering the wavelength reduction:

new_wavs <- as.matrix(as.numeric(colnames(NIRsoil$spc_p)))

and plot the spectra to see their appearance:

matplot(x = new_wavs, y = t(NIRsoil$spc_p),
xlab = "Wavelengths, nm",
ylab = "1st derivative",
type = "l", lty = 1, col = "#5177A133")

Now in the data frame "NIRsoil" we have two spectra matrices, the raw spectra (spc) and the spectra reduced and math treated with the SG first derivative (spc_p).

We can check the dimensions of these matrices:

names(NIRsoil)
   "Nt" "Ciso" "CEC" "train" "spc" "spc_p"
dim(NIRsoil$spc)
   825 700
dim(NIRsoil$spc_p)
   825 276

In the next post we will continue the preprocessing process and preparation of the data as the vignette suggest, trying to understand the different procedures to model, as better as possible, the soil spectral data.

7 oct 2021

Modelling complex spectral data (soil) with the resemble package (I)

It is time for a new tutorial working with R, and with one of their packages: "Resemble". We will use soil NIR spectra and we will follow the explanations given by the authors in this vignette:

Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)

From certain time I am interested in the use of NIR spectra to develop models, and to follow this tutorial can help me to understand better how to apply several type of regressions and to see the performance for this complex matrix like soil.

Let´s create a new R Markdown file (.Rmd), and load the three libraries that we will use:

library(tidyverse)
library(resemble)
library(prospectr)
library(magrittr)

Let´s load the data "NIRsoil", and we can see that its class is "dataframe". This data frame combines the spectral matrix (NIRsoil$spec) which contains the predictor variables for every spectrum, with the responses variables (Nt, Ciso and CEC) and another variable which specify if the spectra is used in the training set or the test set. You can read at the vignette details from the authors about what each of these response variable represents and their units.

data("NIRsoil")

Run some code to have details about the number of samples available, the range for each response variable, how many "NA" you have, the wavelength range of the NIR spectrophotometer used,...., and other info that you consider useful before to go into other steps. For example to see the distribution of the Nitrogen:

NIRsoil$Nt %>%
summary()
NIRsoil%>%
ggplot(aes(Nt)) +
geom_histogram()

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

0.200 1.100 1.300 1.766 2.000 8.800 180

Do the same for the other constituents, changing the "Nt" response variable for the other variables "Ciso" and "CEC".

It is important to check how the response variables correlate between them:

response<- NIRsoil[ , 1:3] %>%
drop_na()
corrplot(cor(response), method = "number")

As we can see there is a high correlation between the parameters.

We can check it in a XY plot for thr "Ciso" and "Nt":

response %>%
ggplot(aes(x = Ciso , y = Nt)) +
geom_point()

This can be a good starting point and we will continue on a new coming post.