29 Dec 2021

Getting accurate predictions from large and diverse spectral libraries

 


This is a webinar (organized by Soil.Spectroscopy.Org) that I joined some time ago, and now that I am working with the LUCAS database I have watched it again.

In my case I have set apart the Spanish soils from the LUCAS database and I am working with them, finding very interesting conclusions that I will share with you throughout 2022.

Leonardo Ramirez-Lopez is a well-known R chemometrics package developer, and he presents a nice simulation on how to work with the LUCAS database (where a large number of samples from all over the European Union have been analysed using a single NIR instrument in a single laboratory), adding noise to the predictors (spectra) and to the responses (reference values), which is the situation we will find in most real cases.
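The spirit of that simulation is easy to sketch in R. In this toy example the object names (spc, y) and the noise levels are hypothetical stand-ins, not the webinar's actual code:

set.seed(1)
# add random noise to the predictors (a spectra matrix 'spc')
spc_noisy <- spc + matrix(rnorm(length(spc), sd = 0.005), nrow = nrow(spc))
# add random noise to the responses (a vector 'y' of reference values)
y_noisy <- y + rnorm(length(y), sd = 0.1 * sd(y, na.rm = TRUE))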




24 Dec 2021

Merry Christmas Everybody

Slade was one of the groups I liked long ago, and it is the group I chose this year to wish a Merry Christmas to all the NIR_quimiometria followers. Have a very happy and safe Christmas.

15 Dec 2021

Mahalanobis distance to detect the gypsum samples

With all the soil samples we calculate the principal component scores, and then the Mahalanobis distance from every sample's scores to the mean; we draw a cut-off at 3, which is commonly used to flag as outliers the samples that lie above it.
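A minimal sketch of this calculation, assuming (as in the 14 Dec post) that pcspectra_snvdt is a prcomp object and that we keep the first 5 components (that number is an assumption):

scores <- pcspectra_snvdt$x[, 1:5]
# Mahalanobis distance of every sample's scores to the mean of the scores
md <- sqrt(mahalanobis(scores, colMeans(scores), cov(scores)))
plot(md, xlab = "sample", ylab = "Mahalanobis distance")
abline(h = 3, col = "red", lty = "dotted")   # cut-off to flag outliers
outliers <- which(md > 3)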



We can set those samples apart and look at their spectra to check their characteristics. In this case the spectra show that almost all of them are the gypsum samples, which we could not see clearly when plotted with the rest of the samples due to the overlapping.



We can select the rest of the samples, calculate a new PC analysis and continue studying the characteristics of the samples.
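A hedged continuation, reusing 'outliers' from the sketch above and a hypothetical matrix 'spectra_snvdt' holding the treated spectra:

spectra_clean <- spectra_snvdt[-outliers, ]   # drop the flagged samples
pcspectra_clean <- prcomp(spectra_clean)      # new PC analysis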

14 Dec 2021

A loading related to gypsum

Following the chain of posts about gypsum in the soil samples, we continue by calculating the principal components of the sample set and plotting the first 5 loadings; then we draw vertical lines where the reference gypsum spectrum has its main peaks, and we can see that the 5th loading (light blue) clearly shows the gypsum peaks.

# plot the first 5 loadings (the x axis needs numeric wavelengths)
matplot(as.numeric(rownames(pcspectra_snvdt$rotation)),
        pcspectra_snvdt$rotation[, 1:5],
        type = "l", main = "Loadings spectra",
        ylab = "Absorbance", xlab = "wavelength")

# mark the main gypsum peaks found in the 8 Dec post
abline(v = c(994, 1204, 1445, 1489, 1537, 1748,
             1944, 2215, 2266), col = "green",
       lty = "dotted")



10 Dec 2021

Are there soil samples with gypsum in our sample set?

One way to see it is to overplot the reference gypsum spectrum on our soil sample set (in this case treated with Detrend to remove some of the scatter effects). We cannot see it clearly due to the high number of samples, but it seems clear that there are samples with gypsum in our sample set.

Now we could find ways to check which of our samples have a higher correlation with the gypsum reference spectrum, or look at other metrics such as distances.
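A hedged sketch of the correlation check, assuming both objects share the same wavelength grid on the common 400 to 2500 nm range:

gyp <- as.numeric(mineralRef2)                # gypsum reference spectrum
cors <- apply(lucas_spain_spcdt, 1, cor, y = gyp)
head(sort(cors, decreasing = TRUE))           # samples most similar to gypsum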

Having a reference sample set with gypsum, calcite, kaolinite, iron oxide, etc., is a good way to explore and play with the data, overplotting them on our sample set.

In this case, because the reference gypsum spectrum and the soil sample set were scanned on different NIR instruments, a common range was used (400 to 2500 nm).

# soil spectra in semi-transparent grey (the x axis needs numeric wavelengths)
matplot(x = as.numeric(colnames(lucas_spain_spcdt)), y = t(lucas_spain_spcdt),
        type = "l", col = rgb(red = 0.5, green = 0.5,
                              blue = 0.5, alpha = 0.3), ylim = c(-1.5, 4.0),
        xlab = "wavelength", ylab = "Absorbance")

par(new = TRUE)

# gypsum reference spectrum in red, on the same scale
matplot(wavelength2, mineralRef2, type = "l", col = "red",
        xlab = "", ylab = "", main = " ", ylim = c(-1.5, 4.0))

# vertical dotted lines at the main gypsum peaks
abline(v = c(994, 1204, 1445, 1489, 1537, 1748,
             1944, 2215, 2266), col = "green",
       lty = "dotted")




8 Dec 2021

Identifying gypsum peaks

We identify the peaks in the gypsum spectrum using the function "peaks" from the package "IDPmisc".

You can obtain the gypsum spectrum from data(mineralRef) in the package "soilspec". The spectrum is in reflectance and I converted it to absorbance using the log(1/R) transformation.
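A minimal sketch of that conversion, assuming mineralRef$gypsum is stored as reflectance in the raw data:

library(soilspec)
data(mineralRef)
mineralRef$gypsum <- log10(1 / mineralRef$gypsum)   # log(1/R): R -> absorbance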

wavelength <- seq(350, 2500, by = 1)
matplot(wavelength, mineralRef$gypsum, 
        type = "l", col = "blue", 
        xlab = "wavelength", 
        ylab = "Absorbance", 
        main = "Gypsum" )

With this code we plot the spectrum and see the peaks visually, so we can decide the values for the arguments of the function.

Now let's get the peaks:

ppeaks_gypsum <- peaks(wavelength,
                       mineralRef$gypsum,
                       minPH = 0.03)

> ppeaks_gypsum
      x         y   w
1   350 0.1364889 114
3   994 0.1623408  39
4  1204 0.2884482  67
5  1445 1.0510481 127
8  1489 0.8744469  11
7  1537 0.7541421  14
6  1748 0.7571059  54
9  1944 2.1366552  83
10 2215 1.3251868  43
11 2266 1.2514346  15
2  2488 2.1942896 191

In the "x" column we have the peak wavelengths, in "y" the absorbance values, and in "w" the width at half maximum of the peaks.

Now we add the vertical lines to see the marked peaks:

abline(v = ppeaks_gypsum$x, col = "red")

Just change the minPH value or the other arguments of the peaks function to get more or fewer peaks.



We can exclude the vertical lines at the extreme wavelengths.

We can use this spectrum to compare against our soil samples, checking for any trace of gypsum in them.

30 Nov 2021

Comparing methods with a new sample set (soil)

I developed calibrations for soil texture with the three methods I use (LOCAL, PLS and ANN) with more than 2000 samples, and I have a new set of samples to test them (samples scanned after the calibration development). The idea is to check which one performs best.

First, as usual, I look at the spectra, and I found that some of the new samples seem to have gypsum, so I mark these samples as spectral outliers; but I validate the results anyway, because the idea is to see those samples marked (red cross) in the XY validation plot.


Now, for the Silt parameter, I plot the validation set predictions vs. the reference values:

SILT validation XY plots


The samples which seem to have gypsum are well extrapolated by the PLS calibration, but not by the ANN. The LOCAL calibration did not find similar samples in the database and does not give predictions, except for one of the samples.

Gypsum can be found in the three texture classes, mostly in the clay fraction, but in some areas gypsum silt soils occur, and this could be the case here.

These samples are very good for extending the variability of the database and a good way to check how the calibrations work with extreme samples. When the database has a better distribution, the ANN and LOCAL models will perform better at predicting these samples.



23 Nov 2021

Gypsum soil spectra

NIR soil spectra can have different shapes and bands. Normally, when working with soil spectra, we can identify the bands associated with clay, carbonates, organic matter, etc. This time, looking at a soil spectra database, I found some different shapes that did not interfere with other bands, so they could be easily identified by consulting the bibliography. These bands belong to soils with high gypsum content.

In the figure, high-gypsum soils are merged with other types of soil spectra. See the triple-band shape in the 1400 to 1550 nm area and the peak at 1748 nm, a clear band where the other soils do not absorb.








19 Nov 2021

Modelling complex spectral data (soil) with the resemble package (X)

Let's continue with the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux).

Throughout the tutorial we have seen how to measure the distance of a spectrum, in an orthogonal space, to all the spectra of a certain training sample set. There are different kinds of distances; in the orthogonal space I usually use the Mahalanobis distance, but you can use others, such as the Euclidean. We just have to select the distance method we want when calculating the dissimilarities.
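A hedged sketch of that choice (the diss_method names are the ones documented in resemble; the pc_selection values are my assumptions):

library(resemble)
# Mahalanobis distance in the PC (orthogonal) space
d_mahal <- dissimilarity(Xr = training$spc_p,
                         diss_method = "pca",
                         pc_selection = list("cumvar", 0.99))
# Euclidean distance computed directly on the spectra
d_euclid <- dissimilarity(Xr = training$spc_p,
                          diss_method = "euclid")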

Instead of a distance, we can calculate the correlation "R" of one sample versus all the samples in the training set. For this we use the spectrum (all the wavelengths we select), with or without math treatments. Normally we apply some math treatments to remove the scatter or to increase the resolution of the overlapped bands.

One approach is to select a certain number of samples (for example 100), but this way we select the 100 closest to the new sample, and some of them can be far enough away to be different in composition and not good enough to create a custom calibration that predicts this new sample accurately. Another approach is to select within a range of samples (for example between 100 and 200), where, in addition, each selected sample must meet the requirement of being below a certain distance value (threshold), or above it in the case of correlation. With the selected samples we can develop a regression (PLS) to predict the new sample. If not enough samples are found, we won't get any result.
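These two strategies seem to map onto the k, k_diss and k_range arguments of resemble::mbl (as I read the documentation; the numbers here are purely illustrative):

# fixed number of neighbors (the 100 closest)
m_k <- mbl(Xr = training$spc_p, Yr = training$Ciso, Xu = testing$spc_p,
           k = 100)
# distance threshold plus a minimum and maximum number of neighbors
m_kd <- mbl(Xr = training$spc_p, Yr = training$Ciso, Xu = testing$spc_p,
            k_diss = 0.1, k_range = c(100, 200))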

Of course, this method has some drawbacks; for example, the selected samples may be very similar in composition, so we won't have enough variability to develop the PLS models.

In the case of the distance option, we work in a PLS space where the response variable is taken into account, so different neighbor samples can be chosen for every constituent of the same sample.

The vignette shows this whole process very well, so it is straightforward to follow the code.

There are cases where we want to force some samples to take part in the model, and this action is called "spiking the neighborhoods".
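A hedged sketch of spiking with resemble::mbl (the 'spike' argument is in the package documentation; the indices are hypothetical):

# force training samples 1 to 3 into the neighborhood of every test sample
m_spiked <- mbl(Xr = training$spc_p, Yr = training$Ciso, Xu = testing$spc_p,
                k = 100, spike = 1:3)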



4 Nov 2021

Practicing with soil own data and Resemble (I)

These days I am posting about resemble following the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux), and more posts are still coming, but it is time to check it with my own soil data.

I have imported a soil data set and split it into a training and a test set. I apply the Savitzky-Golay first derivative:
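A hedged sketch of that step, assuming the raw spectra live in training$spc_nir and testing$spc_nir (the Savitzky-Golay settings mirror the vignette):

library(prospectr)
# first derivative, first-order polynomial, window of 5
training$spc_nir_p <- savitzkyGolay(training$spc_nir, m = 1, p = 1, w = 5)
testing$spc_nir_p <- savitzkyGolay(testing$spc_nir, m = 1, p = 1, w = 5)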

Now I run the orthogonal principal component analysis, trying to find the optimal selection of the number of components for the Clay parameter.

optimal_sel <-  list(method = "opc", value = 40)
pca_training_opc <- ortho_projection(Xr = training$spc_nir_p,
                               Yr = training$Clay,
                               method = "pca", 
                               pc_selection = optimal_sel)
pca_training_opc
plot(pca_training_opc, col = "#FF1A00CC")

19 PC terms are chosen which, if you remember, is the number that gives the smallest RMSD between the clay lab value of every sample and the clay lab value of its closest neighbor. The figures show the selection of the 19 terms and the XY plot from which the RMSD is calculated for the training samples.


Finally, I show you the texture triangle for the samples I am using (the whole data set). In a recent post I published a YouTube video from ISRIC showing how to obtain it with R.


2 Nov 2021

Modelling complex spectral data (soil) with the resemble package (IX)

Let's continue with the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux).

As we saw in previous posts, we can create several dissimilarity matrices using different methods, with the idea that when analyzing a sample (acquiring its spectrum) we can search for the sample most similar to it inside a database. If the algorithm finds a very similar (almost equal) sample, there is a high probability that their characteristics (the concentration values of their constituents) will be almost the same. It can also happen that the sample found is similar, but not similar enough; in that case some characteristics may show a certain degree of similarity and others not, so it is necessary to keep filling the training database with more samples, so that in future analyses the probability of finding better matches (with a lower "knn" distance or a higher correlation) increases.

One of the functions of the package resemble is "sim_eval". This function searches for the most similar observation (closest neighbor) of each observation in a given data set, based on a dissimilarity matrix (e.g. a distance matrix). The observations are compared against their corresponding closest observations in terms of the side information provided (constituent values). The root mean square of differences (RMSD) and the correlation coefficient (R) are used for continuous variables, and the kappa index for discrete variables.
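A hedged usage sketch (the dissimilarity object 'pcad' is a hypothetical name for one of the matrices mentioned below):

ev <- sim_eval(d = pcad$dissimilarity, side_info = as.matrix(training$Ciso))
ev   # reports the RMSD and R of the closest-neighbor comparison for Ciso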

The vignette calculates the dissimilarity matrices with all the methods available in resemble and tries to find which one gives the best performance for the "Ciso" parameter (carbon in g/100 g of dry soil). Run the code and you will get the statistics for all of them:



We want to find the method with the lowest RMSD and the highest R. In the previous post I did not use all of them, only "pcad", "cd" and "mcd" (Mahalanobis distance in the orthogonal space, correlation distance and moving-window correlation distance), but of course there are others using PLS, Euclidean distance, cosine, etc.

As we can see, the best of the three I used is the Mahalanobis distance in the orthogonal space, followed by the moving-window correlation and the correlation. But, as you can see, the best choice is the optimal PLS, and that makes sense because the terms we use are more related to the constituent of interest.

Statistics are fine for checking the performance, but graphics also help, and the vignette shows you how to get them:



31 Oct 2021

3 GSIF course: Structures and functions for soil point data

Interesting lecture. When you get predictions from a texture NIR calibration, it is nice to see how the predictions are placed in the texture triangle. I will practice with this and let you know.

Modelling complex spectral data (soil) with the resemble package (VIII)

This is post number 8 about the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux), where we try to understand the way the resemble package works, and to use its functions to analyze complex products such as soil.



We have seen how to calculate the dissimilarity matrix for a sample set in the orthogonal space using the Mahalanobis distance, but there are other calculation methods for dissimilarities.

The simplest may be the correlation method, where we calculate the correlation between every sample (spectrum) of a sample set and all the rest, for example within a training set. We can also calculate the correlation of every sample of the test set vs. the samples of the training set, or even of a new unknown sample spectrum vs. all the spectra of the training set. This way we can define a threshold and select the samples above a certain correlation value to do something specific with them (for example, build a quantitative model).

The vignette show the code to calculate the dissimilarity matrix for the training set:

cd_tr <- dissimilarity(Xr = training$spc_p, diss_method = "cor")
dim(cd_tr$dissimilarity)
cd_tr$dissimilarity

As in the case of the Mahalanobis distance, the matrix has the same size, so it is square and symmetric, with zeros on the diagonal. We can check the distribution of the correlation dissimilarities between any sample (in this case the first one) and the rest in a histogram:

hist(cd_tr$dissimilarity[,1], breaks = 50)

We can do the same for the test set:

cd_tr_ts <- dissimilarity(Xr = training$spc_p,
                          Xu = testing$spc_p,
                          diss_method = "cor")
dim(cd_tr_ts$dissimilarity)
hist(cd_tr_ts$dissimilarity[,1], breaks=50)
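Following up on the threshold idea from above, a hedged sketch: as far as I can see in the resemble documentation, the "cor" dissimilarity is scaled as (1 - correlation) / 2, so lower values mean more similar spectra (the 0.2 threshold is purely illustrative):

sel <- which(cd_tr_ts$dissimilarity[, 1] < 0.2)
length(sel)   # training samples "close enough" to the first test sample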

Another correlation method uses a moving window of a certain size (a certain number of data points), so for every pair of samples we have several correlations (the total number of data points divided by the size of the window) and we calculate their average.
For the first sample of the test set, we can see in a scatter plot which method finds more highly correlated samples in the training set:

cd_mw <- dissimilarity(Xr = training$spc_p,
                       Xu = testing$spc_p,
                       diss_method = "cor",
                       ws = 19)
#cd_mw$dissimilarity
hist(cd_mw$dissimilarity[,1], breaks = 50)

# moving-window correlation dissimilarity (blue) vs. plain correlation (red)
plot(cd_mw$dissimilarity[,1], ylim = c(0, 1),
     ylab = "corr. diss.", col = "blue")
par(new = TRUE)
plot(cd_tr_ts$dissimilarity[,1], ylim = c(0, 1),
     ylab = "", col = "red")
legend(x = "right", col = c("blue", "red"),
       pch = 20, legend = c("window corr.", "corr."))
As we can see, the moving window finds higher correlations between the first sample of the test set and the training samples, but the picture changes for other samples. See the scatter plot for the 70th test sample:
There are fewer samples with high correlation and quite a lot with almost no correlation.






29 Oct 2021

Webinar - A Comparison of VNIR and MIR Spectroscopy

This is an interesting webinar recording, because it gives details about what we can expect when developing calibrations for NIR in soils. Some of the parameters are difficult to calibrate, and we need a good database and chemometric tools to improve the statistics.
We can see graphics and statistics you can use as a guide or reference for your calibrations.

Introduction to Pedometrics and Digital Soil Mapping

24 Oct 2021

Modelling complex spectral data (soil) with the resemble package (VII)

This is post number 7 about the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux), where we try to understand the way the resemble package works, and to use its functions to analyze complex products such as soil.

We continue from post 6, where we saw how the dissimilarity matrix is calculated in the principal component space (orthogonal space), how in that space we can project new samples to get their scores, and how to calculate the dissimilarity matrix between the training sample set and the test sample set.

Now the idea is to select, for every sample of the test set, a certain number of training samples which are its neighbors, defining how many neighbors to choose ("knn"). These selected training samples are set apart and a new principal component space is calculated, producing a new dissimilarity matrix with new values for the distances. This new dissimilarity matrix will have NA values for the training samples which were not selected.

This is the part of the vignette called "Combine k-nearest neighbors and dissimilarity measures in the orthogonal space", where different examples with a "knn" value of 200 are developed using different PCA methods, so you can practice.
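A hedged sketch of that section using resemble::search_neighbors, as I read the package documentation (k = 200 follows the vignette's choice):

nn <- search_neighbors(Xr = training$spc_p,
                       Xu = testing$spc_p,
                       diss_method = "pca",
                       k = 200)
dim(nn$neighbors)   # neighbor indices, one column per test sample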

For the first test set sample, the histogram of the neighbor distances between the sample itself and the training samples is:


We have chosen a test sample which is quite far from the majority of the training samples, but you can try with other test set samples and you will get different distributions.

Setting apart the 200 closest samples and developing a new PCA, the neighborhood (NH) distances of this sample to the rest are:

So we have developed a local PCA space with the 200 closest training samples to project the first sample of the test set, getting new NH values. We can do this as well with PLS, choosing the constituent of interest.


19 Oct 2021

Modelling complex spectral data (soil) with the resemble package (VI)

Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux) 

Imagine that just two principal components were enough to explain 99% of the variance, so we could see the samples in a single plane. It would be easy to calculate the distance between a sample and all the rest, just by drawing lines and measuring their length. After that we can write the values in a matrix where the diagonal is zero (the distance between each sample and itself). In this case it is the training set, so we have 618 samples (618 dots) and the matrix would have 618 rows and 618 columns (618x618).
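A minimal base-R sketch of that idea with two PC scores (reusing the pca_tr_opc projection from the previous posts; dist() does the line drawing for us):

T2 <- pca_tr_opc$scores[, 1:2]   # scores on the first two PCs
D <- as.matrix(dist(T2))         # 618 x 618, zeros on the diagonal
dim(D)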

We can see cases where samples are very close (blue circles), so their neighbor distance is very small (very low values), and we can expect (as we also saw in a previous post) that their constituents' values will be very similar.

If more components are needed to explain the variance (11, as we saw in a previous post), the dimensions of the matrix stay the same (618x618), but the distances are measured not in a plane but in a multidimensional space.

This matrix is called the "dissimilarity matrix" in the vignette, and it has great importance in the development of the calculations and algorithms.

If we project a test set into the PC space, we get the distance between every sample of the test set and every sample of the training set. Our test set has 207 samples, so we would get a 618 x 207 matrix with the distances, and we can see whether the testing samples are well represented or whether they can expand the database and make it more robust.
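A hedged sketch of that projection (the pc_selection value of 11 components is taken from the previous post):

d_tr_ts <- dissimilarity(Xr = training$spc_p,
                         Xu = testing$spc_p,
                         diss_method = "pca",
                         pc_selection = list("manual", 11))
dim(d_tr_ts$dissimilarity)   # 618 x 207: training rows vs. testing columns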

17 Oct 2021

Modelling complex spectral data (soil) with the resemble package (V)

 Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux) 

Continuing from the last post: once we have calculated the PC terms or components (11 in the case of the last PCA analysis, using the OPC method), we can define planes from the combinations of two of those terms (for example PC1-PC2, PC2-PC3, PC1-PC3, ...), and the training spectra are projected on each plane to get the scores of every spectrum on each PC component. All those scores are kept in a score matrix "T". The projections form a cloud which, in the case of just two terms, would be a 2D cloud, making it easy to interpret the distances between every sample and the mean or its neighbors. But with more dimensions it is a multivariate cloud, making visual inspection more difficult, so we have to check the projections individually in 2D or 3D planes.

Metrics like the Mahalanobis distance to the mean or to the neighbors will help us check whether a sample may be an outlier, whether it has very close neighbors (so it is represented by samples that are, in theory, similar), or whether it has no close neighbors and is a good sample to improve the structure of the database and make it more robust.

Let's see, for the previous code, one of those score planes, the one formed by the PC1 and PC2 terms:

plot(pca_tr_opc$scores[,1],pca_tr_opc$scores[,2],
    xlim = c(min(pca_tr_opc$scores[,1]),
    max(pca_tr_opc$scores[,1])),
    ylim = c(min(pca_tr_opc$scores[,2]),
    max(pca_tr_opc$scores[,2])),
    xlab="PC1 ", ylab="PC2 ")

We can project the testing data on the same plane, getting the scores of the samples:

pca_projected <- predict(pca_tr_opc, newdata = testing$spc_p)

par(new=TRUE)

plot(pca_projected[,1],pca_projected[,2], col = "red", 
    xlim = c(min(pca_tr_opc$scores[,1]),
    max(pca_tr_opc$scores[,1])),
    ylim = c(min(pca_tr_opc$scores[,2]),
    max(pca_tr_opc$scores[,2])),
    xlab=" ", ylab=" ")

If we had only two PCs, this plane would be enough to show us the cloud, but we have 11 PCs. We can graphically add one more dimension (3D) to see the cloud more clearly.

library(plotly)
T_training <- as.data.frame(pca_tr_opc$scores)
plot_ly(T_training, x=~T_training[,1], y=~T_training[,2],
             z=~T_training[,3], alpha = 0.7)

Practice yourself and rotate the plot to different angles.

14 Oct 2021

Modelling complex spectral data (soil) with the resemble package (IV)

Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux) 

Now we will use PCA with the "opc" method in order to find the optimal number of components. The rationale behind it is that if two spectra are close in the X space (near neighbors), their constituent values will be close as well, so the optimal number of components is the one that minimizes the RMSD (root mean square difference) between them.

For more details you can find more info from the developers of this algorithm: L. Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. (2013).

optimal_sel <- list(method = "opc", value = 40)

pca_tr_opc <- ortho_projection(Xr = training$spc_p,
                               Yr = training$Ciso,
                               method = "pca",
                               pc_selection = optimal_sel)

pca_tr_opc     # to obtain details of the PCA calculations.

 

We specify a maximum value of 40, and the "opc" method estimates that 11 is the best option. If we plot it, we can see the reason graphically:

The vignette shows an interesting piece of code which, if you run it, gives the XY plot of the reference Ciso value (for every spectrum) against the reference Ciso value of its closest neighbor, and we can see a high correlation, which is really the idea behind the "opc" method.
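A minimal base-R sketch of that plot, assuming 'd_pca' holds a square training dissimilarity matrix (the name is hypothetical):

d <- d_pca$dissimilarity
diag(d) <- Inf                 # ignore each sample's distance to itself
nn <- apply(d, 1, which.min)   # index of the closest neighbor of every sample
plot(training$Ciso, training$Ciso[nn],
     xlab = "Ciso", ylab = "Ciso of the closest neighbor")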

I find this option very interesting, so we will continue exploring the vignette, which will surely help with the purpose stated in its title.



13 Oct 2021

Modelling complex spectral data (soil) with the resemble package (III)

Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux) 

(from the previous post) These 825 samples are divided into two sets, one for training (value 1 in the variable "train") and another for testing (value 0).

NIRsoil %>%
  count(train)

    train       n
     0         207
     1         618

Let´s create these two dataframes:

training <- NIRsoil[NIRsoil$train == 1, ]
testing <- NIRsoil[NIRsoil$train == 0, ]


Now the vignette explains how to proceed with the dimensionality reduction and the methods we can use for that purpose: "pca", "pca_nipals" and "pls".

You can practice each one (as in the vignette code) using the default value for the variance explained by every component (by default > 1%), or you can set the cumulative variance you want.

The "pca" method uses the singular value decomposition algorithm, and "pca_nipals" uses the NIPALS algorithm. The "pls" method uses the response variables apart from the predictor variables (absorbances), maximizing the covariance of the latent variables from the predictors with the response (we have to specify which response variable to use).

Just follow the vignette examples for these three methods, and check how many components are recommended in each one.

For more advanced use we can configure the "pc_selection" argument, passing a list with the method ("var", "cumvar", "manual" or "opc") and a value: the variance each component must explain in the "var" case, or the total cumulative variance we want to explain in the "cumvar" case. We will look at the "opc" case in more detail in another post.
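Hedged examples of that argument (the method names are the ones listed above; the values are illustrative):

sel_var <- list(method = "var", value = 0.01)       # each PC explains > 1%
sel_cumvar <- list(method = "cumvar", value = 0.99) # 99% cumulative variance
pca_cumvar <- ortho_projection(Xr = training$spc_p,
                               method = "pca",
                               pc_selection = sel_cumvar)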

Take into account that we create the ortho-projections with the training set; afterwards we can project new data (the testing set) onto the planes created with the training set, to get the scores of this new testing data.

11 Oct 2021

Modelling complex spectral data (soil) with the resemble package (II)

Continuing with the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux), it is now time to look at the predictor variables, which are reflectance values of the soil samples acquired on a NIR (near infrared reflectance) instrument in the range from 1100 to 2498 nm in 2 nm steps, so we have 700 data points. We prepare a vector with the wavelengths and we call it "wavs" (as in the vignette).

wavs <- NIRsoil$spc %>% colnames() %>% as.numeric()

Now we can look at the raw spectra (spectra without any treatment):

matplot(x = wavs, y = t(NIRsoil$spc),  
        xlab = "Wavelengths, nm",
        ylab = "Absorbance", type = "l", 
        lty = 1, col = "#5177A133")

Now it is time to treat the spectra. In the vignette, the number of data points is reduced and the spectra are treated with a first derivative (Savitzky-Golay) using a first-order polynomial and a window of 5. These reduction and signal-improvement techniques are very useful to prepare the spectral "X" matrix for the dimensionality reduction that comes later, saving computation time.

We just have to use the code of the vignette, but of course we are free to use other mathematical treatments to reduce the scatter effects or improve the resolution.

NIRsoil$spc_p <- NIRsoil$spc %>%
  # reduce the number of data points (resample from 2 nm to 5 nm steps)
  resample(wav = wavs, new.wav = seq(min(wavs), max(wavs), by = 5)) %>%
  # apply the Savitzky-Golay function to the spectra:
  # polynomial order = 1, window = 5, first derivative
  savitzkyGolay(p = 1, w = 5, m = 1)

Let's create a new vector taking the wavelength reduction into account:

new_wavs <- as.matrix(as.numeric(colnames(NIRsoil$spc_p)))

and plot the spectra to see their appearance:

matplot(x = new_wavs, y = t(NIRsoil$spc_p),
        xlab = "Wavelengths, nm",
        ylab = "1st derivative",
        type = "l", lty = 1, col = "#5177A133")

Now in the data frame "NIRsoil" we have two spectra matrices, the raw spectra (spc) and the spectra reduced and math treated with the SG first derivative (spc_p).

We can check the dimensions of these matrices:

names(NIRsoil)
    "Nt" "Ciso" "CEC" "train" "spc" "spc_p"
dim(NIRsoil$spc)
    825 700
dim(NIRsoil$spc_p)
    825 276

In the next post we will continue the preprocessing and preparation of the data as the vignette suggests, trying to understand the different procedures to model the soil spectral data as well as possible.

7 Oct 2021

Modelling complex spectral data (soil) with the resemble package (I)

It is time for a new tutorial working with R and one of its packages, "resemble". We will use soil NIR spectra and we will follow the explanations given by the authors in this vignette:

Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)

For some time I have been interested in the use of NIR spectra to develop models, and following this tutorial can help me to better understand how to apply several types of regressions and to see their performance on a complex matrix like soil.


Let's create a new R Markdown file (.Rmd) and load the libraries that we will use:

library(tidyverse)
library(resemble)
library(prospectr)
library(magrittr)


Let's load the data "NIRsoil"; we can see that its class is "data.frame". This data frame combines the spectral matrix (NIRsoil$spc), which contains the predictor variables for every spectrum, with the response variables (Nt, Ciso and CEC) and another variable which specifies whether each spectrum is used in the training set or the test set. You can read in the vignette the authors' details about what each of these response variables represents and their units.

data("NIRsoil")

Run some code to get details about the number of samples available, the range of each response variable, how many NAs you have, the wavelength range of the NIR spectrophotometer used, and any other info you consider useful before going into the next steps. For example, to see the distribution of the nitrogen:

NIRsoil$Nt %>%
  summary()

NIRsoil %>%
  ggplot(aes(Nt)) +
  geom_histogram()


Min.   1st Qu.  Median   Mean  3rd Qu.   Max.    NA's 
0.200   1.100   1.300   1.766   2.000   8.800     180 


Do the same for the other constituents, changing the "Nt" response variable for the other variables "Ciso" and "CEC".

It is important to check how the response variables correlate with each other:

library(corrplot)

response <- NIRsoil[ , 1:3] %>%
  drop_na()
corrplot(cor(response), method = "number")



As we can see, there is a high correlation between the parameters.
We can check it in an XY plot for "Ciso" and "Nt":

response %>%
  ggplot(aes(x = Ciso, y = Nt)) +
  geom_point()


This is a good starting point, and we will continue in an upcoming post.

15 Aug 2021

Tidymodels: Modeling hotel bookings in R using tidymodels and recipes

In order to apply the tidymodels and tidyverse packages to spectroscopic data we have to become familiar with their functions, so we have to practice with tutorials on other data sets, like in this case with data from hotel bookings.
So let's watch Julia working with this data set, and try to practice by yourself with the material available on the TidyTuesday and tidymodels websites.


There is a tutorial about Infratec meat spectra with tidymodels (not in video form), using PLS, so we will look at it in another post.

12 Aug 2021

Time based validation improvement for pH in cocoa paste

The best way to check the performance of a calibration is with a new, time-based validation set. In this case a calibration has been developed to predict pH in cocoa paste, choosing the best performance with a cross validation using groups.

After this, the model was installed in routine use, and after 2 months new samples with reference values attached were collected, so we can run the validation and see the statistics. This is the performance:


One of the samples, with a high GH, is classified as an outlier, so we might think it was a lab error and exclude it; but excluding it without a good reason must not be done.

Why not develop the model again with the old training set, trying to find a better math treatment or configuration, and validate with this new time-based set? The results are surprising with a "None-0-0-1-1" math treatment and 15 PLS terms:

The RSQ increases and the sample with the high pH is predicted fine. Take into account that we have a very short range for pH, so we cannot expect a high RSQ value.