30 nov 2021

Comparing methods with a new sample set (soil)

I developed soil texture calibrations with the three methods I use (LOCAL, PLS and ANN) from more than 2000 samples, and I now have a new set of samples to test them (samples scanned after the calibrations were developed). The idea is to check which one performs better.

First, as usual, I looked at the spectra and found that some of the new samples seem to contain gypsum, so I marked these samples as spectral outliers. I validated the results anyway, because the idea is to see those samples marked (red cross) in the XY validation plot.


Now, for the Silt parameter, I plot the predicted values of the validation set vs. the reference values:

                                             SILT validation XY plots


The samples that seem to contain gypsum seem to be well extrapolated by the PLS calibration, but not by the ANN. The LOCAL calibration did not find similar samples in the database and gives no predictions except for one of the samples.

Gypsum can be found in any of the three texture fractions, mostly in the clay fraction, but in some areas gypsum-rich silty soils can be found, and this could be the case here.

These samples are very useful to extend the variability of the database and a good way to check how the calibrations work with extreme samples. When the database has a better distribution, the ANN and LOCAL calibrations should perform better when predicting such samples.
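A comparison like this can also be summarized numerically. The sketch below is only an illustration in base R: the numbers are made up (they are not my validation results), the column names are hypothetical, and LOCAL is assumed to return NA when it gives no prediction. It calculates n, RMSE, bias and R² per method, with and without the samples flagged as spectral outliers.

```r
# Made-up validation table: reference silt values and the predictions of the
# three calibrations. LOCAL is assumed to return NA when it finds no neighbours.
val <- data.frame(
  ref     = c(20, 35, 50, 42, 28, 61),
  pls     = c(22, 33, 47, 45, 27, 58),
  ann     = c(21, 36, 52, 40, 30, 49),
  local   = c(19, 34, 51, 43, NA, NA),
  outlier = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)  # flagged spectra
)

validation_stats <- function(ref, pred) {
  ok  <- !is.na(pred)              # keep only the samples that were predicted
  res <- pred[ok] - ref[ok]
  c(n    = sum(ok),
    rmse = sqrt(mean(res^2)),
    bias = mean(res),
    r2   = cor(pred[ok], ref[ok])^2)
}

methods <- c("pls", "ann", "local")

# Statistics with all the validation samples
stats_all <- t(sapply(methods, function(m) validation_stats(val$ref, val[[m]])))

# Statistics without the samples flagged as spectral outliers
keep <- !val$outlier
stats_in <- t(sapply(methods, function(m)
  validation_stats(val$ref[keep], val[[m]][keep])))

print(round(stats_all, 2))
print(round(stats_in, 2))
```

Reporting n next to the RMSE matters here, because a method that declines to predict the difficult samples is being judged on an easier subset.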



23 nov 2021

Gypsum soil spectra

NIR soil spectra can have different shapes and bands. Normally when working with soil spectra we can identify the bands associated with clay, carbonates, organic matter, ... This time, while looking at a soil spectra database, I found some different shapes that did not interfere with other bands, so they should be easy to identify by consulting the bibliography. These bands belong to soils with high gypsum content.

In the figure, high-gypsum soil spectra are shown together with spectra of other soil types. See the triple-band shape in the 1400 to 1550 nm region and the peak at 1748 nm, a clear band where the other soils do not absorb.








19 nov 2021

Modelling complex spectral data (soil) with the resemble package (X)

Let's continue with the vignette: "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux).

Throughout the tutorial we have seen how to measure the distance, in an orthogonal space, from a spectrum to all the spectra of a training set. There are different kinds of distances; in the orthogonal space I usually use the Mahalanobis distance, but you can use others, such as the Euclidean. We just have to select the distance method we want when calculating the dissimilarities.

Instead of a distance, we can calculate the correlation "R" of one sample versus all the samples in the training set. For this we use the spectrum (all the wavelengths we select), with or without math treatments. Normally we apply some math treatments to remove the scatter or to increase the resolution of overlapped bands.
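To make the difference between the two distances concrete, here is a minimal base R sketch. It uses simulated numbers instead of real spectra, and the number of components (5) is an arbitrary assumption. The Mahalanobis distance in the PC space is simply the Euclidean distance after every score is divided by the standard deviation of its component. This is not the resemble code, only the idea behind it.

```r
set.seed(1)
# Simulated stand-in for spectra: 50 training samples x 20 "wavelengths"
Xr <- matrix(rnorm(50 * 20), nrow = 50)
xu <- matrix(rnorm(20), nrow = 1)        # one new sample

n_pc <- 5                                # number of components (arbitrary here)
pca  <- prcomp(Xr, center = TRUE, scale. = FALSE)

scores_r <- pca$x[, 1:n_pc]              # training scores
scores_u <- predict(pca, xu)[1, 1:n_pc]  # new sample projected on the same PCs
sdev     <- pca$sdev[1:n_pc]             # standard deviation of each component

# Differences between every training sample and the new sample, per component
diff_scores <- sweep(scores_r, 2, scores_u, "-")

# Euclidean distance: components with large variance dominate
d_eucl <- sqrt(rowSums(diff_scores^2))

# Mahalanobis distance: every component is scaled to unit variance first
d_mahal <- sqrt(rowSums(sweep(diff_scores, 2, sdev, "/")^2))

head(order(d_mahal))   # indices of the closest training samples
```

The two distances can rank the neighbours differently, because the Euclidean version is dominated by the first components.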
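The correlation option can be sketched in a few lines of base R. The data are simulated, and the new spectrum is built on purpose to resemble training sample 7. The conversion of the correlation into a 0 to 1 dissimilarity, (1 - r) / 2, is how I understand the correlation dissimilarity to be defined, so take it as an assumption.

```r
set.seed(2)
# Simulated training "spectra": 30 samples (rows) x 100 wavelengths
Xr <- matrix(rnorm(30 * 100), nrow = 30)
# New spectrum, built to resemble training sample 7 plus a little noise
xu <- Xr[7, ] + rnorm(100, sd = 0.1)

# Pearson correlation of the new spectrum with every training spectrum
r <- as.vector(cor(t(Xr), xu))

# Correlation turned into a dissimilarity between 0 and 1 (assumed definition)
cd <- (1 - r) / 2

which.max(r)       # most similar training sample
head(order(cd))    # training samples ranked by dissimilarity
```

With real spectra the math treatments would be applied to both Xr and xu before this step.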

One approach is to select a fixed number of samples (for example 100). This way we select the 100 closest to the new sample, but some of them can be far enough away to differ in composition, and so not good enough to build a custom calibration that predicts the new sample accurately. Another approach is to select a number of samples within a range (for example between 100 and 200), with the additional requirement that each selected sample must be below a certain distance value (threshold), or above it in the case of correlation. With the selected samples we can develop a regression (PLS) to predict the new sample. If not enough samples are found, we won't get any result.

Of course, this method has some drawbacks. For example, the selected samples may be very similar in composition, so we won't have enough variability to develop the PLS models.

With the distance option we can work in a PLS space, where the response variable is taken into account, so different samples can be chosen for each constituent of the same sample.

The vignette shows this whole process very well, so using the code is straightforward.

There are cases where we want to force some samples to take part in the model; this is called “spiking the neighborhoods”.
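The selection rule with a range and a threshold can be sketched in base R as follows. The function name and its arguments are hypothetical, and the behaviour when too few samples are found (no result) follows what is described above; other implementations may handle that case differently. The PLS step itself is left out.

```r
# d: dissimilarities between one new sample and every training sample
# Returns the indices of the selected neighbours, or NULL if too few are found.
select_neighbours <- function(d, k_min, k_max, threshold) {
  ord    <- order(d)                        # training samples, closest first
  within <- ord[d[ord] <= threshold]        # only those under the threshold
  if (length(within) < k_min) return(NULL)  # not enough neighbours: no prediction
  head(within, k_max)                       # never more than k_max samples
}

# Toy distances from a new sample to 8 training samples
d <- c(0.9, 0.2, 0.4, 1.5, 0.1, 0.3, 2.0, 0.35)

select_neighbours(d, k_min = 3, k_max = 4, threshold = 0.5)   # 5 2 6 8
select_neighbours(d, k_min = 3, k_max = 4, threshold = 0.15)  # NULL
```

For a correlation, the comparison is reversed (keep the samples above the threshold, ordered from the highest value).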
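The idea of spiking can be sketched in base R like this. The function is hypothetical, and it assumes that the forced samples take places in the neighbourhood without increasing its size, so the most distant neighbours are the ones left out.

```r
# d: dissimilarities from the new sample to every training sample
# k: neighbourhood size; spike: indices of the samples we force into the model
spiked_neighbours <- function(d, k, spike) {
  ord    <- order(d)                   # closest first
  others <- ord[!ord %in% spike]       # nearest samples that are not forced
  n_free <- max(0, k - length(spike))  # places left after the forced samples
  c(spike, head(others, n_free))
}

d <- c(0.9, 0.2, 0.4, 1.5, 0.1, 0.3, 2.0, 0.35)

# Samples 4 and 7 are far away, but we force them into a neighbourhood of 4
spiked_neighbours(d, k = 4, spike = c(4, 7))   # 4 7 5 2
```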



4 nov 2021

Practicing with soil own data and Resemble (I)

These days I am posting about Resemble, following the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux) (more posts are still to come), but it is time to try it with my own soil data.

I have imported a soil data set and split it into a training and a test set. I applied the Savitzky-Golay first derivative.

Now I run the orthogonal projection (principal component analysis), looking for the optimal number of components for the Clay parameter.
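For readers who want to see what the Savitzky-Golay first derivative does, here is a minimal base R sketch. It is not the code I ran: the function names, the window of 11 points and the second-order polynomial are assumptions for illustration, and the spectra are simulated. Packages such as prospectr provide this filter ready to use.

```r
# Savitzky-Golay coefficients: fit a polynomial by least squares inside a
# moving window and take the derivative of the fit at the window centre.
sg_coefs <- function(window, poly, deriv) {
  half <- (window - 1) / 2
  z    <- -half:half
  A    <- outer(z, 0:poly, "^")                 # Vandermonde design matrix
  # Row (deriv + 1) of the pseudo-inverse gives the derivative coefficients
  factorial(deriv) * (solve(crossprod(A)) %*% t(A))[deriv + 1, ]
}

# Apply the filter to every row (spectrum) of X. The result loses
# (window - 1) / 2 points at each end. Derivative is per data point step.
sg_filter <- function(X, window = 11, poly = 2, deriv = 1) {
  h       <- sg_coefs(window, poly, deriv)
  half    <- (window - 1) / 2
  centres <- (half + 1):(ncol(X) - half)
  out     <- matrix(NA_real_, nrow(X), length(centres))
  for (i in seq_along(centres)) {
    j <- centres[i]
    out[, i] <- X[, (j - half):(j + half), drop = FALSE] %*% h
  }
  out
}

# Two simulated "spectra" with one Gaussian band each
wl <- seq(1100, 2498, by = 2)
X  <- rbind(exp(-((wl - 1450) / 40)^2) + 0.2,
            exp(-((wl - 1940) / 60)^2) + 0.5)

X_d1 <- sg_filter(X, window = 11, poly = 2, deriv = 1)
dim(X_d1)   # 2 spectra, 690 points (700 minus 5 at each end)
```

The first derivative removes the constant baseline offset between the two spectra, which is one reason it is such a common math treatment.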

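library(resemble)

# "opc" selection of the number of components, testing up to 40
optimal_sel <- list(method = "opc", value = 40)
# PCA projection, with Clay as side information for the component selection
pca_training_opc <- ortho_projection(Xr = training$spc_nir_p,
                                     Yr = training$Clay,
                                     method = "pca",
                                     pc_selection = optimal_sel)
pca_training_opc
plot(pca_training_opc, col = "#FF1A00CC")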

19 PC terms are chosen. If you remember, this is the number that gives the smallest RMSD between the clay lab value of each sample and the clay lab value of its closest neighbor. The figures show the selection of the 19 terms and the XY plot from which the RMSD is calculated for the training samples.


Finally, I show the texture triangle for the samples I am using (whole data set). In a recent post I published a YouTube video from ISRIC that shows how to obtain it with R.
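The selection criterion can be reproduced by hand to see how it works. The sketch below is base R with simulated data (not my soil data), and the maximum of 10 components is an arbitrary assumption. For each number of components it finds the closest neighbor of every sample in the scaled score space and calculates the RMSD between the sample's reference value and its neighbor's value.

```r
set.seed(3)
n <- 80
# Simulated data: 3 hidden factors generate 40 "wavelengths" plus noise
latent <- matrix(rnorm(n * 3), n)
X <- latent %*% matrix(rnorm(3 * 40), 3) + matrix(rnorm(n * 40, sd = 0.3), n)
# Simulated "clay" value, driven by the first hidden factor
y <- 30 + 5 * latent[, 1] + rnorm(n, sd = 0.5)

pca <- prcomp(X)

# RMSD between each sample's value and the value of its closest neighbour
nn_rmsd <- function(scores, y) {
  d <- as.matrix(dist(scores))
  diag(d) <- Inf                  # a sample cannot be its own neighbour
  nn <- apply(d, 1, which.min)    # index of the closest neighbour
  sqrt(mean((y - y[nn])^2))
}

max_pc <- 10
rmsd <- sapply(1:max_pc, function(k) {
  # Scores scaled to unit variance (Mahalanobis distance in the PC space)
  s <- sweep(pca$x[, 1:k, drop = FALSE], 2, pca$sdev[1:k], "/")
  nn_rmsd(s, y)
})

best <- which.min(rmsd)           # number of components with the smallest RMSD
plot(1:max_pc, rmsd, type = "b", xlab = "Number of components", ylab = "RMSD")
```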


2 nov 2021

Modelling complex spectral data (soil) with the resemble package (IX)

Let's continue with the vignette: "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux).

As we saw in previous posts, we can create several dissimilarity matrices using different methods, with the idea that when we analyze a sample (acquire its spectrum) we can search the database for the sample most similar to it. If the algorithm finds a very similar (almost identical) sample, there is a high probability that their characteristics (the concentration values of their components) are almost the same. It can also happen that the sample found is similar, but not similar enough; in that case some characteristics may agree to a certain degree and others not. It is then necessary to keep filling the training database with more samples, so that in future analyses the probability of finding better matches (with a lower "knn" distance or a higher "correlation") increases.

One of the functions of the Resemble package is "sim_eval". This function searches for the most similar observation (closest neighbor) of each observation in a given data set based on a dissimilarity (e.g. a distance matrix). The observations are compared against their corresponding closest observations in terms of the side information provided (constituent values). The root mean square of differences (RMSD) and the correlation coefficient (R) are used for continuous variables, and the kappa index is used for discrete variables.

The vignette calculates the dissimilarity matrices with all the methods available in Resemble and tries to find which one gives the best performance for the "Ciso" parameter (carbon in g/100 g of dry soil). Run the code and you will get the statistics for all of them:
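To show what these statistics mean, here is a small base R sketch of the same kind of evaluation. It is not the "sim_eval" code: the function name is hypothetical and the toy data are made up. The kappa index is calculated here as Cohen's kappa between the class of each observation and the class of its closest neighbor.

```r
# d: dissimilarity matrix (or "dist" object); y: side information
nn_eval <- function(d, y) {
  d <- as.matrix(d)
  diag(d) <- Inf                       # an observation is not its own neighbour
  nn <- apply(d, 1, which.min)         # closest neighbour of each observation
  if (is.numeric(y)) {
    # Continuous variable: RMSD and correlation against the neighbour's value
    c(rmsd = sqrt(mean((y - y[nn])^2)), r = cor(y, y[nn]))
  } else {
    # Discrete variable: agreement between own class and neighbour's class
    y   <- factor(y)
    tab <- table(y, factor(y[nn], levels = levels(y)))
    po  <- sum(diag(tab)) / sum(tab)                       # observed agreement
    pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2   # agreement by chance
    c(kappa = (po - pe) / (1 - pe))
  }
}

# Toy example: six observations described by a single variable
x     <- c(1, 1.1, 5, 5.2, 9, 9.1)
y_num <- c(10, 11, 50, 52, 90, 91)
y_cls <- c("a", "a", "b", "b", "c", "c")

nn_eval(dist(x), y_num)   # rmsd = sqrt(2), r close to 1
nn_eval(dist(x), y_cls)   # kappa = 1 (every neighbour has the same class)
```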



We want to find the method with the lowest "RMSD" and the highest "R". In the previous post I did not use all of them; I used "pcad", "cd" and "mcd" (Mahalanobis distance in the orthogonal space, correlation distance and moving window correlation distance), but of course there are others, using PLS, Euclidean distance, cosine, ...

As we can see, the best of the three I used is the "Mahalanobis distance in the orthogonal space", followed by the "moving window correlation" and the "correlation". But, as you can see, the best choice overall is the optimal PLS, and that makes sense because the terms we use are more closely related to the constituent of interest.

Statistics are fine for checking the performance, but graphics also help, and the vignette shows how to get them: