Let's continue with the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux).
As we saw in previous posts, we can create several dissimilarity matrices using different methods. The idea is that when we analyze a sample (acquire its spectrum), we can search a database for the sample most similar to it. If the algorithm finds a very similar (almost identical) sample, there is a good chance that their characteristics (the concentrations of their constituents) will be almost the same. It can also happen that the sample found is similar, but not similar enough. In that case some characteristics may agree reasonably well and others not, so it is necessary to keep adding samples to the training database, so that the next analysis has a better chance of finding a close match (a lower "knn" distance or a higher "correlation").
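To make the search idea concrete, here is a minimal sketch in plain Python (not the resemble code itself). It assumes a correlation-based dissimilarity defined as (1 - r) / 2, which is one common convention. The spectra, the sample names and the `threshold` value are all made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two spectra of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def correlation_dissimilarity(x, y):
    """Assumed convention: 0 for perfectly correlated spectra, 1 for r = -1."""
    return (1.0 - pearson(x, y)) / 2.0

def closest_match(query, database):
    """Return the name and dissimilarity of the database spectrum closest to query."""
    best_name, best_d = None, float("inf")
    for name, spectrum in database.items():
        d = correlation_dissimilarity(query, spectrum)
        if d < best_d:
            best_name, best_d = name, d
    return best_name, best_d

# Hypothetical toy database of three spectra (five wavelengths each).
database = {
    "soil_A": [0.10, 0.20, 0.35, 0.50, 0.40],
    "soil_B": [0.50, 0.40, 0.30, 0.20, 0.10],
    "soil_C": [0.12, 0.22, 0.33, 0.52, 0.41],
}
# The query has the same shape as soil_A, only scaled.
query = [0.20, 0.40, 0.70, 1.00, 0.80]

best_name, best_d = closest_match(query, database)
threshold = 0.01  # hypothetical cut-off for "similar enough"
is_close_enough = best_d <= threshold
print(best_name, best_d, is_close_enough)
```

If the best match does not pass the threshold, that is the situation described above: the database does not yet contain a sample close enough to the new one.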
One of the functions of the resemble package is "sim_eval". This function searches for the most similar observation (closest neighbor) of each observation in a given data set, based on a dissimilarity matrix (e.g. a distance matrix). Each observation is then compared against its closest neighbor in terms of the side information provided (the constituent values). For continuous variables, the root mean square of differences (RMSD) and the correlation coefficient (R) are used; for discrete variables, the kappa index is used.
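The logic of this evaluation for a continuous variable can be sketched in a few lines of plain Python. This is only an illustration of the idea described above, not the implementation used by "sim_eval"; the function name `nearest_neighbor_eval`, the dissimilarity matrix and the side information values are hypothetical.

```python
import math

def nearest_neighbor_eval(diss, side_info):
    """Compare each observation with its closest neighbor.

    diss: square dissimilarity matrix (list of lists).
    side_info: one continuous value per observation.
    Returns the neighbor indices, the RMSD and the correlation R.
    """
    n = len(diss)
    nn_idx = []
    for i in range(n):
        # Closest OTHER observation: the diagonal (i == k) is skipped.
        j = min((k for k in range(n) if k != i), key=lambda k: diss[i][k])
        nn_idx.append(j)

    nn_values = [side_info[j] for j in nn_idx]

    # Root mean square of differences between each value and its neighbor's value.
    rmsd = math.sqrt(
        sum((a - b) ** 2 for a, b in zip(side_info, nn_values)) / n
    )

    # Pearson correlation between the values and the neighbors' values
    # (assumes neither set of values is constant).
    ma, mb = sum(side_info) / n, sum(nn_values) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(side_info, nn_values))
    va = sum((a - ma) ** 2 for a in side_info)
    vb = sum((b - mb) ** 2 for b in nn_values)
    r = cov / math.sqrt(va * vb)
    return nn_idx, rmsd, r

# Hypothetical example: four observations forming two close pairs.
diss = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
side_info = [1.0, 1.2, 3.0, 3.4]  # made-up constituent values

nn_idx, rmsd, r = nearest_neighbor_eval(diss, side_info)
print(nn_idx, round(rmsd, 4), round(r, 4))
```

A dissimilarity method that pairs each sample with a neighbor of similar composition gives a low RMSD and a high R, which is the criterion used to compare the methods.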
The vignette calculates the dissimilarity matrices with all the methods available in resemble, and tries to find which one gives the best performance for the "Ciso" parameter (carbon in g/100 g of dry soil). Run the code and you will get the statistics for all of them:
We want to find the method with the lowest "RMSD" and the highest "R". In the previous post I did not use all of them; I used only "pcad", "cd" and "mcd" (Mahalanobis distance in the orthogonal space, correlation distance and moving-window correlation distance), but of course there are others (based on PLS, Euclidean distance, cosine, ...).
As we can see, the best of the three I used is the "Mahalanobis distance in the orthogonal space", followed by the "Moving-window correlation" and the "Correlation". But, as you can see, the best choice overall is the optimal PLS, and that makes sense because the PLS terms are built to be more related to the constituent of interest.
Statistics are fine for checking the performance, but plots are also useful, and the vignette shows you how to get them: