19 oct 2021

Modelling complex spectral data (soil) with the resemble package (VI)

Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux) 

Imagine that just two Principal Components would be enough to explain 99% of the variance, so we can see the samples in a unique plane. It is easy to calculate the distance between a sample and all the rest, just drawing lines and calculating their distance. After that we can write their value in a matrix where the diagonal would be cero (distance between the sample and itself). In this case, it is the training set so we have 618 samples (618 dots) and the matrix would be a matrix with 618 rows and 618 columns (618x618).

We can see cases where the samples are very close (blue circles), so their neighbor distance is very small (very low values), and we can consider (we saw as well in previous post) that their constituents’ values  would be very similar.

In the case that we have more components to explain the variance (11 as we saw in previous post), the dimension of the matrix would be the same (618x618), but the distances would be not in a plane, if not in a multidimensional space.

This matrix is called "dissimilarity matrix" in the vignette, and has a great importance in the development of calculations and algorithms.

 If we project a test set in the PC space, we get the distance between every sample of the test set vs. every sample of the training set. In the case of our test  set we have 207 samples, so we would get a 618 x 207 matrix with the distances and we can see if the  testing samples are well represented or if they can expand and make more robust the database.

No hay comentarios:

Publicar un comentario