1 Apr. 2016

Tutorials with Resemble (Part 4.b)

Using the NIRsoil demo spectra from resemble, we can practice with the function "orthoProjection", following the explanation in the "resemble.pdf" manual.
In this case we use "orthoProjection" with the "pca" method which, as we saw in the previous post, uses the SVD algorithm to calculate the PCA.
In the call below we allow a maximum of 40 PCs, and an algorithm selects the recommended number within that limit.
We can start this part of the tutorial with:

pcProj <- orthoProjection(Xr = X_train, X2 = NULL, Yr = Y_train,
                          method = "pca", pcSelection = list("opc", 40))

As we can see, for "pcSelection" we have used the "opc" method, and the candidate numbers of components go from 1 to 40. Of course, not all of them will be necessary, and the "opc" method decides how many are selected.
In the resemble manual we can read:

“When method = "opc", the selection of the components is carried out by using an iterative method based on the side information concept (Ramirez-Lopez et al. 2013a, 2013b). First let be P a sequence of retained components (so that P = 1, 2, ..., k). At each iteration, the function computes a dissimilarity matrix retaining p_i components. The values of the side information of the samples are compared against the side information values of their most spectrally similar samples. The optimal number of components retrieved by the function is the one that minimizes the root mean squared differences (RMSD) in the case of continuous variables”.
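To make the idea in the quote more concrete, here is a minimal base R sketch of an "opc"-like selection. It is only an illustration under several assumptions, not the code used inside resemble: the data are simulated (not NIRsoil), only the single nearest neighbour of each sample is used, and the names "max_pcs", "opc_eval" and "n_selected" are hypothetical.

```r
# Simulated stand-ins for the spectra (Xr) and the side information (Yr).
# These are NOT the NIRsoil data; they only give the sketch something to run on.
set.seed(1)
n <- 80; p <- 50; max_pcs <- 10
latent <- matrix(rnorm(n * 3), n, 3)
loads  <- matrix(rnorm(3 * p), 3, p)
Xr <- latent %*% loads + matrix(rnorm(n * p, sd = 0.05), n, p)
Yr <- drop(latent %*% c(2, -1, 0.5)) + rnorm(n, sd = 0.1)

# PCA by SVD of the column-centred spectra
Xc <- scale(Xr, center = TRUE, scale = FALSE)
sv <- svd(Xc, nu = max_pcs, nv = max_pcs)
scores <- sv$u %*% diag(sv$d[1:max_pcs])   # n x max_pcs score matrix

rmsd <- numeric(max_pcs)
for (k in seq_len(max_pcs)) {
  # PC scores are uncorrelated, so scaling each column to unit variance
  # makes the Euclidean distance equal to the Mahalanobis distance
  Sk <- scale(scores[, 1:k, drop = FALSE])
  D  <- as.matrix(dist(Sk))
  diag(D) <- Inf                  # a sample cannot be its own neighbour
  nn <- apply(D, 1, which.min)    # most similar sample for each sample
  # Compare each sample's side information with that of its neighbour
  rmsd[k] <- sqrt(mean((Yr - Yr[nn])^2))
}

opc_eval   <- data.frame(pc = seq_len(max_pcs), rmsd = rmsd)
n_selected <- which.min(rmsd)     # number of PCs with the lowest RMSD
print(opc_eval)
print(n_selected)
```

The loop plays the role of the "iterations" in the quote: one dissimilarity matrix per candidate number of components, and one RMSD value for each of them.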

If we check (after the pcProj calculation):
> pcProj$n.components
[1] 20

We can see that the number selected is 20.
We can see this more graphically in a plot:
> plot(pcProj)

For this calculation (using the "opc" principal components selection), the reference matrix "Yr" is needed in addition to the spectra matrix "Xr".
We can see the list of RMSD values in:
> pcProj$opcEval

If we use another method, such as cumulative variance, "Yr" is not needed. Of course, if the method used for orthoProjection is "pls" instead of "pca", the reference matrix "Yr" is always needed.
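As an illustration of why a cumulative variance rule does not need "Yr", here is a small base R sketch that selects the number of PCs from the spectra alone. It uses "prcomp" on simulated data rather than resemble itself, and the 0.99 threshold and the names "threshold" and "n_cumvar" are assumptions made for the example.

```r
# Simulated spectra-like matrix with correlated columns (not NIRsoil)
set.seed(2)
X <- matrix(rnorm(40 * 15), 40, 15) %*% matrix(rnorm(15 * 15), 15, 15)

# PCA on the centred data; no reference values are involved
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each PC and its running total
expl_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(expl_var)

# Keep the smallest number of PCs whose cumulative variance
# reaches the chosen threshold (0.99 is an assumed value)
threshold <- 0.99
n_cumvar  <- which(cum_var >= threshold)[1]
print(round(cum_var, 4))
print(n_cumvar)
```

Only "X" enters the calculation, which is the difference with the "opc" selection.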
The principal component space (with the number of components selected) will be used to calculate the Mahalanobis distance (distance to the PC centroid) for every sample in the validation spectra matrix "Xu".
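The distance calculation can be sketched in base R as follows. This is a hedged example and not the resemble workflow: "Xr" and "Xu" are simulated here, the number of components "k" is fixed by hand, and "prcomp", "predict" and "mahalanobis" from the stats package stand in for the projection.

```r
# Simulated training (Xr) and validation (Xu) spectra; not NIRsoil
set.seed(3)
n <- 50; p <- 20; k <- 4           # k = assumed number of selected PCs
Xr <- matrix(rnorm(n * p), n, p)
Xu <- matrix(rnorm(10 * p), 10, p)

# Build the PC space from the training spectra only
pca <- prcomp(Xr, center = TRUE, scale. = FALSE)
Tr  <- pca$x[, 1:k, drop = FALSE]                      # training scores

# Project the validation spectra onto the same PC space
Tu <- predict(pca, newdata = Xu)[, 1:k, drop = FALSE]  # validation scores

# mahalanobis() returns squared distances, so take the square root.
# The centroid and covariance come from the training scores.
md <- sqrt(mahalanobis(Tu, center = colMeans(Tr), cov = cov(Tr)))
print(round(md, 3))
```

Samples in "Xu" with a large value of "md" lie far from the centroid of the training set in the PC space.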