1 feb 2018

Correlation between scores and PC / PLS terms

The first PC is chosen to explain the maximum variability in the X matrix. Once it is extracted, the second PC explains the maximum of the variance that remains, and the process is repeated until almost all the important variance is explained. What is left is noise, and we don't want to incorporate that variance into the model.
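The sequential extraction described above can be sketched with base R. This is only an illustration on simulated data: the matrix `X`, the two latent sources `t1` and `t2`, and the noise level are assumptions, not the soy meal spectra.

```r
# Simulated data (hypothetical): 50 samples, 6 variables built from
# two latent sources plus a small amount of noise
set.seed(1)
n  <- 50
t1 <- rnorm(n, sd = 3)
t2 <- rnorm(n, sd = 1)
X  <- cbind(t1, t1 + t2, t2, t1 - t2, 2 * t1, t2) +
  matrix(rnorm(n * 6, sd = 0.1), n, 6)

# PCA on the mean-centered matrix
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of the total variance explained by each PC, and the accumulated value
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)
round(cbind(var_explained, cum_explained), 4)
```

Each PC explains no more variance than the previous one, and the terms left after the accumulated value flattens out are mostly describing the noise.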

Once we have the terms, the samples are projected onto the PC terms, so every sample has a score for every term. Therefore, we have a score matrix with “N” samples (rows) and “A” components (columns).
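As a small sketch of that projection, assuming a random matrix in place of real spectra (the sizes `N`, `K` and `A` are made up for the example), the scores returned by `prcomp` can be reproduced by multiplying the centered data by the loadings:

```r
# Hypothetical sizes: N samples, K variables, A retained components
set.seed(2)
N <- 30; K <- 10; A <- 4
X <- matrix(rnorm(N * K), nrow = N, ncol = K)

pca    <- prcomp(X, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:A]
dim(scores)   # N rows (samples) by A columns (components)

# The same scores obtained by projecting the centered samples onto the loadings
Xc            <- scale(X, center = TRUE, scale = FALSE)
scores_manual <- Xc %*% pca$rotation[, 1:A]
all.equal(unname(scores), unname(scores_manual))
```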

The variance captured by each term can be due to different sources or to a mixture of sources.

In the case of PLS we are looking for a compromise: explaining the maximum possible variance in X while, at the same time, explaining the maximum variance in Y. The PLS algorithm also gives us a score matrix, and these scores are more correlated with the constituent than the scores calculated with PC.
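To make the difference between the two kinds of scores concrete, here is a minimal sketch in base R of a PLS1 score extraction (a single Y variable, NIPALS-style deflation). The function `pls1_scores` and the simulated data are hypothetical and written only for this illustration; this is not necessarily the algorithm behind the PLS regression object used in this post.

```r
# Minimal PLS1 sketch (hypothetical helper): scores for a single y variable
pls1_scores <- function(X, y, ncomp) {
  Xk <- scale(X, center = TRUE, scale = FALSE)   # centered X
  yk <- y - mean(y)                              # centered y
  scores <- matrix(0, nrow(Xk), ncomp)
  for (a in seq_len(ncomp)) {
    w   <- crossprod(Xk, yk)                     # direction of maximum covariance with y
    w   <- w / sqrt(sum(w^2))
    t_a <- Xk %*% w                              # scores for this term
    p   <- crossprod(Xk, t_a) / sum(t_a^2)       # X loadings
    q   <- sum(yk * t_a) / sum(t_a^2)            # y loading
    Xk  <- Xk - t_a %*% t(p)                     # remove what this term explains in X
    yk  <- yk - as.numeric(t_a) * q              # and in y
    scores[, a] <- t_a
  }
  scores
}

# Simulated data (assumption): a large source unrelated to y and a small one driving y
set.seed(3)
n     <- 100
big   <- rnorm(n, sd = 5)
small <- rnorm(n, sd = 1)
X <- cbind(big, big + small, small, 2 * big - small) +
  matrix(rnorm(n * 4, sd = 0.1), n, 4)
y <- small + rnorm(n, sd = 0.2)

pc_scores  <- prcomp(X, center = TRUE)$x[, 1:2]
pls_scores <- pls1_scores(X, y, ncomp = 2)

cor(pc_scores, y)    # correlation of each PC score with y
cor(pls_scores, y)   # correlation of each PLS score with y
```

The PC terms are computed without looking at `y`, while every PLS weight vector is built from the covariance between X and `y`, which is where the compromise described above comes from.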

In the case of the soy meal in the conveyor, we can calculate the correlation between the scores for each of the four PCs and the protein:

> cor(scores_4t_pc[,1:4],soy_ift_prot1r1$Prot)

               [,1]
PC term 1  -0.2105997
PC term 2   0.3445256
PC term 3   0.1647146
PC term 4  -0.6888083

We can do the same, but with the scores of the PLS regression:

> cor(Prot_plsr_r1$scores[,1:4],soy_ift_prot1r1$Prot)

            [,1]
Comp 1 0.2129742
Comp 2 0.4193727
Comp 3 0.5348858
Comp 4 0.4425912

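As we can see, the correlations are higher for the PLS scores in the first three terms, but there are some curiosities about the PC scores (the fourth PC term has the largest correlation in absolute value, -0.69) that we can try to check in future posts.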
