R & Chemometrics: febrero 2014

23 feb 2014

NIPALS: Looking to the scores 3D plots

These are some 3D plots we can get from R to understand better the NIPALS function, and spectra reconstruction we have been talking in the lasts posts.
We can see the 3D cube score plots, draw the orthogonal projections to the different planes (in this case to the plane formed by PC1 and PC2, and so on

library(scatterplot3d)
scatterplot3d(T1_nipals,T2_nipals,T3_nipals)
scatterplot3d (T1_nipals,T2_nipals,T3_nipals,
+pch=16,highlight.3d=TRUE,type="h")

We can draw also this plot and rotate it with the mouse to get the orientation we prefer.
library(rgl)
plot3d(T1_nipals,T2_nipals,T3_nipals,col="red",size=5)

Another nice plot that we can rotate with the mouse is this one:

library(Rcmdr)

scatter3d(sflw.msc3.tra$G00rmn,T2_nipals,ç

+ T3_nipals,surface=FALSE,grid=FALSE)

In this case one of the axes is the constituent, and we can see the patterns of correlation between the constituent and the scores.

If interested in this type of plots, you can read the post: 3D plots in R from the Revolution blog.

Once we have our PCA space, objects can be projected into the PCA space, the distance from the object to this projection is the orthogonal distance (O.D.), this projection will have a score distance (S.D), respect to the center and for every PC.

We can have three kinds of situations:

Large O.D. but short S.D
Large O.D and large S.D. (Bad leverage point)
Short O.D and large S.D. (Good leverage point)

It is important a plot to detect this type of outliers, and to fix some limits for detection.
pcaDiagplot(sflw.msc3.tra$NIRmsc,X.pca=X_nipals,
+ a=4,pch=c(20:24)[sflw.msc3.tra$Operator],
+ col=c("red","blue","green","brown")
+ [sflw.msc3.tra$Operator])

In the table of distances, we can check the outliers for the O.D and S.D. We have three samples which have a large OD and a large SD at the same time (14,80,87), two samples with a large OD but the SD into the limits (79,85), and two samples with a large SD, but the OD is into the limits (19,86).

outliers<-sflw.msc3.tra$NIRmsc[c(14,80,87,79,85,19,86),]
matplot(wave.NIR,t(outliers),type="l",lty=1,xlab="nm",
+ ylab="log 1/R")

15 feb 2014

NIPALS: Correlation of the Scores with the Constituents

This post continues with the understanding of calculation of the Principals Components with R, I have written some posts before you can read before to continue:
Using R: NIPALS Spectra Reconstruction
Spectra Reconstruction (Adding the Mean Spectra to the Centered ones)

Once we have the "T" (scores matrix), we can check the correlation of the scores (of a certain principal component) with the different constituents. In this case we are going to check the correlation of the PC1, PC2 and PC3 with the constituent Fat.

I have the values for fat in my training data set, the same from which I had developped the PCA , so now it is easy:

cor(T1_nipals,sflw.msc3.tra$G00rmn)    #-0.08544355
cor(T2_nipals,sflw.msc3.tra$G00rmn)    #-0.4114215
cor(T3_nipals,sflw.msc3.tra$G00rmn)    # 0.8096445

We can see than in the first one there are not correlation at all, the second one has a negative correlation, but the third one has a high positive correlation.

If we are use to the patterns of the fat bands in the NIR region, maybe looking to the "P" loading matrix, we can confirm this.

All the loadings have peaks related to fat, because this product has a high fat content (sunflower seed), but the first one seems to have a high correlation with moisture, in the second one the peaks are negative, and in the third positive. Remember when looking to these plots that we are working with centered data.

#Main peaks in PC1:negative: 1455.7 1937.4 positive:2307.2 2347.5

#Main peaks in PC2:negative: 1707.0, 1760.4, 1211.0, 1388.0

#Main peaks in PC3:positive:1725.2, 1760.4 ,1211.0, 1924.4 1391.9

PC1 Loading

PC2 Loading

PC3 Loading

10 feb 2014

Spectra Reconstruction (Adding the Mean Spectra to the Centered ones)

In the post "Using R: NIPALS Spectra Reconstruction ", we reconstruct the original Matrix from the T (score matrix) and the "P" (loadings matrix). Knowing X (the original matrix), was easy to calculate the error matrix "E".

But we have reconstructed the X matrix centeres.

Now we have to add the mean spectrum of the training data set to every spectrum in the X matrix, to have the real spectra.

I called "Xtest" to the matrix : T.t(P) + E, and the spectra is exactly the same than the "Xorig", but centered, because NIPALS center the matrix to develop the algorithm.

Now is simply to add the mean spectrum (sflw.msc3.tra.cent) to every spectrum vector of the "Xorig" matrix.

We can start by the first one:
Xorig_1<-(Xtest[1,]) + sflw.msc3.tra.cent
so the spectrum change from this plot:
matplot(wave.NIR,Xorig[1,],type="l",lty=1,xlab="nm",
+ ylab="log 1/R")

to this one:
matplot(wave.NIR,Xorig_1,type="l",lty=1,xlab="nm",ylab="log 1/R")

9 feb 2014

Using R: NIPALS Spectra Reconstruction

This is a simple exercise. I start wit some expectra in the NIR zone, and firs I wan to run the PCA with the NIPALS algorithm from the package "Chemometrics".
library(chemometrics)

The name of the spectra I´m going to use to develop the Principal Components Analysis is: sflw.msc3.tra$NIRmsc

Now with the NIPALS function:
sflw.msc3.tra_nipals<-nipals(sflw.msc3.tra$NIRmsc,a=10,it=160)
names(sflw.msc3.tra_nipals)

The result fom names are two matrices: T (scores matrix) and P (loading matrix).I´m going to give them another name:
T_nipals<-sflw.msc3.tra_nipals$T
P_nipals<-sflw.msc3.tra_nipals$P

Which are the dimensions of these matrices?
> dim(P_nipals)
[1] 2800   10
> dim(T_nipals)
[1] 75 10
2800 is the number of wavelenths of the spectra
10 is the number of PCs
75 is the number of samples

We can call Xorig to the original matrix (75 rows, 2800 columns)
from our "sflw.msc3.tra$NIRmsc".

As we know:   Xorig = Xrecon + E
beeing             Xrecon = T * t(P)
E is the Residual Matrix.

In R we have to proceed like this:
Xrecon<-T_nipals%*%(t(P_nipals))
Xorig<-scale(sflw.msc3.tra$NIRmsc,center=TRUE,scale=FALSE)
Xorig is the original spectra centered.

So, the Error matrix is:
E<-Xorig-Xrecon

Now we can check that:
Xorig==Xrecon+E
gives a TRUE response for all the values

5 feb 2014

Analyzing Forage Samples (Advices from Robin Malm)

As Robin says we will get always better results drying the samples, because we can make it more homogeneous, after grinding, anyway there is a bigger need to analyze the forage fresh, to get quicker results and decisions.

When working with fresh forage samples (due to the high moisture content), it is important:

1: Take a representative sample.

2: Homogenize the sample correctly.

3: Place it in a large cup configured to acquire different subsamples

4: As soon as you scan the sample, analyze it in the Lab for reference data (the same placed in the large cup) and link the lab values to the spectra.

5: Include the variability of temperature in the model.Temperature in change along the year, so in this case a repeatibility file can help to make the model more stable to this changes.

6: Of course, the quality of the Lab analysis is very important.

4 feb 2014

Statistics for NIR Network Monitoring (SST, SSW and SSB)

I want to share this videos from the Khan Acedemy, because the SST (Total Sum of Squares), SSW (Sum of Squares Within) and SSB (Sum of Square Between) are part of other which will help to develop other formulas trying to understand how a NIR Network performs (evaluating the predicted results from diferent standardized NIR instruments in the Network).

I will continue with this in future posts