R & Chemometrics: febrero 2012

28 feb 2012

CAC - 2012: Excel Chemometrics training

My idea for the future of this blog is to include posts about Chemometrics for Excel, learning from the tutorials from the Russian Chemometrics Society.

I saw this, very interesting, event for this summer:

There are several pre-conference trainings, and one of them, very interesting for this blog in particular:

http://www.cac2012.mke.org.hu/pre-conference-courses.html

Really recommended

PCA for NIR Spectra_part 006: "Mahalanobis"

Outliers have an important influence over the PCs, for this reason they must be detected and examinee.

We have just the spectra without lab data, and we have to check if any of the sample spectra is an outlier ( a noisy spectrum, a sample which belongs to other population,……., an extreme sample for a particular constituent,..).

One way to detect outlier is the “Mahalanobis distance”. And we are going to calculate the Mahalanobis distance to the center of the population.

Previously we have calculated with the NIPALS algorithm the T score matrix, for three PCs.

Let´s calculate now the Mahalanobis distance (Library: Chemometrics) for a certain probability.

> Moutlier(T3pc_nipals,quantile =0.975, plot=TRUE)

$md

[1] 3.4506943 2.0677806 1.8554593 1.3493939 1.0093176 1.4537839 1.7177865

[8] 0.8551981 1.9936152 1.7136093 1.6755040 1.0370122 0.6980203 1.0457686

[15] 1.9107813 2.4836284 1.8483303 1.6226210 1.7422189 2.3278939 2.9407373

[22] 1.0270963 0.8200505 1.3194672 1.0637166 0.7383098 0.9658263 1.1470021

$cutoff

[1] 3.057516

We can see how the sample number “1” is out of the cutoff ratio.

PCA for NIR Spectra_part 005: "Reconstruction"

We saw how to plot the raw spectra (X), how to calculate the mean spectrum, how to center the sprectra (subtracting the mean spectrum from every spectra of the original matrix X). After that we have developed the PCAs with the NIPALS algorithm, getting two matrices: T (scores) and P (loadings).

We have to decide the number of PCs, looking to the plots, or to the numbers (explained variance).

Depending of the numbers of PCs, these matrices will have more or less columns.

With these two matrices we can reconstruct again the X centered matrix, but we´ll get also a residual matrix “E”.

X_c= T.P^t+E

This post just shows this in R:

> P3pc_nipals<-P_nipals[,1:3]

> tP3pc_nipals<-t(P3pc_nipals)

> Xc3pc_reconst<-T3pc_nipals

> Xc3pc_reconst<-T3pc_nipals%*%tP3pc_nipals

> matplot(wavelengths,t(Xc3pc_reconst),lty=1,

+ pch=1,xlab="data_points",ylab="log(1/R)")

> resid3pc<-Xc- Xc3pc_reconst

> matplot(wavelengths,t(resid3pc),lty=1,

+ pch=1,xlab="data_points",ylab="log(1/R)")

We can see the plots of the X centered matrix reconstructed and the plot representing the residual variance or Error matrix “E”. If we add the mean spectrum to every spectra of the centered matrix we will get the X matrix reconstructed.

26 feb 2012

PCA for NIR Spectra_part 004: "Projections"

This plot in 2D, help us to decide the number of PCs, it is easy to create in R, once we have discompose the X matrix into a P matrix (loadings) and a T matrix (scores).

For this plot, we just need the T matrix.

> CPs<-seq(1,10,by=1)

> matplot(CPs,t(Xnipals$T),lty=1,pch=21,

+ xlab="PC_number",ylab="Explained_Var")

Every dot for every vertical line represents the score of a sample for that particular PC. We made the NIPALS calculations for 10 PCs. Every vertical line represents the projections of the samples over that particular PC. The score of a sample for that PC is the distance to the mean.

We can calculate for every PC, the standard deviation for all the scores and the variance.

As we see the firsts 2 PCs represents almost all the variance, and for the rest the projections are becoming narrower.

This plot is good to select how many components to choose, and also to detect outliers, extreme samples,.....

25 feb 2012

PCA for NIR Spectra_part 003: "NIPALS"

> X<-yarn$NIR

> X_nipals<-nipals(X,a=10,it=100)

Two matrices are generated (P and T)

As in other posts, we are going to look to the loadings & scores, for firsts three principal components:

> wavelengths<-seq(1,268,by=1)

> matplot(wavelengths,X_nipals$P[,1:3],lty=1,

+ pch=21,xlab="data_points",ylab="log(1/R)")

> T3cp<-X_nipals$T[,1:3]

> pairs(T3cp)

In the following plot, I compare the loadings plots for the first 3 PCs calculated with SVD (up) and with NIPALS (down):

We can see how the first PC (PC1) has the same shape for both. The other two (PC2 & PC3) has also the same shape, but inverted).

Let´s compare the scores plots:

Red dots for NIPALS, black for SVD:

23 feb 2012

PCA for NIR Spectra_part 002: "Score planes"

The idea of this post is to compare the score plots for the first 3 principal components obtained with the algorithm “svd” with the scores plot of other chemometric software (Win ISI in this case). Previously I had exported the yarn spectra to this software.

Let´s first use the command "pairs", to see in “R” the score plots for the first 3 PCs:

> T<-Xc_svd$u %*% diag(Xc_svd$d)

> T3cp<-T[,1:3]

> pairs(T3cp)

I only choose two planes to compare, but the rest of the planes have the same shapes as the calculated with the “”svd” in “R”. (Blue background: Win ISI score planes).

22 feb 2012

PCA for NIR Spectra_part 001: "Plotting the loadings"

There are different algorithms to calculate the Principal Components (PCs). Kurt Varmuza & Peter Filzmozer explain them in their book: “Introduction to Multivariate Statistical Analysis in Chemometrics”.

I´m going to apply one of them, to the Yarn spectra.

Previously we have to center the X matrix, let´s call it Xc.

> Xc<-scale(yarn$NIR, center = TRUE, scale = FALSE)

The algorithm I´m going to apply is “Singular Value Decomposition”.

> Xc_svd<-svd(Xc)

The idea of this post is just to look to the loadings matrix (P). Loadings are spectra. which reconstruct together with the score matrix (T), and an error matrix (E), the original X matrix.

For this case 3 components in enough, because explain almost 99% of the variance, so let´s have a look to the first three loadings:

> P<-Xc_svd$v

> P3cp<-P[,1:3]

> matplot(wavelengths,P3cp,pch=21,lty=1,

+ xlab="data_points(nm)",ylab="log(1/R)")

21 feb 2012

Exporting Spectra from "R" to other Softwares

This is a single exercise but it is a good practice to check important points about "Import/Export".
The idea was to export the spectra (Yarn) from R to a TXT file. There are different ways to do it, but I used:
> yarn_NIR<-yarn$NIR

> sink("C:\\yarn_NIR.txt")

> yarn_NIR

> sink()

Now other software like in my case Win ISI, have the option to "Import" this data and convert it into their own formats. We can use other software´s like Unscrambler, ...., and of course Excel.
I want to work a little bit with this interaction in future posts. We can use other software to treat the spectra with derivatives (1º,2ª,3ª,4ª) or any other treatment, and import it back in R.
This is the Yarn NIR spectra in Win ISI:

19 feb 2012

Standard Normal Variate (SNV)

This is another pretreatment used quite often in Near Infrared to remove the scatter. It is applied to every spectrum individually.

The average and standard deviation of all the data points for that spectrum is calculated. The average value is substracted from the absorbance for every data point and the result is divided by the standard deviation.
"R" has a function to center and scale every vector which we can use to get the SNV spectrum. Let´s apply this function to our known Yarn NIR data.
> library(pls)
> data(yarn)

> X<-yarn$NIR

> Xt<-t(X)

> Xt_snv<-scale(Xt,center=TRUE,sd.scale=TRUE)

> wavelengths<-seq(1,268,by=1)

> matplot(wavelengths,(Xt_snv),lty=1,pch=21,

+ xlab="data_points(nm)",ylab="log(1/R)")

This is the raw spectra without any treatment

If we apply the SNV treatment the spectra change a little bit to remove some scatter:

If you like this, see also:
Applying "SNV + Detrend" to the spectra with R

17 feb 2012

Plotting the “Mean Spectrum”

Mean spectrum calculation is important:

To center a matrix of spectra, we subtract the mean spectrum, from every spectrum in the matrix.

There are also many options to use the mean spectrum, like average subsamples.

Let´s calculate and plot the mean spectra for the Yarn NIR Data:

> yarn_mean<-colMeans(yarn$NIR)

> wavelength<-c(1:268)

> matplot(wavelength,yarn_mean,lty=1,pch=21,col="red",

+ xlab="data_points",ylab="mean spec.")

We can see in the following plot the X matrix centered:

16 feb 2012

NIR "Cross Validaton Statistics" with "R"

We have to check different options before to decide for one model:

Configure different cross validations.

Configure different math treatments.

Configure number of terms.

With the Yarn NIR data, I have develop 4 models, for a simple exercise.

Of course we can check many combinations.

As math treatment I choose the raw spectra and the spectra treated with MSC. Other treatment as SNV, derivatives... can be applied.

>gas1.loo<-plsr(octane~(NIR),ncomp=10,

+ data=gasTrain,validation="LOO")

>gas1msc.loo<-plsr(octane~msc(NIR),

+ ncomp=10,data=gasTrain,validation="LOO")

>gas1.cv5<-plsr(octane~(NIR),ncomp=10,

+ data=gasTrain,validation="CV",

+ segment.type="consecutive",length.seg=5)

>gas1msc.cv5<-plsr(octane~msc(NIR),ncomp=10,

+ data=gasTrain,validation="CV",

+ segment.type="consecutive",length.seg=5)

Now let´s have a look to the statistics:

We have the cross validation error (CV) for each model for the different number of components (terms), and the CV error bias corrected (adjCV).

Which is the best option?

We have to take in account some tips.

We don´t know the error of the reference method, but in theory the Model Error cannot be better than the "laboratory error" for this constituent.

A Model works normally better with fewer terms (more robust). It does not make any good to improve a few, in the cross validation statistic, by adding one or more terms.

It´s good to have an independent test set apart from the CV to decide the best option.

Cross Validation is an important tool to decide the number of terms and not commit “over-fitting” or “under-fitting”.

As we can see there are not a great improvement applying MSC in this case, that is an indication that there are little scatter in this data.

15 feb 2012

"NIR Std. Dev. Spectra" with "R"

It is always good to look at the spectra from different points of view, before to develop a regression, this will help us to understand better our samples, to detect outliers, to check where the variability is, if that variability correlates with the constituent of interest (directly or inverse),…..

Chemometric software’s have the tools to do it.

A single tool to check our spectra is the standard deviation spectrum where every data point has the standard deviation value for the absorbances at every wavelength,. This SD spectrum is a nice way to see where the variability is, so we can understand better the importance of the wavelengths in other calculations as the principal components, loadings,….

In the previous post we have seen how to apply MSC (Multiple Scatter Correction) to the NIR Spectra (Yarn) with "R".

To understand better the effect of this math treatment, we can have a look to the standard deviation spectra.

For this post, the idea is a simple exercise to over plot the “standard deviation spectra” of the Yarn NIR Data (from the PLS package) "with and without" the MSC (Multiple Scatter Correction) applied.

> sd_spec<-sd(yarn$NIR)

> yarn_msc<-msc(yarn$NIR)

> sd_mscspec<-sd(yarn_msc)

> sd_overplot<-cbind(sd_spec,sd_mscspec)

> wavelength<-c(1:268)

> matplot(wavelength,sd_overplot,lty=1,pch=21,

+ xlab="data_points",ylab="std dev")

12 feb 2012

"R" PLS Package: Multiple Scatter Correction (MSC)

MSC (Multiple Scatter Correction) is a Math treatment to correct the scatter in the spectra. The scatter is produced for different physical circumstances as particle size, packaging.

Normally scatter make worse the correlation of the spectra with the constituent of interest.

Almost all the chemometric software’s available include this math treatment and of course “R” have it as well in the “PLS Package”.

Following the Journal of Statistical Software (January 2007, Volume 18, Issue 2) there is a nice tutorial about how to use the PLS Package (Bjorn-Helge Mevik & Ron Wehrens).

The idea of this post is just compare graphically the spectra without any treatment and with the MSC treatment using the Near Infrared Data “yarn” from “R”.

> yarn

> wavelengths<-seq(1,268,by=1)

> matplot(wavelengths,t(yarn$NIR),lty=1,pch=21

+,xlab="data_points(nm)",ylab="log(1/R)")

This is the spectra of the 28 samples without any treatment:

> Ztrain<-msc(yarn$NIR)

> colnames(Ztrain)<-c(1:268)

> matplot(wavelengths,t(Ztrain),lty=1,pch=21,

+ xlab="data_points(nm)",ylab="log(1/R)")

This is the spectra of the 28 samples with MSC applied:

We can see the differences in the log(1/R) scale (Y axis).

10 feb 2012

"NIR-Quimiometria" with "R - Bloggers" .

“NIR Quimiometria”, join the “R “community represented by: “R-Bloggers”.

It is really an important event for this Blog.

9 feb 2012

"R": Predicting a Test Set (Gasoline)

> data(gasoline)
> #60 spectra of gasoline (octane is the constituent)
> #We divide the whole Set into a Train Set and a Test Set.
> gasTrain<-gasoline[1:50,]
> gasTest<-gasoline[51:60,]
> #Let´s develop the PLSR with the Tain Set and LOO CV
> gas1<-plsr(octane~NIR,ncomp=10,data=gasTrain,validation="LOO")
> summary(gas1)
Data:   X dimension: 50 401
        Y dimension: 50 1
Fit method: kernelpls
Number of components considered: 10
VALIDATION: RMSEP
Cross-validated using 50 leave-one-out segments.
       (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV           1.545    1.357   0.2966   0.2524   0.2476   0.2398   0.2319
adjCV        1.545    1.356   0.2947   0.2521   0.2478   0.2388   0.2313
       7 comps 8 comps 9 comps 10 comps
CV      0.2386   0.2316   0.2449    0.2673
adjCV   0.2377   0.2308   0.2438    0.2657
TRAINING: % variance explained
        1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
X         78.17    85.58    93.41    96.06    96.94    97.89    98.38    98.85
octane    29.39    96.85    97.89    98.26    98.86    98.96    99.09    99.16
        9 comps 10 comps
X         99.02     99.19
octane    99.28     99.39
> #For this exercice we decide 3 components
> #Let´s predict our Test Set with this 3 components Model.
> predict(gas1,ncomp=3,newdata=gasTest)
, , 3 comps     octane
51 87.94907
52 87.30484
53 88.21420
54 84.86945
55 85.24244
56 84.57502
57 87.37650
58 86.78971
59 89.10282
60 86.97223
> #To Plot these data:
>predplot(gas1,ncomp=3,newdata=gasTest,asp=1,line=TRUE)

> #Let´s look to the RMSEP Statistic.This is very nice tool to decide if 3 components is fine or we can choose more or less components.
> RMSEP(gas1,newdata=gasTest)
(Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps
     1.5369       1.1696       0.2445       0.2341       0.3287       0.2780
    6 comps      7 comps      8 comps      9 comps     10 comps
     0.2703       0.3301       0.3571       0.4090       0.6116
> #It´s fine, we can also consider to choose only two.The RMSEP is 0,234.
> #The CV for the Model with 3 components was: 0,252.
> #Really R is a wonderful tool to develop regressions, and to    understand better all what is behind the algorithms.
> #We can get a lot of literature on internet to start working with R.
> #Thanks to Bjorn-Helge Mevik & Ron Wehres for their good   tutorials about the PLS Package, they help me to understand better this program and to continue learning,(I have ordered some books).

8 feb 2012

Adrian Belew: Progressive Guitar Player

It´s allways nice to hear Adrian.
I do quite often.

"R": PLS Regression (Gasoline) - 005

Let´s see know how to plot the scores for the 3 PLS Components:

We can see the explained variance from each component in the diagonal.

We can get it from R with:

> explvar(gas1)

Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6

70.9656438 7.5943956 7.5871843 9.2537926 0.7201960 0.8472951

Comp 7 Comp 8 Comp 9 Comp 10

0.3538649 0.7810986 0.2184760 0.3878373

We can change “scores” for “ loadings”, and get the plot of the 3 loadings together:

or one by one:

We can plot also the regression coefficients spectrum, or to see the values in numbers:

> coef(gas1,comp=3)

octane

900 nm 0.1491498386

902 nm 0.1468577061

904 nm 0.1463829295

906 nm 0.1478201762

908 nm 0.1511718253

910 nm 0.1521591068

912 nm 0.1756188138

914 nm 0.1561727797

916 nm 0.1580512737

918 nm 0.1342321544

920 nm 0.1140515422

922 nm 0.1154623941

.....................

Bibliography:

Tutorials PLS Package for "R" :

Bjorn-Helge Mevik

Norwegian University of Life Sciences

Ron Wehrens

Radboud University Nijmegen

R Help Web page

7 feb 2012

"R": PLS Regression (Gasoline) - 004

In the previous post we plot the Cross Validation predictions with:

> plot(gas1, ncomp = 3, asp = 1, line = TRUE)

We can plot the fitted values instead with:

> plot(gas1, ncomp = 3, asp = 1, line = TRUE,which=train)

Graphics are different:

Of course, using "train" we get overoptimisc statistics and we should look better at the Cross Validation or to an independant test set for validation.
We decided 3 components to develop the PLS Regressions looking to the RMSEP plot. We can use other plots as the MSEP plot (changing RMSEP for MSEP), or to the RSQ plot.

> plot(R2(gas1),legendpos = "topright")

We can see how after the 3th component becomes almost flat.

We can see it better with numbers:

> R2(gas1)

(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps

-0.03419 0.23374 0.93684 0.97111 0.97474 0.97474

6 comps 7 comps 8 comps 9 comps 10 comps

0.97713 0.97914 0.97742 0.97453 0.97413

We can see also the residuals in "R" for the different number of component (1, 2,...,10). In these values the calculation for the statistics are based.
These are the residual values for the PLSR model with 3 components:

octane

1 0.100769634

2 0.369121232

3 0.251715938

4 -0.209263557

5 -0.473107996

6 0.158305081

7 0.080218313

8 -0.141445641

9 -0.099252992

10 -0.077775217

11 0.561603527

12 0.488456018

13 -0.023514480

14 -0.106796820

15 -0.015477061

16 0.010451476

17 0.547102944

18 0.215613857

19 -0.290225797

20 0.238646916

21 0.115224011

22 -0.219819205

23 -0.040436420

24 -0.313450043

25 -0.161174139

26 0.065222607

27 0.032299933

28 -0.120728914

29 -0.394899511

30 -0.116389549

31 -0.242168963

32 -0.100928743

33 -0.003314534

34 0.152746720

35 0.092815472

36 0.029039668

37 0.020761125

38 0.339468953

39 0.019163788

40 0.192727538

41 -0.077437540

42 -0.267717370

43 0.161465598

44 0.101965851

45 -0.022411411

46 -0.322253768

47 -0.272445813

48 -0.151183595

49 0.063073375

50 -0.001254795

51 0.008358151

52 0.297159695

53 0.015659145

54 0.033326901

55 -0.141411827

56 0.280361574

57 -0.491022823

58 -0.332150710

59 0.269220723

60 -0.082606528

Bibliography:

Tutorials PLS Package for "R" :

Bjorn-Helge Mevik

Ron Wehrens

R Help Web page

Radboud University Nijmegen

Norwegian University of Life Sciences