31 mar 2012

VIDEO: "R" Checking the reference values ("Y" Matrix).

30 mar 2012

Comparing Spectra with different math treatments

It is amazing the quantity of graphics you can develop with R, and how you can show and manage these graphics. Here in the same plot we compare the raw demo spectra, treated with MSC, SNV and with the first derivative (differences between consecutive data points). We can see how SNV and MSC look similar, but if you look to the Y axis, they are different.

In the first derivative spectra there is a big peak of noise at 1100 nm. At this wavelength the detector change from “Si” to “PbSn”, so the difference in absorbance’s is quite high (from 1098 to 1100), so we have to exclude this area if we develop the calibration with the full spectrum range.

Using: par(mfrow=c(2,2)) opens a window, where we plot our spectra with the four different treatments.

29 mar 2012

Dividing the Sample Set in two (Validation & Training)

We have in the Demo sample set “66” samples. In this post we´ll see one way to divide the set in two parts: one for “Validation” and another for Training or Calibration.

The selection will be random. And we are going to use the command: “sample”. I decided to select 10 samples for validation, and the rest for training.

demo_raw_val<-demo_raw[sample(66,10),]

If you repeat this sentence several times, you will get different sets every time.

In my case the samples selected are:

Samples: 25,50,8,49,39,12,16,63,35 y 41

These samples are in rows, and we have to create a training set removing them:

demo_raw_train<-demo_raw[c(-25,-50,-8,-49,-39,-12,-16,-63,-35,-41),]

We will create the same sample sets for the other data frame with math treatments:

demo_msc_train<-demo_msc[c(-25,-50,-8,-49,-39,-12,-16,-63,-35,-41),]

demo_snv_train<-demo_snv[c(-25,-50,-8,-49,-39,-12,-16,-63,-35,-41),]

demo_msc_val<-demo_msc[c(25,50,8,49,39,12,16,63,35,41),]

demo_snv_val<-demo_msc[c(25,50,8,49,39,12,16,63,35,41),]

It is important to look to the summary of the sample sets to check and compare the statistics for the different constituents.

Or to look to the distribution plots, like in this case for moisture:

28 mar 2012

SCRIPT for NIR "DEMO" Tutorial - 001

########### RAW NIR Demo Data##################################

demo<-read.table("demo.txt",header=TRUE)

VIS<-demo[,6:355]

VIS<-as.matrix(VIS)

NIR<-demo[,356:1055]

NIR<-as.matrix(NIR)

ALL<-demo[,6:1055]

ALL<-as.matrix(ALL)

Protein<-demo[,1]

Fat<-demo[,2]

Ash<-demo[,3]

DM<-demo[,4]

Moisture<-demo[,5]

demo_raw<-data.frame(Protein=I(Protein),Fat=I(Fat),Ash=I(Ash),DM=I(DM),

+ Moisture=I(Moisture),VIS=I(VIS),NIR=I(NIR),ALL=I(ALL))

wave_vis<-seq(400,1098,by=2)

wave_nir<-seq(1100,2498,by=2)

wave_all<-seq(400,2498,by=2)

matplot(wave_vis,t(demo_raw$VIS),lty=1,pch=".",xlab="nm",ylab="log(1/R)")

matplot(wave_nir,t(demo_raw$NIR),lty=1,pch=".",xlab="nm",ylab="log(1/R)")

matplot(wave_all,t(demo_raw$ALL),lty=1,pch=".",xlab="nm",ylab="log(1/R)")

########### MSC (Multiple Scatter Correction) ##########################

library(ChemometricsWithR)

VIS_msc<-msc(NIR)

VIS_msc<-as.matrix(VIS_msc)

NIR_msc<-msc(NIR)

NIR_msc<-as.matrix(NIR_msc)

ALL_msc<-msc(ALL)

ALL_msc<-as.matrix(ALL_msc)

demo_msc<-data.frame(Protein=I(Protein),Fat=I(Fat),Ash=I(Ash),DM=I(DM),

+ Moisture=I(Moisture),VIS=I(VIS_msc),NIR=I(NIR_msc),ALL=I(ALL_msc))

matplot(wave_vis,t(demo_msc$VIS),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_nir,t(demo_msc$NIR),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_all,t(demo_msc$ALL),lty=1,pch=".",xlab="nm",ylab="log(1/R)")

########### SNV (Standard Normal Variate )####################################

tVIS<-t(VIS)

tNIR<-t(NIR)

tVISsnv<-scale(tVIS,center=TRUE,scale=TRUE)

tNIRsnv<-scale(tNIR,center=TRUE,scale=TRUE)

VIS_snv<-t(tVISsnv)

VIS_snv<-as.matrix(VIS_snv)

NIR_snv<-t(tNIRsnv)

NIR_snv<-as.matrix(NIR_snv)

tALL<-t(ALL)

tALLsnv<-scale(tALL,center=TRUE,scale=TRUE)

ALL_snv<-t(tALLsnv)

ALL_snv<-as.matrix(ALL_snv)

demo_snv<-data.frame(Protein=I(Protein),Fat=I(Fat),Ash=I(Ash),DM=I(DM),
+ Moisture=I(Moisture),VIS=I(VIS_snv),NIR=I(NIR_snv),ALL=I(ALL_snv))

matplot(wave_nir,t(demo_snv$NIR),lty=1,pch=".",xlab="nm",ylab="log(1/R)")

matplot(wave_vis,t(demo_snv$VIS),lty=1,pch=".",xlab="nm",ylab="log(1/R)")

matplot(wave_all,t(demo_snv$ALL),lty=1,pch=".",xlab="nm",ylab="log(1/R)")

27 mar 2012

Hyperspec "R" package: Future Plans

Interesting Information from Tal Galini (R Bloggers)

Google summer of code 2012 – and R – a call for students

Project	Description
hyperSpec	Improve R’s capabilities of working with spectroscopic and hyperspectral data

hyperSpec: Future plans

Link to general R GSoC 2012 page

Summary: hyperSpec is a package for working with spectra (like NMR, Infrared, Raman, Fluorescence, UV/VIS, X-ray diffraction) Links: http://hyperspec.r-forge.r-project.org/ and hyperspec

VIDEO: Applying "SNV" math-treatment to our raw spectra in "R".

25 mar 2012

VIDEO: Applying "MSC" math-treatment to our raw spectra in "R".

24 mar 2012

VIDEO:Looking to "NIR" Spectra in "R": (Import and organize)

22 mar 2012

Export from ISI Scan / Importing into "R"

Some software’s like ISI Scan (Infrasoft International) have a nice feature to export in an ASCII file the spectral data. Every time we analyze a sample in routine, one file with the spectral data of that sample is sent to a folder we have configure previously.

This post was written after checking this point and after that I have imported the spectrum into “R”.

To configure the product to export spectra:

After some preparation of the file and some work in "R", we can see and work with the spectra. We have seen in previous post how to read file into R with for example: "read.table".

21 mar 2012

Win ISI (Demo.CAL) imported in "R"

I like to work with “R” (with the different chemometrics packages) and compare the results with other software’s, like in this case Win ISI.

Win ISI has a demo file, acquired in an instrument with a wavelength range from 400-2498 nm.

Spectra can be exported from Win ISI and imported into “R” with “read.table”, for example.

Data frame is generated and the raw spectra are ready to run Chemometrics.

20 mar 2012

Excel: "Vincular una Matriz a un rango de celdas"

Savitzky-Golay in "VISION"

I had been explaining how to develop this nice pre-treatment in “R” (see previous posts):
Savitzky-Golay filters in "R"
Applying Savitzky-Golay filters in “R”

, but let’s see how a commercial Chemometric Software as VISION, show the configuration window:

As we can see we have to configure the same parameters than in “R”, Matlab, or any other software which contains SG filters.

Applying Savitzky-Golay filters in “R”

When applying SG, we select a moving average window with an odd value “n” for the number of data points. SG fit a polynomial of “p” degree to this data points and give the value to the central point (this is the reason to have an odd value).

We apply also an smooth in the case of “m” = 0, or the first (m=1), second (m=2) or third (m=3) derivatives.

Apply a Savitzky-Golay smoothing filter

Description

Smooth data with a Savitzky-Golay smoothing filter.

Usage

sgolayfilt(x, p = 3, n = p + 3 - p%%2, m = 0, ts = 1)

## S3 method for class 'sgolayFilter'

filter(filt, x, ...)

Arguments

x	signal to be filtered.
p	filter order.
n	filter length (must be odd).
m	return the m-th derivative of the filter coefficients.
ts	time scaling factor.
filt	filter characteristics (normally generated by sgolay).

Playing with these three parameter we change the shape of the spectra.

Example:

The upper spectra was taken with p=2, n=7 and m=2

The lower one with p=2, n=31 and m=2

As we can see the size of the window has a big influence on the resolution of the peaks.

How to apply it in R:

(For the lower spectrum)

>sgolay_2_31_2<-apply(fattyac$NITm,1,sgolayfilt,p=2,n=31,m=2)

>matplot(wavelengths,sgolay_2_31_2,lty=1,pch=21,

+ xlab="nm",ylab="SG_2_31_2_abs")

17 mar 2012

Savitzky-Golay filters in "R"

Derivatives are a good way to extract information to our spectra. As we know NIR contents overlapping bands, and spectra must be treated with math operations in order to extract as much information as possible and to correlate it with the constituent of interest in the case of quantitative analysis or to discriminate two similar products.

Derivatives are one of those math treatments.

With “R”, we can apply a Savitzky Golay smoothing filter to apply derivatives to our spectra, and to select the order of the derivative.

Let´s have a look first to the R help:

Apply a Savitzky-Golay smoothing filter

Description

Smooth data with a Savitzky-Golay smoothing filter.

Usage

sgolayfilt(x, p = 3, n = p + 3 - p%%2, m = 0, ts = 1)

## S3 method for class 'sgolayFilter'

filter(filt, x, ...)

Arguments

x	signal to be filtered.
p	filter order.
n	filter length (must be odd).
m	return the m-th derivative of the filter coefficients.
ts	time scaling factor.
filt	filter characteristics (normally generated by sgolay).
...	additional arguments (ignored).

Let’s apply the default values for SG filter to the “fattyac” data (you can do the same for the “yarn” data or “gasoline” data).

But first remember how this data looks before any math treatment was applied:

Now let’s give to m, the values 1 (First Derivative) and 2 (Second Derivative)

Here we have a powerful tool, to continue progressing doing Chemometrics with R.

Bibliography:

Chemometrics with R (Ron Wehrens)

R help page

16 mar 2012

VIDEO: Matrices en Excel_parte: 002

15 mar 2012

VIDEO: Matrices en Excel_parte: 001

14 mar 2012

NIT: Fatty acids study in R - Part 007

Once we have chosen the model, we can continue acquiring spectra of new samples. Spectra is exported to a txt or csv file and we imported in R to be reprocessed.
We use the function “predict” from the PLS package. I have done this with 20 new samples. We need first to apply to them the adequate math treatment (same as the used in the model). I call this sample set for prediction “fatt2ac_val”, after apply the “msc” math treatment.
So, let´s see the predictions:

> predict(C16_0,ncomp=12,newdata=fatty2ac_val)

, , 12 comps

C16_0

220 22.01807

221 20.44803

222 19.79991

223 21.64232

224 20.29058

225 20.20099

226 21.52053

227 19.83305

228 18.95492

229 21.39239

230 21.11044

231 20.67454

232 19.28662

233 20.97292

234 21.70614

235 20.27464

236 19.70897

237 21.30686

238 20.21069

239 19.21576

I have used the model with 12 terms.
If this data has more than the spectra, (the Lab values) we can also validate and to check the number of terms to use.

> RMSEP(C16_0,newdata=fatty2ac_val,ncomps=12)

(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps

1.6280 1.5987 1.2554 1.1071 1.3447 0.9122

6 comps 7 comps 8 comps 9 comps 10 comps 11 comps

0.8754 0.8413 0.6597 0.6154 0.5669 0.5791

12 comps 13 comps 14 comps 15 comps 16 comps

0.5935 0.6261 0.5994 0.6315 0.5787

We can see our RMSE error and compare it with the RMSEP obtained in the PLSR “LOO validation statistics” which was 0,5733.We can see also that we would get even lower values for validation with a lower number of components (10).
It seems that the is working almost as expected, but let´s have a look to the plots:

>predplot(C16_0,ncomp=7:15,newdata=fatty2ac_val,

+ asp=1,line=TRUE)

I can see a Bias, of almost 0,50.

The error corrected by the Bias would be 0.34.

Bias can be due to diferent reasons (temperature, sample presentation,particle size,optical path,state of the instrument,...).

This samples are fat triturated as a paste, and put it in a petri dish, and place it in an instrument in transmitance.

The next step would be to add this data to the data base and recalibrate.

Divide the data into a Calibration and a Validation Set. Be sure that in the validation set there are some samples of this last set.

The idea with all this series of post related to this data set was to work with data I used almost daily in my job, and I wanted to see how to proceed in R, and once you get use to it the results are very good for the understanding of chemometrics for multivariate data. I will continue exporting some of my data sets to work in R. The analysis of Fatty Acids by NIR/NIT can give good results for some of them (C16:0, C18:0, C18:1,..).

12 mar 2012

NIT: Fatty acids study in R - Part 006

In one of the columns, for constituent C16_0, one sample (57) has a value of “zero” (we could see this in the histogram).The reason for that is that the laboratory did not supply this value. The PLS regression will consider the lab value as cero, so we will get a plot like this:

I observed also that the sample 219 has a high residual for the regressions of all the constituents, so I decided to remove these two samples from the sample set in order to continue, and to develop the models.
I am starting with R, so I will appreciate if you add comments in order to do this task in a simpler way.
I create two sample sets, in order to remove these two samples (219 and 57):

> fattyac1<-fattyac_msc[1:56,]

> fattyac2<-fattyac_msc[58:218,]

and I combined this three sets again:

> fattyac_msc1<-rbind(fattyac1,fattyac2)

Well, I can develop my regression now:

Now we have to take the decision of how many terms to choose. Let´s see the validation plot with 7 and 12 components (terms).

plot(C16_0,ncomp=7,which="validation")

abline(0,1,col="red")

plot(C16_0,ncomp=12,which="validation")

abline(0,1,col="red")

It is clear that the decision to choose one model or the other will have a great influence in the predictions. We need a validation set to make a better decision. But I think that it will work better with 12 terms.
It will be important, if possible to find samples with C16:0 values bellow 18 to add to our database in order to develop a better model.
Another decision could be to keep out this extreme sample until we find more, but we can decide to leave it, in order to extrapolate better in this zone.
It is important not to have unique samples in the model. In this case we have one. We have to consider this.

If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.

9 mar 2012

NIT: Fatty acids study in R - Part 005 (Calibration)

There are several algorithms to run a PLS regression (I recommend to consult the books: “Introduction to Multivariate Analysis in Chemometrics - Kurt Varmuza & Peter Filzmozer” and “Chemometrics with R – Ron Wehrens”).
We are going to use the PLS package, and we are going to develop, maybe the constituent which looks more promising: Oleic Acic (C18:1).
Of course we are going to use the MSC pretreatment. For vcross validation we are going to use “leave one out”.
I decide a maximum of 16 PLS terms.

[1] C18_1_reg<- plsr(C18_1~NITmsc, ncomp = 16,data =fattyac_msc,
+ validation = "LOO")

Looking to the Cross Validation statistics, it seems that 12 is the best number of terms to use, anyway lets see the plots.

par(mfrow=c(1,2))
plot(C18_1_reg,"validation",estimate="CV")
abline(v=12,col="red")
plot(C18_1_reg,ncomp=12,"prediction")
abline(0,1,col="red")

We can see how after Term 12 the RMSEP increase.
We can see in the XY plot, how we have a few samples (like the sample 219) with a high residual (probably wrong lab value in some cases). If you prefer to see the residual plot, where you can see sample 219 in the upper right corner:

If we check the residual list (C18_1_reg$residuals),the extreme sample 66, fits well in the model, so we decide to keep it. No we can remove sample 219 from the sample training set removing this row from the data frame.
C18_1_reg<- plsr(C18_1~NITmsc,ncomp = 16,
+ data =fattyac1_msc,validation = "LOO")
We repeat steo [1] again, but thid time we change "fattyas_msc" for "fattyas1_msc", and look to the plots and statistics again. In this case I overplot the plots with the outlier (red colour) and without the outlier (Blue):

We keep this last model and we will test it in the future with new samples (independent test set).

If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.

7 mar 2012

NIT: Fatty acids study in R - Part 004

We already know that sample 66 has a MD of 11.6. Let´s see the values for the six constituents for this sample:

> fattyac_msc[66,1:6]

C16_0 C16_1 C18_0 C18_1 C18_2 C18_3

66 15.8 2 6 62.3 10.2 0.6

Let´s compare with the summary

> summary(fattyac_msc)

C16_0 C16_1

Min. : 0.00 Min. :1.500

1st Qu.:20.10 1st Qu.:2.000

Median :21.00 Median :2.200

Mean :21.34 Mean :2.267

3rd Qu.:22.90 3rd Qu.:2.500

Max. :26.00 Max. :3.500

C18_0 C18_1

Min. : 5.800 Min. :43.80

1st Qu.: 8.600 1st Qu.:51.95

Median : 9.400 Median :54.50

Mean : 9.711 Mean :53.93

3rd Qu.:10.500 3rd Qu.:56.15

Max. :14.000 Max. :62.30

C18_2 C18_3

Min. : 5.500 Min. :0.3000

1st Qu.: 7.600 1st Qu.:0.5000

Median : 8.500 Median :0.6000

Mean : 8.503 Mean :0.6032

3rd Qu.: 9.100 3rd Qu.:0.7000

Max. :14.700 Max. :1.3000

Sample 66 has the higher value for C18:1 (oleic acid), but it is not isolated in the histogram. For some reasons this sample differs from the others especially from 100 to 1050 nm. We will wait forward to take a decision about this sample.
Until now we have been managing with the X matrix.
Now we start to study the Y matrix. First thing to do is to have a look to the summary, and of course to the histograms.
If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.

> hist(C16_0,col="red")

> hist(C16_1,col="blue")

> hist(C18_0,col="green")

> hist(C18_1,col="brown")

> hist(C18_2,col="violet")

> hist(C18_3,col="orange")