31 mar 2012
30 mar 2012
Comparing Spectra with different math treatments
It is amazing the quantity of graphics you can develop with R, and how you can show and manage these graphics. Here in the same plot we compare the raw demo spectra, treated with MSC, SNV and with the first derivative (differences between consecutive data points). We can see how SNV and MSC look similar, but if you look to the Y axis, they are different.
In the first derivative spectra there is a big peak of noise at 1100 nm. At this wavelength the detector change from “Si” to “PbSn”, so the difference in absorbance’s is quite high (from 1098 to 1100), so we have to exclude this area if we develop the calibration with the full spectrum range.
Using: par(mfrow=c(2,2)) opens a window, where we plot our spectra with the four different treatments.
29 mar 2012
Dividing the Sample Set in two (Validation & Training)
We have in the Demo sample set “66” samples. In this post we´ll see one way to divide the set in two parts: one for “Validation” and another for Training or Calibration.
The selection will be random. And we are going to use the command: “sample”. I decided to select 10 samples for validation, and the rest for training.
demo_raw_val<-demo_raw[sample(66,10),]
If you repeat this sentence several times, you will get different sets every time.
In my case the samples selected are:
Samples: 25,50,8,49,39,12,16,63,35 y 41
These samples are in rows, and we have to create a training set removing them:
demo_raw_train<-demo_raw[c(-25,-50,-8,-49,-39,-12,-16,-63,-35,-41),]
We will create the same sample sets for the other data frame with math treatments:
demo_msc_train<-demo_msc[c(-25,-50,-8,-49,-39,-12,-16,-63,-35,-41),]
demo_snv_train<-demo_snv[c(-25,-50,-8,-49,-39,-12,-16,-63,-35,-41),]
demo_msc_val<-demo_msc[c(25,50,8,49,39,12,16,63,35,41),]
demo_snv_val<-demo_msc[c(25,50,8,49,39,12,16,63,35,41),]
It is important to look to the summary of the sample sets to check and compare the statistics for the different constituents.
Or to look to the distribution plots, like in this case for moisture:
28 mar 2012
SCRIPT for NIR "DEMO" Tutorial - 001
########### RAW NIR Demo Data##################################
demo<-read.table("demo.txt",header=TRUE)
VIS<-demo[,6:355]
VIS<-as.matrix(VIS)
NIR<-demo[,356:1055]
NIR<-as.matrix(NIR)
ALL<-demo[,6:1055]
ALL<-as.matrix(ALL)
Protein<-demo[,1]
Fat<-demo[,2]
Ash<-demo[,3]
DM<-demo[,4]
Moisture<-demo[,5]
demo_raw<-data.frame(Protein=I(Protein),Fat=I(Fat),Ash=I(Ash),DM=I(DM),
+ Moisture=I(Moisture),VIS=I(VIS),NIR=I(NIR),ALL=I(ALL))
wave_vis<-seq(400,1098,by=2)
wave_nir<-seq(1100,2498,by=2)
wave_all<-seq(400,2498,by=2)
matplot(wave_vis,t(demo_raw$VIS),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_nir,t(demo_raw$NIR),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_all,t(demo_raw$ALL),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
########### MSC (Multiple Scatter Correction) ##########################
library(ChemometricsWithR)
VIS_msc<-msc(NIR)
VIS_msc<-as.matrix(VIS_msc)
NIR_msc<-msc(NIR)
NIR_msc<-as.matrix(NIR_msc)
ALL_msc<-msc(ALL)
ALL_msc<-as.matrix(ALL_msc)
demo_msc<-data.frame(Protein=I(Protein),Fat=I(Fat),Ash=I(Ash),DM=I(DM),
+ Moisture=I(Moisture),VIS=I(VIS_msc),NIR=I(NIR_msc),ALL=I(ALL_msc))
matplot(wave_vis,t(demo_msc$VIS),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_nir,t(demo_msc$NIR),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_all,t(demo_msc$ALL),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_nir,t(demo_msc$NIR),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_all,t(demo_msc$ALL),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
########### SNV (Standard Normal Variate )####################################
tVIS<-t(VIS)
tNIR<-t(NIR)
tVISsnv<-scale(tVIS,center=TRUE,scale=TRUE)
tNIRsnv<-scale(tNIR,center=TRUE,scale=TRUE)
VIS_snv<-t(tVISsnv)
VIS_snv<-as.matrix(VIS_snv)
NIR_snv<-t(tNIRsnv)
NIR_snv<-as.matrix(NIR_snv)
tALL<-t(ALL)
tALLsnv<-scale(tALL,center=TRUE,scale=TRUE)
ALL_snv<-t(tALLsnv)
ALL_snv<-as.matrix(ALL_snv)
demo_snv<-data.frame(Protein=I(Protein),Fat=I(Fat),Ash=I(Ash),DM=I(DM),
+ Moisture=I(Moisture),VIS=I(VIS_snv),NIR=I(NIR_snv),ALL=I(ALL_snv))
+ Moisture=I(Moisture),VIS=I(VIS_snv),NIR=I(NIR_snv),ALL=I(ALL_snv))
matplot(wave_nir,t(demo_snv$NIR),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_vis,t(demo_snv$VIS),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_all,t(demo_snv$ALL),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
27 mar 2012
Hyperspec "R" package: Future Plans
Interesting Information from Tal Galini (R Bloggers)
Project
|
Description
|
Improve R’s capabilities of working with spectroscopic and hyperspectral data
|
hyperSpec: Future plans
Link to general R GSoC 2012 page
Summary: hyperSpec is a package for working with spectra (like NMR, Infrared, Raman, Fluorescence, UV/VIS, X-ray diffraction) Links: http://hyperspec.r-forge.r-project.org/ and hyperspec
25 mar 2012
24 mar 2012
22 mar 2012
Export from ISI Scan / Importing into "R"
Some software’s like ISI Scan (Infrasoft International) have a nice feature to export in an ASCII file the spectral data. Every time we analyze a sample in routine, one file with the spectral data of that sample is sent to a folder we have configure previously.
This post was written after checking this point and after that I have imported the spectrum into “R”.
To configure the product to export spectra:
21 mar 2012
Win ISI (Demo.CAL) imported in "R"
I like to work with “R” (with the different chemometrics packages) and compare the results with other software’s, like in this case Win ISI.
Win ISI has a demo file, acquired in an instrument with a wavelength range from 400-2498 nm.
Spectra can be exported from Win ISI and imported into “R” with “read.table”, for example.
Data frame is generated and the raw spectra are ready to run Chemometrics.
20 mar 2012
Savitzky-Golay in "VISION"
I had been explaining how to develop this nice pre-treatment in “R” (see previous posts):
Savitzky-Golay filters in "R"
Applying Savitzky-Golay filters in “R”
Savitzky-Golay filters in "R"
Applying Savitzky-Golay filters in “R”
, but let’s see how a commercial Chemometric Software as VISION, show the configuration window:
As we can see we have to configure the same parameters than in “R”, Matlab, or any other software which contains SG filters.
Applying Savitzky-Golay filters in “R”
When applying SG, we select a moving average window with an odd value “n” for the number of data points. SG fit a polynomial of “p” degree to this data points and give the value to the central point (this is the reason to have an odd value).
We apply also an smooth in the case of “m” = 0, or the first (m=1), second (m=2) or third (m=3) derivatives.
Apply a Savitzky-Golay smoothing filter
Description
Smooth data with a Savitzky-Golay smoothing filter.
Usage
sgolayfilt(x, p = 3, n = p + 3 - p%%2, m = 0, ts = 1)
## S3 method for class 'sgolayFilter'
filter(filt, x, ...)
Arguments
x | signal to be filtered. |
p | filter order. |
n | filter length (must be odd). |
m | return the m-th derivative of the filter coefficients. |
ts | time scaling factor. |
filt | filter characteristics (normally generated by sgolay). |
Playing with these three parameter we change the shape of the spectra.
Example:
The upper spectra was taken with p=2, n=7 and m=2
The lower one with p=2, n=31 and m=2
As we can see the size of the window has a big influence on the resolution of the peaks.
How to apply it in R:
(For the lower spectrum)
>sgolay_2_31_2<-apply(fattyac$NITm,1,sgolayfilt,p=2,n=31,m=2)
>matplot(wavelengths,sgolay_2_31_2,lty=1,pch=21,
+ xlab="nm",ylab="SG_2_31_2_abs")
Tweet
17 mar 2012
Savitzky-Golay filters in "R"
Derivatives are a good way to extract information to our spectra. As we know NIR contents overlapping bands, and spectra must be treated with math operations in order to extract as much information as possible and to correlate it with the constituent of interest in the case of quantitative analysis or to discriminate two similar products.
Derivatives are one of those math treatments.
With “R”, we can apply a Savitzky Golay smoothing filter to apply derivatives to our spectra, and to select the order of the derivative.
Let´s have a look first to the R help:
Apply a Savitzky-Golay smoothing filter
Description
Smooth data with a Savitzky-Golay smoothing filter.
Usage
sgolayfilt(x, p = 3, n = p + 3 - p%%2, m = 0, ts = 1)
## S3 method for class 'sgolayFilter'
filter(filt, x, ...)
Arguments
x | signal to be filtered. |
p | filter order. |
n | filter length (must be odd). |
m | return the m-th derivative of the filter coefficients. |
ts | time scaling factor. |
filt | filter characteristics (normally generated by sgolay). |
... | additional arguments (ignored). |
Let’s apply the default values for SG filter to the “fattyac” data (you can do the same for the “yarn” data or “gasoline” data).
But first remember how this data looks before any math treatment was applied:
Now let’s give to m, the values 1 (First Derivative) and 2 (Second Derivative)
Here we have a powerful tool, to continue progressing doing Chemometrics with R.
Bibliography:
Chemometrics with R (Ron Wehrens)
R help page
Tweet
16 mar 2012
15 mar 2012
14 mar 2012
NIT: Fatty acids study in R - Part 007
Once we have chosen the model, we can continue acquiring spectra of new samples. Spectra is exported to a txt or csv file and we imported in R to be reprocessed.
We use the function “predict” from the PLS package. I have done this with 20 new samples. We need first to apply to them the adequate math treatment (same as the used in the model). I call this sample set for prediction “fatt2ac_val”, after apply the “msc” math treatment.
So, let´s see the predictions:
If this data has more than the spectra, (the Lab values) we can also validate and to check the number of terms to use.
We can see our RMSE error and compare it with the RMSEP obtained in the PLSR “LOO validation statistics” which was 0,5733.We can see also that we would get even lower values for validation with a lower number of components (10).
It seems that the is working almost as expected, but let´s have a look to the plots:
We use the function “predict” from the PLS package. I have done this with 20 new samples. We need first to apply to them the adequate math treatment (same as the used in the model). I call this sample set for prediction “fatt2ac_val”, after apply the “msc” math treatment.
So, let´s see the predictions:
> predict(C16_0,ncomp=12,newdata=fatty2ac_val)
, , 12 comps
C16_0
220 22.01807
221 20.44803
222 19.79991
223 21.64232
224 20.29058
225 20.20099
226 21.52053
227 19.83305
228 18.95492
229 21.39239
230 21.11044
231 20.67454
232 19.28662
233 20.97292
234 21.70614
235 20.27464
236 19.70897
237 21.30686
238 20.21069
239 19.21576
I have used the model with 12 terms. If this data has more than the spectra, (the Lab values) we can also validate and to check the number of terms to use.
> RMSEP(C16_0,newdata=fatty2ac_val,ncomps=12)
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
1.6280 1.5987 1.2554 1.1071 1.3447 0.9122
6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
0.8754 0.8413 0.6597 0.6154 0.5669 0.5791
12 comps 13 comps 14 comps 15 comps 16 comps
0.5935 0.6261 0.5994 0.6315 0.5787
It seems that the is working almost as expected, but let´s have a look to the plots:
>predplot(C16_0,ncomp=7:15,newdata=fatty2ac_val,
I can see a Bias, of almost 0,50.
The error corrected by the Bias would be 0.34.
Bias can be due to diferent reasons (temperature, sample presentation,particle size,optical path,state of the instrument,...).
This samples are fat triturated as a paste, and put it in a petri dish, and place it in an instrument in transmitance.
The next step would be to add this data to the data base and recalibrate.
Divide the data into a Calibration and a Validation Set. Be sure that in the validation set there are some samples of this last set.
The idea with all this series of post related to this data set was to work with data I used almost daily in my job, and I wanted to see how to proceed in R, and once you get use to it the results are very good for the understanding of chemometrics for multivariate data. I will continue exporting some of my data sets to work in R. The analysis of Fatty Acids by NIR/NIT can give good results for some of them (C16:0, C18:0, C18:1,..).
12 mar 2012
NIT: Fatty acids study in R - Part 006
In one of the columns, for constituent C16_0, one sample (57) has a value of “zero” (we could see this in the histogram).The reason for that is that the laboratory did not supply this value. The PLS regression will consider the lab value as cero, so we will get a plot like this:
I observed also that the sample 219 has a high residual for the regressions of all the constituents, so I decided to remove these two samples from the sample set in order to continue, and to develop the models.
I am starting with R, so I will appreciate if you add comments in order to do this task in a simpler way.
I create two sample sets, in order to remove these two samples (219 and 57):
Now we have to take the decision of how many terms to choose. Let´s see the validation plot with 7 and 12 components (terms).
If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.
I observed also that the sample 219 has a high residual for the regressions of all the constituents, so I decided to remove these two samples from the sample set in order to continue, and to develop the models.
I am starting with R, so I will appreciate if you add comments in order to do this task in a simpler way.
I create two sample sets, in order to remove these two samples (219 and 57):
> fattyac1<-fattyac_msc[1:56,]
> fattyac2<-fattyac_msc[58:218,]
and I combined this three sets again:
> fattyac_msc1<-rbind(fattyac1,fattyac2)
Well, I can develop my regression now:
Now we have to take the decision of how many terms to choose. Let´s see the validation plot with 7 and 12 components (terms).
plot(C16_0,ncomp=7,which="validation")
It is clear that the decision to choose one model or the other will have a great influence in the predictions. We need a validation set to make a better decision. But I think that it will work better with 12 terms.
It will be important, if possible to find samples with C16:0 values bellow 18 to add to our database in order to develop a better model.
Another decision could be to keep out this extreme sample until we find more, but we can decide to leave it, in order to extrapolate better in this zone.
It is important not to have unique samples in the model. In this case we have one. We have to consider this.
It will be important, if possible to find samples with C16:0 values bellow 18 to add to our database in order to develop a better model.
Another decision could be to keep out this extreme sample until we find more, but we can decide to leave it, in order to extrapolate better in this zone.
It is important not to have unique samples in the model. In this case we have one. We have to consider this.
If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.
9 mar 2012
NIT: Fatty acids study in R - Part 005 (Calibration)
There are several algorithms to run a PLS regression (I recommend to consult the books: “Introduction to Multivariate Analysis in Chemometrics - Kurt Varmuza & Peter Filzmozer” and “Chemometrics with R – Ron Wehrens”).
We are going to use the PLS package, and we are going to develop, maybe the constituent which looks more promising: Oleic Acic (C18:1).
Of course we are going to use the MSC pretreatment. For vcross validation we are going to use “leave one out”.
I decide a maximum of 16 PLS terms.
plot(C18_1_reg,"validation",estimate="CV")
abline(v=12,col="red")
plot(C18_1_reg,ncomp=12,"prediction")
abline(0,1,col="red")
We can see how after Term 12 the RMSEP increase.
We can see in the XY plot, how we have a few samples (like the sample 219) with a high residual (probably wrong lab value in some cases). If you prefer to see the residual plot, where you can see sample 219 in the upper right corner:
If we check the residual list (C18_1_reg$residuals),the extreme sample 66, fits well in the model, so we decide to keep it. No we can remove sample 219 from the sample training set removing this row from the data frame.
C18_1_reg<- plsr(C18_1~NITmsc,ncomp = 16,
+ data =fattyac1_msc,validation = "LOO")
We repeat steo [1] again, but thid time we change "fattyas_msc" for "fattyas1_msc", and look to the plots and statistics again. In this case I overplot the plots with the outlier (red colour) and without the outlier (Blue):
We keep this last model and we will test it in the future with new samples (independent test set).
If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.
We are going to use the PLS package, and we are going to develop, maybe the constituent which looks more promising: Oleic Acic (C18:1).
Of course we are going to use the MSC pretreatment. For vcross validation we are going to use “leave one out”.
I decide a maximum of 16 PLS terms.
[1] C18_1_reg<- plsr(C18_1~NITmsc, ncomp = 16,data =fattyac_msc,
+ validation = "LOO")
Looking to the Cross Validation statistics, it seems that 12 is the best number of terms to use, anyway lets see the plots.
par(mfrow=c(1,2))+ validation = "LOO")
plot(C18_1_reg,"validation",estimate="CV")
abline(v=12,col="red")
plot(C18_1_reg,ncomp=12,"prediction")
abline(0,1,col="red")
We can see in the XY plot, how we have a few samples (like the sample 219) with a high residual (probably wrong lab value in some cases). If you prefer to see the residual plot, where you can see sample 219 in the upper right corner:
If we check the residual list (C18_1_reg$residuals),the extreme sample 66, fits well in the model, so we decide to keep it. No we can remove sample 219 from the sample training set removing this row from the data frame.
C18_1_reg<- plsr(C18_1~NITmsc,ncomp = 16,
+ data =fattyac1_msc,validation = "LOO")
We repeat steo [1] again, but thid time we change "fattyas_msc" for "fattyas1_msc", and look to the plots and statistics again. In this case I overplot the plots with the outlier (red colour) and without the outlier (Blue):
We keep this last model and we will test it in the future with new samples (independent test set).
If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.
7 mar 2012
NIT: Fatty acids study in R - Part 004
> fattyac_msc[66,1:6]
C16_0 C16_1 C18_0 C18_1 C18_2 C18_3
66 15.8 2 6 62.3 10.2 0.6
Let´s compare with the summary
> summary(fattyac_msc)
C16_0 C16_1
Min. : 0.00 Min. :1.500
1st Qu.:20.10 1st Qu.:2.000
Median :21.00 Median :2.200
Mean :21.34 Mean :2.267
3rd Qu.:22.90 3rd Qu.:2.500
Max. :26.00 Max. :3.500
C18_0 C18_1
Min. : 5.800 Min. :43.80
1st Qu.: 8.600 1st Qu.:51.95
Median : 9.400 Median :54.50
Mean : 9.711 Mean :53.93
3rd Qu.:10.500 3rd Qu.:56.15
Max. :14.000 Max. :62.30
C18_2 C18_3
Min. : 5.500 Min. :0.3000
1st Qu.: 7.600 1st Qu.:0.5000
Median : 8.500 Median :0.6000
Mean : 8.503 Mean :0.6032
3rd Qu.: 9.100 3rd Qu.:0.7000
Max. :14.700 Max. :1.3000
Sample 66 has the higher value for C18:1 (oleic acid), but it is not isolated in the histogram. For some reasons this sample differs from the others especially from 100 to 1050 nm. We will wait forward to take a decision about this sample.
Until now we have been managing with the X matrix.
Now we start to study the Y matrix. First thing to do is to have a look to the summary, and of course to the histograms.
If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.
Until now we have been managing with the X matrix.
Now we start to study the Y matrix. First thing to do is to have a look to the summary, and of course to the histograms.
If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.
> hist(C16_0,col="red")
> hist(C16_1,col="blue")
> hist(C18_0,col="green")
> hist(C18_1,col="brown")
> hist(C18_2,col="violet")
> hist(C18_3,col="orange")
Suscribirse a:
Entradas (Atom)