R & Chemometrics: febrero 2015

22 feb 2015

Reducing the X matrix (Binning)

There are situations where we have spectra with a huge amount of data points and we want to reduce the number of them. There are instruments which measure spectra every 0.5 nm. This is an important advance in order to make instruments more similar one to the other in the wavelength scale, so the transferability is better, but this feature does not represent and advance when we are developing a model or calibration. In this case the normal procedure is to reduce the number of data points.

Continue using the “prospectr” package we have a function which can help us to do this task.

The steps I follow for this task, were:

1. I exported a CAL file from Win ISI (just the NIR segment) to Excel.

2. I copied the Matrix from Excel into the Clipboard.

3. I use this code to imported the X matrix into R:

hsoja.X<- as.matrix(read.table("clipboard",header=FALSE))

The dimension of this X matrix is: (159,2800), the number of samples is 159 and the number of wavelengths 2800. In the case that we have used a NIR instrument which acquires data points every 2 nm, the matrix would be (159,700), so the idea is to reduce the Matrix to this size.

plot(as.numeric(colnames(hsoja.X)),hsoja.X[1,],type="l",

+ xlab="wavelength",ylab="Absorbance")

4. Now we use the “binning” function to reduce the matrix, keeping one data points every a certain number (bin.size).In this case we will use a bin size of 4, in order to have 2800:4 = 700 data points.

hsoja.X.bin<- binning(hsoja.X,bin.size=4)

5. Now we can work with a more reduce X matrix.

15 feb 2015

Selecting samples for lab analysis (Part 2)

We have seen (Selecting samples for lab analysis - Part 1), how to select a wide and flat distribution of spectra from a wide amount, in order to send to the laboratory and develop a calibration according to our resources, but in case we can assume to send some more for the validation, we can use the function "duplex", from "prospectr package", and select at the same sample another set of samples "test" to evaluate the model.

We can use Euclidian or Mahalanobis distance for the calculation and the number of PCs that we consider.

library(prospectr)
dup<-duplex(X=X,k=20,metric="mahal",pc=3)
par(mfrow=c(3,1),ps=14)
plot(dup$pc[,1],dup$pc[,2],xlab="PC1",ylab="PC2")
points(dup$pc[dup$model,1],dup$pc[dup$model,2],pch=19,col="red")
points(dup$pc[dup$test,1],dup$pc[dup$test,2],pch=19,col="blue")
plot(dup$pc[,1],dup$pc[,3],xlab="PC1",ylab="PC3")
points(dup$pc[dup$model,1],dup$pc[dup$model,3],pch=19,col="red")
points(dup$pc[dup$test,1],dup$pc[dup$test,3],pch=19,col="blue")
plot(dup$pc[,2],dup$pc[,3],xlab="PC2",ylab="PC3")
points(dup$pc[dup$model,2],dup$pc[dup$model,3],pch=19,col="red")
points(dup$pc[dup$test,2],dup$pc[dup$test,3],pch=19,col="blue")

In "Red" are the samples selected for the Calibration, and in "Blue" the samples selected for the Validation, from different perspectives (maps of scores).

13 feb 2015

Selecting samples for lab analysis (Part 1)

Laboratory analysis for reference methods is expensive, and it will help a function which select the spectra well distributed all along the wavelength space, in the way that we get a flat distribution.

In the ”prospectr” package there are several functions which select the spectra based in their distribution in the PC space. One of these functions is the “Kennard-Stone” algorithm.

We can measure the distance in Mahalanobis or Euclidian distance.

An example can be that we have spectra of 1000 samples, but we can only afford to pay money to the laboratory for 50 or 100, so we can write in the function this number, and we get an output “model” with the selected samples.

ken_mahal<-kenStone(X=X,k=20,metric="mahal",pc=3)

plot(ken_mahal$pc[,1],ken_mahal$pc[,2],
+ xlab="PC1",ylab="PC2")
points(ken_mahal$pc[ken_mahal$model,1],
+ ken_mahal$pc[ken_mahal$model,2],pch=19,col=2)
plot(ken_mahal$pc[,1],ken_mahal$pc[,3],
+ xlab="PC1",ylab="PC3")
points(ken_mahal$pc[ken_mahal$model,1],
+ ken_mahal$pc[ken_mahal$model,3],pch=19,col=2)
plot(ken_mahal$pc[,2],ken_mahal$pc[,3],
+xlab="PC2",ylab="PC3")
points(ken_mahal$pc[ken_mahal$model,2],
+ ken_mahal$pc[ken_mahal$model,3],pch=19,col=2)

In these plot 20 samples selected and how well distributed are in the PC space

9 feb 2015

Moving average correction

It can happen that even acquiring enough subscans the spectra is still noisy, due to certain sample presentation limitations, so we can apply another technique to reduce the noise, the one called “moving average”. With this technique we select a window size (in this case 11) and the absorbance average of the 11 data points is given to the middle one. After we move one data point to the right and continue with the processing. Is for this reason that, in this case, we truncate the spectrum 5 points at the beginning and 5 points at the end.

See the results following this script:

library(prospectr)

X.movav<-movav(X.noisy,11)
plot(colnames(X),X.noisy[1,],type="l",col="red",
+ xlim=c(1100,2500),ylim=c(0.28,0.38),
+ xlab="wavelength",ylab="absorbance")
par(new=TRUE)
plot(colnames(X.movav),X.movav[1,],type="l",lwd=2,
+ xlim=c(1100,2500),ylim=c(0.28,0.38),
+ xlab="wavelength",ylab="absorbance",col="blue")
legend("topleft",legend=c("moving average","raw"),
+ lty=c(1,1),col=1:2)

Follow this tutorial

Adding noise to the spectra

Subscans and scan concept

8 feb 2015

Subscans and Scan concept

This post is part of a new tutotial to practice "Chemometrics with R", you can read the first part of the tutorial in the post " Adding Noise to the spectra". ".

Continuing with the tutorial, and using the NIRsoil spectra, this time I want to explain the concept of “Scan” and “Subscan”, being a Scan an average of certain number of Subscans (normally 32).
One subscan is normally noisy, and it is acquired, after a grating or an interferometer moves a complete cycle once, but normally we configure the software to take more scans and to get an average (can be 32 in the case of a grating instrument). After that we get an average of the 32 as the spectrum of the sample.
Every subscan is quite noisy, but because the noise is random, the average is much less noisy.
Let´s check with this simulation:

#We take one spectrum and repeated 32 times

subscan<-X[1,]

subscans<-rep(subscan,32)

subscans<-matrix(subscans,nrow=32,byrow=TRUE)

wavelength<-seq(1100,2498,by=2)

#Now we add different spectra noise to each

noise32<-noise[1:32,]

subscans<-subscans + noise32

matplot(wavelength,t(subscans),type="l",

+ xlab="wavelength",ylab="absorbance")

See in the figure the 32 subscans overplotted

#Now let´s do the average to get just one scan, the one we will use #as representative of the sample

subscan.avg<-as.matrix(colMeans(subscans))

matplot(wavelength,subscan.avg,type="l",

+ xlab="wavelength",ylab="absorbance")

7 feb 2015

Adding Noise to the spectra

Adding noise to the spectra: This case can help to develop tutorials about techniques to remove the noise, the package “prospectr” one of this techniques could be the one called “Moving average”, or “Running mean”. With this technique we trim the spectra a certain number of data points at the extremes (but we will seeit in other post).

In this post let´s add some noise to the “NIRsoil” data,

library(prospectr)

data(NIRsoil)
class(NIRsoil) # data.frame
names(NIRsoil) # "Nt" "Ciso" "CEC" "train" "spc"
head(NIRsoil$spc)# De 1100-2498 nm,every2nm
X<-as.matrix(NIRsoil$spc)
noise=rnorm(length(X),0,0.001) #(Generate noise)
#Constructing the Matrix:
noise<-matrix(noise,nrow=825,ncol=700)
plot(colnames(X),noise[1,],type="l",xlab="wavelength",
+ ylab="absorbance")

X.noisy<-X + noise
par(mfrow=c(2,1),ps=14)
plot(colnames(X),X[1,],type="l",xlab="wavelength",
+ ylab="absorbance",col="blue")
plot(colnames(X),X.noisy[1,],type="l",xlab="wavelength",
+ ylab="absorbance",col="red")

Up we can see the spectrum of the sample number in position 1, without noise added, and in the lower one the same spectrum with noise added.

1 feb 2015

Comparing PRCOMP and SVD for the eigenvalues calculation

PRCOMP calculates the Standard Deviation with the standard divisor (N-1), so in the output value “sdev”, we get the standard deviation of the column of the score matrix (n.a).

For example:

head(X1_prcomp$sdev)
3.21983981 0.54314465 0.41112799 0.08351649
0.06957348 0.02994683

The square of this values are the variances

head(X1_prcomp$sdev^2)
1.036737e+01 2.950061e-01 1.690262e-01 6.975004e-03
4.840470e-03 8.968124e-04

Using SVD, we get an output value “d”, with the square root of the eigenvalues. Using the same data:

head(X1_svd_d)
39.9571612 6.7402479 5.1019641 1.0364124 0.8633842 0.3716303

The square of these value are the “eigenvalues”:

head(X1_svd_d^2)
1596.5747310 45.4309414 26.0300382 1.0741506 0.7454323 0.1381091

In order to get the “eigenvalues” for the PRCOMP, we have to consider that they are divided by (N-1). In our case we have 155 samples, so (N-1) = 154.
head(X1_prcomp$sdev^2*(154))
1596.5747310 45.4309414 26.0300382 1.0741506 0.7454323 0.1381091

To see the % of explained variance we can use the value of each eigenvalue divided the sum of all of them

head(X1_svd_d^2/sum(X1_svd_d^2)*100)
95.59 2.72 1.559 0.064 0.045

This information is provided with the summary in PRCOMP:

summary(X1_prcomp)

Importance of components:
                             PC1        PC2       PC3       PC4       PC5
Standard deviation          3.2198   0.5431    0.41113   0.08352   0.06957
Proportion of Variance      0.9559    0.0272    0.01559   0.00064   0.00045
Cumulative Proportion      0.9559    0.9831    0.99873   0.99937   0.99982