Outliers have an important influence on the PCs, so they must be detected and examined.
We have just the spectra, without lab data, and we have to check whether any of the sample spectra is an outlier (a noisy spectrum, a sample that belongs to another population, an extreme sample for a particular constituent, ...).
One way to detect outliers is the Mahalanobis distance, and we are going to calculate the Mahalanobis distance of every sample to the center of the population.
Previously we calculated the T score matrix, for three PCs, with the NIPALS algorithm.
Let's now calculate the Mahalanobis distance (library: chemometrics) for a certain probability.
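As a rough sketch of the idea, here is the calculation with base R's mahalanobis() on a simulated matrix standing in for the NIPALS score matrix T, with the cutoff taken from a chi-square quantile:

```r
# Mahalanobis distance of every sample to the center of the scores.
# T here is simulated; in the post it would be the NIPALS score matrix (n x 3).
set.seed(1)
T <- matrix(rnorm(100 * 3), ncol = 3)
D2 <- mahalanobis(T, center = colMeans(T), cov = cov(T))  # squared distances
limit <- qchisq(0.975, df = ncol(T))   # cutoff for a chosen probability (97.5%)
outliers <- which(D2 > limit)          # samples flagged as possible outliers
```

The chemometrics package offers a similar calculation, including versions with robust covariance estimates.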
We saw how to plot the raw spectra (X), how to calculate the mean spectrum, and how to center the spectra (subtracting the mean spectrum from every spectrum of the original matrix X). After that we developed the PCA with the NIPALS algorithm, getting two matrices: T (scores) and P (loadings).
We have to decide the number of PCs, looking at the plots or at the numbers (explained variance).
Depending on the number of PCs, these matrices will have more or fewer columns.
With these two matrices we can reconstruct the centered X matrix, but we will also get a residual matrix "E".
Xc = T·Pᵗ + E
This post just shows this in R:
> resid3pc <- Xc - Xc3pc_reconst
We can see the plots of the reconstructed centered X matrix and the plot representing the residual variance, the error matrix "E". If we add the mean spectrum to every spectrum of the centered matrix, we get the reconstructed X matrix.
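A minimal sketch of this reconstruction, here with svd() on a small simulated matrix instead of the NIPALS results (the names Xc3pc_reconst and resid3pc follow the post):

```r
# Reconstruct the centered matrix from 3 PCs and compute the residual matrix E.
set.seed(1)
X  <- matrix(rnorm(20 * 10), ncol = 10)        # simulated "spectra"
Xc <- scale(X, center = TRUE, scale = FALSE)   # centered matrix
s  <- svd(Xc)
T  <- s$u[, 1:3] %*% diag(s$d[1:3])            # scores for 3 PCs
P  <- s$v[, 1:3]                               # loadings for 3 PCs
Xc3pc_reconst <- T %*% t(P)                    # T.Pt approximation of Xc
resid3pc <- Xc - Xc3pc_reconst                 # residual (error) matrix E
```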
This 2D plot helps us to decide the number of PCs, and it is easy to create in R once we have decomposed the X matrix into a P matrix (loadings) and a T matrix (scores).
For this plot, we just need the T matrix.
Every dot on each vertical line represents the score of a sample for that particular PC. We made the NIPALS calculations for 10 PCs. Each vertical line represents the projections of the samples onto that particular PC, and the score of a sample for that PC is its distance to the mean.
For every PC we can calculate the standard deviation of all the scores, and the variance.
As we can see, the first 2 PCs represent almost all the variance, and for the rest the projections become narrower.
This plot is good for selecting how many components to keep, and also for detecting outliers, extreme samples, ...
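A sketch of the numbers behind this plot, using simulated data standing in for the spectra and scores from svd() rather than NIPALS:

```r
# Standard deviation, variance and % explained variance of the scores per PC.
set.seed(1)
Xc <- scale(matrix(rnorm(30 * 15), ncol = 15), center = TRUE, scale = FALSE)
s  <- svd(Xc)
T  <- s$u %*% diag(s$d)                 # score matrix, one column per PC
sd_pc  <- apply(T, 2, sd)               # SD of the scores for every PC
var_pc <- sd_pc^2                       # variance per PC
explained <- 100 * var_pc / sum(var_pc) # % of explained variance per PC
# stripchart(as.data.frame(T[, 1:10]), vertical = TRUE)  # one line per PC
```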
The idea of this post is to compare the score plots for the first 3 principal components obtained with the "svd" algorithm with the score plots of other chemometric software (Win ISI in this case). Previously I had exported the yarn spectra to this software.
Let's first use the "pairs" command to see in R the score plots for the first 3 PCs:
> T <- Xc_svd$u %*% diag(Xc_svd$d)
I only chose two planes to compare, but the rest of the planes have the same shapes as those calculated with "svd" in "R". (Blue background: Win ISI score planes).
There are different algorithms to calculate the Principal Components (PCs). Kurt Varmuza & Peter Filzmoser explain them in their book "Introduction to Multivariate Statistical Analysis in Chemometrics".
I'm going to apply one of them to the Yarn spectra.
First we have to center the X matrix; let's call it Xc.
> Xc<-scale(yarn$NIR, center = TRUE, scale = FALSE)
The algorithm I´m going to apply is “Singular Value Decomposition”.
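A compact sketch of this decomposition, with a simulated matrix standing in for the centered yarn spectra:

```r
# Singular Value Decomposition of the centered matrix: Xc = U D V'.
# Scores are T = U D and loadings are P = V.
set.seed(1)
Xc <- scale(matrix(rnorm(28 * 50), ncol = 50), center = TRUE, scale = FALSE)
Xc_svd <- svd(Xc)
T <- Xc_svd$u %*% diag(Xc_svd$d)   # scores
P <- Xc_svd$v                      # loadings
# pairs(T[, 1:3])                  # score plots for the first 3 PCs
```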
The idea of this post is just to look at the loadings matrix (P). Loadings are spectra which, together with the score matrix (T) and an error matrix (E), reconstruct the original X matrix.
In this case 3 components are enough, because they explain almost 99% of the variance, so let's have a look at the first three loadings:
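To see the loadings as spectra, something like this works (again with a simulated matrix in place of the yarn data):

```r
# Plot the first three loadings, one line per PC.
set.seed(1)
Xc <- scale(matrix(rnorm(28 * 50), ncol = 50), center = TRUE, scale = FALSE)
P3 <- svd(Xc)$v[, 1:3]             # first three loadings
matplot(P3, type = "l", lty = 1,
        xlab = "wavelength index", ylab = "loading")
```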
This is a simple exercise, but it is good practice to check important points about "Import/Export".
The idea was to export the spectra (Yarn) from R to a TXT file. There are different ways to do it, but I used:
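One common way to do the export is write.table(); this is a sketch, not necessarily the exact command used here, with a small stand-in matrix replacing yarn$NIR and a hypothetical file name:

```r
# Export a spectra matrix to a tab-separated TXT file.
spectra <- matrix(rnorm(5 * 10), ncol = 10)    # stands in for yarn$NIR
write.table(spectra, file = "yarn_nir.txt",    # hypothetical file name
            row.names = FALSE, col.names = FALSE, sep = "\t")
```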
Now other software, like Win ISI in my case, has the option to "Import" this data and convert it into its own format. We can use other software like Unscrambler, ..., and of course Excel.
I want to work a little bit with this interaction in future posts. We can use other software to treat the spectra with derivatives (1st, 2nd, 3rd, 4th) or any other treatment, and import it back into R.
This is the Yarn NIR spectra in Win ISI:
This is another pretreatment used quite often in Near Infrared to remove the scatter. It is applied to every spectrum individually.
The average and standard deviation of all the data points of that spectrum are calculated. The average value is subtracted from the absorbance at every data point, and the result is divided by the standard deviation.
"R" has a function to center and scale every vector, which we can use to get the SNV spectrum. Let's apply this function to our known Yarn NIR data.
> library(pls)
> data(yarn)
These are the raw spectra without any treatment:
If we apply the SNV treatment, the spectra change a little bit and some of the scatter is removed:
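The "center and scale every vector" function is scale(); since it works on columns, transposing twice applies it to every spectrum (row). A sketch with a small simulated matrix standing in for yarn$NIR:

```r
# SNV: for each spectrum, subtract its mean and divide by its SD.
set.seed(1)
X <- matrix(rnorm(5 * 20, mean = 0.5, sd = 0.1), ncol = 20)
X_snv <- t(scale(t(X)))            # scale() per row = SNV per spectrum
```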
It is always good to look at the spectra from different points of view before developing a regression. This helps us to understand our samples better, to detect outliers, to check where the variability is, and to see whether that variability correlates (directly or inversely) with the constituent of interest.
Chemometric software packages have the tools to do it.
A simple tool to check our spectra is the standard deviation spectrum, where every data point is the standard deviation of the absorbances at that wavelength. This SD spectrum is a nice way to see where the variability is, so we can better understand the importance of the wavelengths in other calculations such as the principal components, loadings, ...
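The SD spectrum is a one-liner in R (simulated matrix standing in for the spectra):

```r
# Standard deviation of the absorbances at every wavelength (column).
set.seed(1)
X <- matrix(rnorm(10 * 30), ncol = 30)
sd_spectrum <- apply(X, 2, sd)
plot(sd_spectrum, type = "l", xlab = "wavelength index", ylab = "SD")
```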
In the previous post we saw how to apply MSC (Multiplicative Scatter Correction) to the NIR spectra (Yarn) with "R".
To understand better the effect of this math treatment, we can have a look at the standard deviation spectra.
The idea of this post is a simple exercise: to overplot the "standard deviation spectra" of the Yarn NIR data (from the pls package) with and without MSC applied.
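A sketch of the overplot, using the msc() function from the pls package:

```r
# Overlay the SD spectra of the yarn data with and without MSC.
library(pls)
data(yarn)
sd_raw <- apply(yarn$NIR, 2, sd)       # SD spectrum of the raw data
sd_msc <- apply(msc(yarn$NIR), 2, sd)  # SD spectrum after MSC
matplot(cbind(sd_raw, sd_msc), type = "l", lty = 1,
        xlab = "wavelength index", ylab = "SD")
```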
> data(gasoline)
> # 60 spectra of gasoline (octane is the constituent)
> # We divide the whole Set into a Train Set and a Test Set.
> gasTrain <- gasoline[1:50, ]
> gasTest <- gasoline[51:60, ]
> # Let's develop the PLSR with the Train Set and LOO CV
> gas1 <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain, validation = "LOO")
> summary(gas1)
Data:	X dimension: 50 401
	Y dimension: 50 1
Fit method: kernelpls
Number of components considered: 10

VALIDATION: RMSEP
Cross-validated using 50 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           1.545    1.357   0.2966   0.2524   0.2476   0.2398   0.2319
adjCV        1.545    1.356   0.2947   0.2521   0.2478   0.2388   0.2313
       7 comps  8 comps  9 comps  10 comps
CV      0.2386   0.2316   0.2449    0.2673
adjCV   0.2377   0.2308   0.2438    0.2657

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X         78.17    85.58    93.41    96.06    96.94    97.89    98.38    98.85
octane    29.39    96.85    97.89    98.26    98.86    98.96    99.09    99.16
        9 comps  10 comps
X         99.02     99.19
octane    99.28     99.39
> # For this exercise we decide on 3 components.
> # Let's predict our Test Set with this 3-component model.
> predict(gas1, ncomp = 3, newdata = gasTest)
, , 3 comps

     octane
60 86.97223
> # To plot these data:
> predplot(gas1, ncomp = 3, newdata = gasTest, asp = 1, line = TRUE)
> # Let's look at the RMSEP statistic. This is a very nice tool to decide
> # whether 3 components is fine or we should choose more or fewer components.
> RMSEP(gas1, newdata = gasTest)
(Intercept)     1 comps     2 comps     3 comps     4 comps     5 comps
     1.5369      1.1696      0.2445      0.2341      0.3287      0.2780
    6 comps     7 comps     8 comps     9 comps    10 comps
     0.2703      0.3301      0.3571      0.4090      0.6116
> # It's fine; we could also consider choosing only two. The RMSEP is 0.234.
> # The CV for the model with 3 components was: 0.252.
> # Really, R is a wonderful tool to develop regressions and to understand
> # better all that is behind the algorithms.
> # We can get a lot of literature on the internet to start working with R.
> # Thanks to Bjorn-Helge Mevik & Ron Wehrens for their good tutorials about
> # the pls package; they helped me to understand this program better and to
> # continue learning (I have ordered some books).
In the previous post we plotted the Cross Validation predictions with:
> plot(gas1, ncomp = 3, asp = 1, line = TRUE)
We can plot the fitted values instead with:
> plot(gas1, ncomp = 3, asp = 1, line = TRUE, which = "train")
The graphs are different:
Of course, using "train" we get overoptimistic statistics, and we should rather look at the Cross Validation or at an independent test set for validation.
We decided on 3 components to develop the PLS regression, looking at the RMSEP plot. We can use other plots, such as the MSEP plot (changing RMSEP for MSEP) or the R² plot.
> plot(R2(gas1), legendpos = "topright")
We can see how the curve becomes almost flat after the 3rd component.
We can see it better with the numbers: R2(gas1) prints the R² value for each number of components (1 to 10 comps).
We can also see the residuals in "R" for the different numbers of components (1, 2, ..., 10). The calculations of the statistics are based on these values.
These are the residual values for the PLSR model with 3 components:
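A sketch of how to pull those residuals out, refitting gas1 as in the post:

```r
# Residuals of the PLSR model for 3 components (one value per training sample).
library(pls)
data(gasoline)
gasTrain <- gasoline[1:50, ]
gas1 <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain, validation = "LOO")
resid3 <- residuals(gas1)[, , 3]   # third slice = residuals with 3 components
```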