
24 Jan 2022

PCA with the first derivative

In a previous post we calculated the PCA with the SNV + Detrend math treatment and identified a first set of outlier samples with the Mahalanobis distance.

When calculating a PCA we want to treat the spectra as well as possible in order to detect populations or boundaries, so any math treatment that helps with this task is welcome. This time, instead of applying SNV + Detrend, I apply the first derivative to the spectra (Spanish soil samples from the LUCAS database) and calculate the PCA.
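As a rough sketch (not necessarily the exact code behind the figures), assuming the raw absorbance spectra are stored row-wise in a hypothetical matrix "soil_spc", the first derivative and the PCA could be obtained like this:

library(prospectr)

# Savitzky-Golay first derivative (2nd order polynomial, 11-point window)
soil_d1 <- savitzkyGolay(soil_spc, m = 1, p = 2, w = 11)

# PCA on the derivative spectra (mean-centered, no scaling)
pca_soil <- prcomp(soil_d1, center = TRUE, scale. = FALSE)

summary(pca_soil)$importance[, 1:6]   # variance explained by the first six PCs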

We can have a look at the score maps (six PCs recommended) to see whether there are boundaries on them:
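Using the "pca_soil" object from the sketch above, such score maps can be drawn, for example, with pairs():

# all pairwise score maps for the first six PCs
pairs(pca_soil$x[, 1:6], labels = paste0("PC", 1:6), pch = 16, col = "blue")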

Look at the maps which include PC2 as one of the axes: a certain number of samples take a different direction from the rest. This second PC term can therefore point to something interesting in the spectra.

To see what is happening we can look at the loading spectrum for this second PC term:

This loading spectrum has the first-derivative math treatment applied, so we can compare it with a library of known reference spectra (minerals in this case) to find the best match. Here the best match is gypsum, so this second term explains the part of the variance in our spectral database due to the addition of gypsum to the soil.
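A simple way to do this comparison, as a sketch, is to correlate the PC2 loading with first-derivative reference spectra resampled to the same wavelength grid ("gypsum_d1" and "ref_lib" are hypothetical objects):

loading_pc2 <- pca_soil$rotation[, 2]          # loading of the second PC

cor(loading_pc2, gypsum_d1)                    # a high |correlation| points to gypsum

# or screen a whole list of first-derivative mineral reference spectra
sapply(ref_lib, function(ref) cor(loading_pc2, ref))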

24 Oct 2021

Modelling complex spectral data (soil) with the resemble package (VII)

This is post number 7 about the vignette "Modelling complex spectral data with the resemble package" (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux), where we try to understand how the resemble package works and to use its functions to analyze complex products such as soil.

We continue from post 6, where we saw how the dissimilarity matrix is calculated in the principal component space (orthogonal space), how new samples can be projected into that space to get their scores, and how the dissimilarity matrix between the training set and the test set is calculated.

Now the idea is to select, for every sample of the test set, a certain number of training samples which are neighbors of that sample, defining how many neighbors to choose with the "knn" argument. These selected training samples are set apart and a new principal component space is calculated, giving a new dissimilarity matrix with new distance values. In this new dissimilarity matrix the training samples which were not selected have NA values.

This is the part of the vignette called "Combine k-nearest neighbors and dissimilarity measures in the orthogonal space", where different examples with a "knn" value of 200 are developed using different PCA methods, so you can practice.
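A sketch of this step with the objects used in the vignette ("training$spc_p", "testing$spc_p", "training$Ciso"); treat the argument and component names as approximate, since they may differ between resemble versions:

library(resemble)

# PCA dissimilarity restricted, for each test sample, to its 200 nearest
# training neighbours; non-neighbours get NA in the dissimilarity matrix
knn_diss <- dissimilarity(Xr = training$spc_p,     # training spectra
                          Yr = training$Ciso,      # side variable used by the "opc" selection
                          Xu = testing$spc_p,      # test spectra to project
                          diss_method = "pca",
                          pc_selection = list(method = "opc", value = 20),
                          k = 200)

dim(knn_diss$dissimilarity)     # training x test matrix, NA outside the neighbourhoods
hist(na.omit(knn_diss$dissimilarity[, 1]),
     main = "Distances of test sample 1 to its 200 neighbours")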

For the first test set sample, the histogram of the neighbor distances between the sample itself and the training samples is:


We have chosen a test sample which is quite far from the majority of the training samples, but you can try other test set samples and you will get different distributions.

Setting apart the 200 closest samples and developing a new PCA, the NH distances of this sample to the rest are:

So we have developed a local PCA space with the 200 closest training samples and projected the first test sample into it, getting new NH values. We can do the same with PLS, choosing the constituent of interest.


19 Oct 2021

Modelling complex spectral data (soil) with the resemble package (VI)

Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux) 

Imagine that just two principal components were enough to explain 99 % of the variance, so we could see the samples in a single plane. It would be easy to calculate the distance between one sample and all the rest, just drawing lines and measuring their length. After that we can write these values in a matrix whose diagonal is zero (the distance between a sample and itself). In this case it is the training set, so we have 618 samples (618 dots) and the matrix has 618 rows and 618 columns (618 × 618).
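As a small base-R sketch, assuming "scores" is a hypothetical 618 × 2 matrix with the scores of the training samples on those two PCs:

# Euclidean distances between all pairs of training samples in the 2-PC plane
d <- as.matrix(dist(scores))

dim(d)        # 618 618
d[1:5, 1:5]   # zeros on the diagonal (distance of each sample to itself)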

We can see cases where samples are very close (blue circles), so their neighbor distance is very small, and we can expect (as we saw in a previous post) that their constituent values will be very similar.

If more components are needed to explain the variance (11, as we saw in the previous post), the dimension of the matrix stays the same (618 × 618), but the distances are no longer measured in a plane but in a multidimensional space.

This matrix is called the "dissimilarity matrix" in the vignette, and it plays a central role in the calculations and algorithms.

If we project a test set into the PC space, we get the distance between every sample of the test set and every sample of the training set. Our test set has 207 samples, so we get a 618 × 207 matrix of distances, and we can see whether the test samples are well represented or whether they could expand the database and make it more robust.

17 Oct 2021

Modelling complex spectral data (soil) with the resemble package (V)

 Continuing with the vignette Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux) 

Continuing from the last post: once we have calculated the PC terms or components (11 in the case of the last PCA analysis using the OPC method), we define planes formed by combinations of two of those terms (for example PC1-PC2, PC2-PC3, PC1-PC3, ...), and the training spectra are projected onto them to get the score of every spectrum on each PC component. All these scores are kept in a score matrix "T". The projections form a cloud which, with just two terms, would be a 2D cloud, making it easy to interpret the distances between every sample and the mean or its neighbors. With more dimensions it becomes a multivariate cloud, making visual inspection more difficult, so we have to check the projections individually in 2D or 3D planes.

Statistics like the Mahalanobis distance to the mean or to the neighbors help us check whether a sample is an outlier, whether it has very close neighbors (so it is represented by samples that are, in theory, similar), or whether it has no close neighbors and is therefore a good sample to improve the structure of the database and make it more robust.
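As a sketch, the Mahalanobis distance of every training sample to the centre of the score cloud can be computed directly from the score matrix T (here "pca_tr_opc$scores", from the code used in these posts):

T_scores <- pca_tr_opc$scores

# squared Mahalanobis distance of each sample to the mean of the scores
md2 <- mahalanobis(T_scores, center = colMeans(T_scores), cov = cov(T_scores))

# samples beyond a chi-square cutoff (e.g. 97.5 %) are candidate outliers
cutoff <- qchisq(0.975, df = ncol(T_scores))
which(md2 > cutoff)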

Let's look, with the previous code, at one of those score planes: the one formed by the PC1 and PC2 terms:

plot(pca_tr_opc$scores[,1],pca_tr_opc$scores[,2],
    xlim = c(min(pca_tr_opc$scores[,1]),
    max(pca_tr_opc$scores[,1])),
    ylim = c(min(pca_tr_opc$scores[,2]),
    max(pca_tr_opc$scores[,2])),
    xlab="PC1 ", ylab="PC2 ")

We can project the testing data on the same plane, getting the scores of the samples:

pca_projected <- predict(pca_tr_opc, newdata = testing$spc_p)

par(new=TRUE)

plot(pca_projected[,1],pca_projected[,2], col = "red", 
    xlim = c(min(pca_tr_opc$scores[,1]),
    max(pca_tr_opc$scores[,1])),
    ylim = c(min(pca_tr_opc$scores[,2]),
    max(pca_tr_opc$scores[,2])),
    xlab=" ", ylab=" ")

If we had only two PCs, this plane would be enough to show the cloud, but we have 11 PCs. We can add one more dimension graphically (3D) to see the cloud more clearly.

library(plotly)
T_training <- as.data.frame(pca_tr_opc$scores)
plot_ly(T_training, x=~T_training[,1], y=~T_training[,2],
             z=~T_training[,3], alpha = 0.7)

Practice by yourself, rotating the plot to different angles.

25 Mar 2019

Reconstruction and RMS

I am still working on an R Notebook protocol to detect adulteration or badly manufactured batches of a mixture.

The selection of the number of principal components is important for the reconstruction. We get two matrices, T and P, to reconstruct all the samples in the training set, so if we subtract the reconstruction from the real spectrum we get the residual spectrum.

These residual spectra may still contain information, so we need to keep adding principal component terms until no structure seems to be left in them.
New spectra from other batches can be projected onto the PC space using the P matrix to obtain their reconstructed and residual spectra, hoping to find patterns in the residuals that show whether they are bad batches.
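A rough sketch of these steps with base R, assuming "X" is the matrix of training spectra and "X_new" the spectra of the new batches (both hypothetical names):

pca <- prcomp(X, center = TRUE, scale. = FALSE)

k     <- 5                          # number of PCs kept (increase until the residuals look like noise)
T_mat <- pca$x[, 1:k]               # scores   (T)
P_mat <- pca$rotation[, 1:k]        # loadings (P)

# reconstructed spectra and residual matrix E for the training set
X_hat <- sweep(T_mat %*% t(P_mat), 2, pca$center, "+")
E     <- X - X_hat

# new batches: centre with the training mean, project with P and get their residuals
Xn_c  <- sweep(X_new, 2, pca$center)
E_new <- Xn_c - (Xn_c %*% P_mat) %*% t(P_mat)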

This is the case for some of these batches, shown in red over the blue residuals from the training data:

One way to measure the noise and decide whether the samples in red are bad batches with respect to the training samples is the RMS statistic. I overplot the RMS in blue for the training samples and in red for the test samples (in theory bad ones). The plot shows that some of the test samples have higher RMS values than the training set.
A cutoff value can be fitted in order to make this decision in routine use.
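For example, reusing "E" and "E_new" from the sketch above, the RMS per sample and a simple cutoff could be computed as:

# per-sample RMS of the residual spectra
rms_train <- sqrt(rowMeans(E^2))
rms_test  <- sqrt(rowMeans(E_new^2))

# a simple routine cutoff taken from the training distribution (e.g. its 99th percentile)
cutoff <- quantile(rms_train, 0.99)
which(rms_test > cutoff)            # batches flagged as suspect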



21 Mar 2019

Overplotting residual spectra of Training and Test sets (Good Product)

After we have developed a prediction model with a certain number of principal components, there is always a matrix of residual spectra with the noise not explained by the model. Of course we can increase or reduce the number of PCs, but we risk overfitting or underfitting the model, bringing noise into the model or leaving interesting variance in the residual matrix.
 
This residual matrix is normally called "E".
 
It is interesting to look at this matrix, especially for detecting adulterants, mistakes in the proportions of a mixture, or any other difference between the validation samples (in this case, in theory, bad samples) and the training residuals.
 
In this case I overplot both for a model with 5 PCs (the validation residual spectra in red and the training residual spectra in blue).
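A sketch of how such an overlay can be drawn, assuming "wl" is the wavelength vector and "E" and "E_new" are the residual matrices from the reconstruction sketch in the post above:

matplot(wl, t(E), type = "l", lty = 1, col = "blue",
        xlab = "wavelength (nm)", ylab = "residual",
        main = "Residual spectra: training (blue) vs validation (red)")
matlines(wl, t(E_new), lty = 1, col = "red")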


We can see interesting patterns that we must study in more detail to answer some questions: whether the model is underfitted, whether the patterns are enough to determine that the validation samples have adulterations or changes in the concentrations of the mixture ingredients, or whether the model contains samples that should have been treated as outliers and removed.


20 Mar 2019

Projecting bad batches over training PC space

Dear readers, tonight this blog will reach 300,000 visits and I am happy about that. Thanks to all of you for visiting.
 
In the last posts I have been writing about the idea of collecting a set of samples from several batches of a product which is a mixture of other products in certain percentages. The aim, of course, is a homogeneous product with the correct proportion of every ingredient in the mixture.
 
Still, there is variability in the ingredients themselves (different batches, seasons, origins, handling, ...), and there is also uncertainty in the measurement of the quantities. It can be much worse if, by mistake, an ingredient is not added to the mixture or is confused with another one.
 
So, building a sample set with all the variability that can be allowed is important in order to determine whether a product is correctly mixed or manufactured.
 
In this plot I see variability which I consider acceptable in a "principal component space".
Over this PC space we project other batches and check whether the projections fall within the limits set during the development of the PC model. Of course new variability may appear that we will have to add to the model in a future update.
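As a sketch of such a limits check, reusing the hypothetical "pca", "P_mat" and "T_mat" objects from the reconstruction sketch in the post above:

lims  <- apply(T_mat, 2, range)                     # training min/max for each PC
T_new <- sweep(X_new, 2, pca$center) %*% P_mat      # scores of the projected batches

# flag batches whose score falls outside the training range on any PC
flag <- apply(T_new, 1, function(s) any(s < lims[1, ] | s > lims[2, ]))
which(flag)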
 
But to check that the model performs well we have to test it with badly made batches, and this is the case in the next plot, where we can clearly see batches that are out of the limits (especially samples 1, 2 and 3), with much more water than the samples in the training model.
 
We have to look at the other samples in much more detail to detect whether they are wrong and why.
More posts about this matter coming soon.

19 Apr 2018

Projections over the planes in PC space

This is a 3D plot to see the orthogonal projections onto the plane formed by the first and second PCs calculated with SVD. Once projected onto the plane, the projections are projected again onto the new axes (the terms or loadings) to calculate the score for every PC.
The plot can be moved with the mouse, so look for the best view.

Data is centered so we see samples on both sides of the plane.
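A small sketch of these projections, assuming "X" is a hypothetical matrix of spectra (samples in rows):

Xc  <- scale(X, center = TRUE, scale = FALSE)   # centred data
sv  <- svd(Xc)
P12 <- sv$v[, 1:2]                 # loadings of PC1 and PC2 (they define the plane)

scores_12 <- Xc %*% P12            # coordinates of each sample on the plane (the scores)
plane_pts <- scores_12 %*% t(P12)  # orthogonal projection of each centred sample onto the plane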



6 May 2017

Checking the orthogonality of P (loadings) matrix

One of the values we got in the script of the post "Tutorials with Resemble (Part 3 - orthoProjection)" was the loadings matrix (X.loadings), or what we usually call in this blog the P matrix.

One of the characteristics of the loadings "P" matrix, when we develop the PCA, is that if we multiply it by its transpose we get the identity matrix "I":

P <- X.loadings

Pt <- t(X.loadings)

round(P %*% Pt, digits = 4)   # = I, the identity matrix

 
In the "I" matrix the diagonal is "1" and all the other cells are "0", indicating that all the loadings are orthogonal to each other.

Exercise (a possible sketch follows below):
  • Check it by yourself and extract the diagonal from the product matrix.
  • Represent in a graphic the first loadings:
    • 1 vs 2: a plane
    • 1, 2 and 3: a cube
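A possible sketch for the exercise (assuming, as in the code above, that "P" contains one loading per row):

I_check <- round(P %*% t(P), digits = 4)
diag(I_check)                                  # the diagonal: all ones

plot(P[1, ], P[2, ],                           # 1 vs 2: a plane
     xlab = "loading 1", ylab = "loading 2")

library(scatterplot3d)                         # 1, 2 and 3: a cube
scatterplot3d(P[1, ], P[2, ], P[3, ],
              xlab = "loading 1", ylab = "loading 2", zlab = "loading 3")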
 

4 Apr 2016

Looking to the Resemble score plots (2D and 3D)

It is always nice, when testing one of the several chemometric packages, to use other packages to get nice plots and information.
Now that I am using principal components again, it is a good occasion to use packages such as "scatterplot3d", "rgl" or "Rcmdr" (you can download and install them from the CRAN servers) to see the scores and get a better idea of the sample distribution.
In the previous posts we saw how to get the matrix of scores "T" from the principal component projections with Resemble: "pcProj$scores".
So now we just have to load scatterplot3d library:

> library(scatterplot3d) 
 
and plot the scores for the first 3 PCs:
 
>scatterplot3d(pcProj$scores[,1:3], color = "blue",
              angle = 55, scale.y = 0.7, pch = 16) 
 
After the plot appears:
  
We can see also the projections of the scores over the plane formed by PC1 and PC2:
scatterplot3d(pcProj$scores[,1:3], highlight.3d =TRUE,
     angle = 120,type="h",col.axis = "blue",
     col.grid = "lightblue",cex.axis = 1.3,
     cex.lab = 1.1,pch = 20) 
 Now install the package "rgl" and load it:
 
library(rgl)
plot3d(pcProj$scores[,1:3], col="blue", size=3) 
 
We can rotate the cube with the mouse in order to check the population better.
 
 
Another nice way to see the scores and projections in 3D is to load the package Rcmdr and, with the function "scatter3d", select three columns of the T (scores) matrix. We can move the plot with the mouse in order to choose the best angle to look at the scores.

library(Rcmdr)
scatter3d(pcProj$scores[,1],pcProj$scores[,2],pcProj$scores[,3])

21 Jun 2015

The loading matrix “P” : a good example of orthogonal matrix

We know that for an orthogonal matrix A:

At · A = A · At = I

When we calculate the loading matrix during the PCA process, each loading is orthogonal (perpendicular) to all the others, so we can check this condition for fun in R, Excel, etc. with the loading matrix.

P is a very large matrix, so here we display just a corner of it: the first 6 columns (loadings or terms) and the first 6 rows (wavelengths):
 
> round(gas.loadings[1:6,1:6],digits=4)
          PC1   PC2   PC3    PC4   PC5    PC6
900 nm -0.011 0.022 0.034 -0.039 0.042 -0.020
902 nm -0.010 0.022 0.031 -0.041 0.039 -0.022
904 nm -0.011 0.022 0.030 -0.042 0.036 -0.021
906 nm -0.012 0.024 0.027 -0.045 0.031 -0.012
908 nm -0.013 0.021 0.025 -0.045 0.035 -0.013
910 nm -0.014 0.023 0.023 -0.046 0.036 -0.018
 
Pt is the transpose, so the columns are the wavelengths and the rows the loadings:
 
> round(t(gas.loadings[1:6,1:6]),digits=4)
    900 nm 902 nm 904 nm 906 nm 908 nm 910 nm
PC1 -0.011 -0.010 -0.011 -0.012 -0.013 -0.014
PC2  0.022  0.022  0.022  0.024  0.021  0.023
PC3  0.034  0.031  0.030  0.027  0.025  0.023
PC4 -0.039 -0.041 -0.042 -0.045 -0.045 -0.046
PC5  0.042  0.039  0.036  0.031  0.035  0.036
PC6 -0.020 -0.022 -0.021 -0.012 -0.013 -0.018

Now we multiply the two matrices. Note that the check has to be done keeping all the wavelength rows (a 6 × 6 corner of P is not itself an orthogonal matrix), so we take the first 6 loading columns but every row:

> round(t(gas.loadings[, 1:6]) %*% gas.loadings[, 1:6], digits = 4)
    PC1 PC2 PC3 PC4 PC5 PC6
PC1   1   0   0   0   0   0
PC2   0   1   0   0   0   0
PC3   0   0   1   0   0   0
PC4   0   0   0   1   0   0
PC5   0   0   0   0   1   0
PC6   0   0   0   0   0   1

Since P is not square (many more wavelengths than terms), only Pt · P gives the identity; P · Pt is a projection matrix, not the identity.

27 May 2015

NIR used to study Mixtures (Part 1)

When developing a mixture of ingredients we let it mix for some time to be sure that the blend is fine: if we divide the mixed batch into several subsamples, the spectra of all the subsamples should be similar to each other to a certain level. If we see differences, we consider that the mixing is not complete and we can continue mixing. There is a point in time beyond which the mixture cannot be improved, and after that point it can even become worse to a certain degree. We must be sure to finish the mixing as close to that point as possible.
If we get a library of spectra with those characteristics we can keep it and compare new batches against it, in order to check whether their spectra match the spectra in the library.
In a mixture it is also important that the quantity of every ingredient is correct: maybe we forget to add an ingredient, we add it in the wrong proportion, or by mistake we add an ingredient which does not form part of the formula. We may even want to detect a cross contaminant or other issues.
If we develop a model based on good spectra and set a realistic cutoff, we can check all the batches and detect whether they PASS or FAIL the identification or qualification test.
There can be cases where we can tell what the problem was just by looking at the spectra. One simple example: an ingredient is added to the mixture which should not be there, let's say dextrose. I acquire the spectrum of this mixture and subtract from it the average spectrum of all the good spectra in the library (let's call it the ideal spectrum); the difference spectrum turns out to be the dextrose spectrum itself, or very similar to it, with a high correlation to the dextrose product samples.
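As a sketch of this subtraction and matching, with hypothetical objects "spc_mix" (suspect mixture spectrum), "spc_ideal" (library average) and "spc_dex" (dextrose reference), all on the same wavelength grid "wl":

diff_spc <- spc_mix - spc_ideal

plot(wl, diff_spc, type = "l", xlab = "wavelength (nm)", ylab = "difference")
cor(as.numeric(diff_spc), as.numeric(spc_dex))   # a high correlation points to dextrose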
You can also see this with more complex techniques such as Principal Component Analysis (PCA), where we have a reconstructed spectrum and a residual spectrum as well. Can we see any features in the residual spectrum which tell us which ingredient or ingredients are not explained by the model? Maybe it is the spectrum of an adulterant, or of a wrong ingredient added by mistake...

13 Apr 2015

Using "Identify" function to check extreme variable weights

With the identify function we can check which wavelengths carry more weight in the loadings.
Here I plot the first four loadings after calculating the PCs, using the "gasoline" data from the package "ChemometricsWithR".
This is the code to identify the weights in the case of the first one:
matplot(wavelengths, gas.loadings[,1], type = "l", lty = 1, col = "red",
        xlab = "wavelength", ylab = "log 1/R",
        main = "loading nº1")

identify(wavelengths, gas.loadings[,1],
         labels = row.names(gas.loadings))


23 Mar 2015

Mahalanobis in the PC space (removing redundant) - 2

We have seen this plot in previous posts, but it is a good occasion to look at it again if you have read the previous post. The function Moutlier from the package chemometrics shows the outliers in the PC space with both the classical covariance matrix and the robust covariance matrix. A line marks the cutoff value for a certain chi-square distribution, and the samples beyond it are the outliers. In this example I consider just two principal components; when using more PCs we get a single value for all of them, which means that a sample may be an outlier on one of the PCs while for the overall computation it is fine.
A more conservative approach is to set a larger distance for the cutoff, or to use both a warning and an action cutoff.
In this plot we see the distances with respect to the first two PCs, and we see the same two outliers as in the previous post (four in the case of the robust covariance matrix).
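As a sketch of the call (with "T_scores" a hypothetical matrix of PC scores; component names as I recall them from the chemometrics documentation):

library(chemometrics)

# classical and robust Mahalanobis distances on the first two PC scores
out <- Moutlier(T_scores[, 1:2], quantile = 0.975, plot = TRUE)

out$cutoff                       # chi-square based cutoff drawn as a line in the plot
which(out$md > out$cutoff)       # outliers by the classical distance
which(out$rd > out$cutoff)       # outliers by the robust distance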
 
But if we consider PC2 and PC3, there are no outliers and all the samples are below the cutoff:
These are samples with certain physical properties which make them prone to be outliers on the first PC, but not on the rest of the PC score maps.
If somebody wants the file to follow the tutorial, just let me know by mail and I will send it to you. The script is on my GitHub page.
 
The chemometrics package is the R companion to the book "Introduction to Multivariate Statistical Analysis in Chemometrics" by K. Varmuza and P. Filzmoser (2009).