31 mar 2019

Combining SNV and Second Derivative

The video shows raw spectra of lactose, where we can see large differences due to particle size. After applying SNV, those differences almost disappear.
 
If we combine SNV and the Second Derivative, the particle-size effect is removed by the SNV step and, at the same time, the resolution is improved.
 
If we only apply the Second Derivative, we increase the resolution but keep the particle-size effect.
 
All these combinations help us find the best option for a quantitative or qualitative model.
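A minimal sketch of these three treatments in R with the "prospectr" package, using its example NIRsoil spectra (the lactose spectra of the video are not available here, and the window size and polynomial order are assumptions):

```r
library(prospectr)

data(NIRsoil)                    # example NIR spectra shipped with prospectr
X <- NIRsoil$spc                 # raw spectra (samples in rows, wavelengths in columns)

X_snv    <- standardNormalVariate(X)                    # SNV: removes the particle-size (scatter) effect
X_d2     <- savitzkyGolay(X, m = 2, p = 3, w = 11)      # 2nd derivative alone: keeps the particle-size effect
X_snv_d2 <- savitzkyGolay(X_snv, m = 2, p = 3, w = 11)  # SNV + 2nd derivative: scatter removed, better resolution

# quick look at the combined treatment for a few samples
matplot(t(X_snv_d2[1:10, ]), type = "l", lty = 1,
        xlab = "Wavelength index", ylab = "SNV + 2nd derivative")
```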

29 mar 2019

DC2 maximum distance WinISI algorithm (using R)

The DC2 maximum distance algorithm is one of the identification methods: first, the mean and standard deviation spectra of the training set are calculated.

The unknown spectra are centered by subtracting the mean training spectrum and then divided by the standard deviation spectrum, so we get a matrix of distances.
We can see high distances for the samples that differ most from the mean (samples 1, 2 and 3).

Then we only have to calculate the maximum value (distance) of this matrix for every row (spectrum).

A cutoff of 3 is used by default, but it can be changed if needed.
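A minimal sketch in R of the calculation described above. Xtrain and Xunk are dummy matrices standing in for the training and unknown spectra, and taking the absolute value before the maximum is an assumption:

```r
set.seed(1)
Xtrain <- matrix(rnorm(40 * 100), 40, 100)     # dummy training spectra (samples x wavelengths)
Xunk   <- matrix(rnorm(5 * 100), 5, 100)       # dummy unknown spectra

mean_sp <- colMeans(Xtrain)                    # mean training spectrum
sd_sp   <- apply(Xtrain, 2, sd)                # standard deviation spectrum

# center the unknown spectra and divide by the sd spectrum -> matrix of distances
D <- sweep(sweep(Xunk, 2, mean_sp, "-"), 2, sd_sp, "/")

# maximum (absolute) distance per spectrum and the default cutoff of 3
max_dist <- apply(abs(D), 1, max)
cutoff   <- 3
max_dist > cutoff                              # TRUE flags spectra not identified as the product
```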

27 mar 2019

Tidy Tuesday screencast: analyzing pet names in Seattle

It is always great to see how well David Robinson (@drob) works with R in his #TidyTuesday videos.
 

25 mar 2019

Reconstruction and RMS

I am still working on a protocol, in an R Notebook, to detect adulteration or badly manufactured batches of a mixture.

The selection of the number of principal components is important for the reconstruction. We get two matrices, T (scores) and P (loadings), to reconstruct all the samples in the training set, so if we subtract the reconstruction from the real spectrum we get the residual spectrum.

These residual spectra may still contain information, so we need to keep adding principal component terms until no information seems to be left in them.
New batches of spectra can be projected onto the PC space using the P matrix to get their reconstructed spectra and their residual spectra, hoping to find patterns in the residuals that show whether they are bad batches.
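A minimal sketch of this reconstruction with prcomp. X and Xnew are dummy matrices standing in for the training spectra and the new batches, and 5 PCs is just an example:

```r
set.seed(1)
X    <- matrix(rnorm(50 * 100), 50, 100)             # dummy training spectra (samples x wavelengths)
Xnew <- matrix(rnorm(10 * 100), 10, 100)             # dummy new batches

ncomp <- 5
Xc  <- scale(X, center = TRUE, scale = FALSE)        # mean-center the training spectra
pca <- prcomp(Xc, center = FALSE)
T_scores <- pca$x[, 1:ncomp]                         # scores matrix T
P_loads  <- pca$rotation[, 1:ncomp]                  # loadings matrix P

X_hat   <- T_scores %*% t(P_loads)                   # reconstructed (centered) training spectra
E_train <- Xc - X_hat                                # residual spectra of the training set

# project the new batches on the same PC space using P and get their residuals
Xnew_c <- scale(Xnew, center = attr(Xc, "scaled:center"), scale = FALSE)
T_new  <- Xnew_c %*% P_loads
E_new  <- Xnew_c - T_new %*% t(P_loads)              # residual spectra of the new batches
```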

This is the case for some of these batches, shown in red over the blue residuals from the training data:

One way to measure the noise and decide whether the samples in red are bad batches with respect to the training samples is the RMS statistic. I overplot the RMS in blue for the training samples and in red for the test samples (in theory bad samples). The plot shows that some of the test samples have higher RMS values than the training set.
A cutoff value can be set in order to make this decision in routine analysis.
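Continuing the sketch above, the RMS of each residual spectrum can be computed and overplotted, with an example cutoff line (the factor of 3 is only an assumption):

```r
rms <- function(E) sqrt(rowMeans(E^2))       # one RMS value per residual spectrum
rms_train <- rms(E_train)
rms_test  <- rms(E_new)

plot(rms_train, col = "blue", xlab = "Sample", ylab = "RMS",
     ylim = range(c(rms_train, rms_test)))
points(rms_test, col = "red", pch = 16)
abline(h = 3 * mean(rms_train), lty = 2)     # example cutoff for routine use
```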



21 mar 2019

Overplotting residual spectra of Training and Test sets (Good Product)

After we have developed a prediction model with a certain number of principal components, there is always a residual spectra matrix with the noise not explained by the model. Of course we can add or remove PCs, but then we can overfit or underfit the model, either bringing noise into the model or leaving variance of interest in the residual matrix.
 
This residual matrix is normally called "E".
 
It is interesting to look at this matrix, especially for the detection of adulterants, mistakes in the proportions of a mixture, or any other difference between the validation samples (in this case, in theory, bad samples) and the training residuals.
 
In this case I overplot both for a model with 5 PCs (in red the residual spectra of the validation samples and in blue the training residual spectra).
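A minimal sketch of this overplot, assuming residual matrices like E_train and E_new from the reconstruction sketch in the post above:

```r
wl <- 1:ncol(E_train)                        # wavelength index (placeholder for the real wavelengths)
matplot(wl, t(E_train), type = "l", lty = 1, col = "blue",
        xlab = "Wavelength index", ylab = "Residual")
matlines(wl, t(E_new), lty = 1, col = "red") # validation residuals over the training residuals
```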


We can see interesting patterns that we must study in more detail to answer some questions: whether the model is underfitted, whether the patterns are enough to determine if the validation samples contain adulterations or changes in the concentrations of the mixture ingredients, or whether the model includes samples that should have been considered outliers and taken out of it.


20 mar 2019

Projecting bad batches over training PC space

Dear readers, tonight this blog will reach 300,000 visits and I am happy about that. So thanks to all of you for visiting this blog.
 
In the last posts I have been writing about the idea of getting a set of samples from several batches of a product which is a mixture of other products in certain percentages. Of course the idea is to get a homogeneous product with the correct proportion of every ingredient that takes part in the mixture.
 
Anyway, there is variability in the ingredients of the mixture itself (different batches, seasons, origins, handling, ...), and there is also uncertainty in the measurement of the quantities. It can be much worse if, by mistake, an ingredient is not added to the mixture or is confused with another.
 
So, getting a set with all the variability that can be allowed is important in order to determine whether a product is correctly mixed or manufactured.
 
In this plot I see variability which I considered correct in a "Principal Component Space".
Over this PC space we project other batches and check whether the projections fall within the limits set during the development of the PC model. Of course new variability can appear that we will have to add to the model in a future update.
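A minimal sketch of such a check, continuing the reconstruction sketch from the posts above (the +/- 3 standard deviation box on the first two PCs is an assumption, not the limits used in the plots here):

```r
lim1 <- 3 * sd(T_scores[, 1])                # example limits on PC1 and PC2
lim2 <- 3 * sd(T_scores[, 2])

plot(T_scores[, 1], T_scores[, 2], col = "blue", xlab = "PC1", ylab = "PC2",
     xlim = range(c(T_scores[, 1], T_new[, 1])),
     ylim = range(c(T_scores[, 2], T_new[, 2])))
points(T_new[, 1], T_new[, 2], col = "red", pch = 16)   # projected batches
abline(v = c(-lim1, lim1), h = c(-lim2, lim2), lty = 2)

which(abs(T_new[, 1]) > lim1 | abs(T_new[, 2]) > lim2)  # batches projecting outside the limits
```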
 
But to check that the model performs well we have to test it with badly manufactured batches, and this is the case in the next plot, where we can see clear batches that are out of the limits (especially samples 1, 2 and 3) with much more water than the samples in the training model.
 
We have to look at the other samples in much more detail to detect whether they are wrong and the reason why.
More posts about this matter coming soon.

12 mar 2019

Over-plotting validation and training data in the Mahalanobis ellipses

One of the great things about R is that we can get the code of the different functions (in this case the function "drawMahal" from the package "chemometrics") and adapt this code to our needs.
 
I wanted to over-plot the scores of the training set for the first and second principal components with the scores of the validation set, which are redundant samples set apart in a selection process with the function "puchwein" from the package "prospectr". I ran into problems with the scale due to the way "drawMahal" fixes the X and Y limits, but by editing the function we can create a personalized function for our case and compare the redundant samples in red with the training samples in black.
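A minimal sketch of such a personalized version (not the actual edited "drawMahal" code): the ellipse is drawn by hand from the chi-square quantile and the axis limits are computed from both sets; scores_train and scores_val are dummy score matrices standing in for the real ones:

```r
set.seed(1)
scores_train <- matrix(rnorm(60 * 2), 60, 2)            # dummy PC1-PC2 scores of the training set
scores_val   <- matrix(rnorm(15 * 2, sd = 1.2), 15, 2)  # dummy scores of the redundant (puchwein) samples

draw_ellipse <- function(center, covmat, quantile = 0.975, m = 200) {
  r     <- sqrt(qchisq(quantile, df = 2))               # Mahalanobis radius for 2 dimensions
  theta <- seq(0, 2 * pi, length.out = m)
  circ  <- r * cbind(cos(theta), sin(theta))            # circle of radius r
  t(center + t(circ %*% chol(covmat)))                  # map the circle to the ellipse
}

S_tr  <- scores_train[, 1:2]
S_val <- scores_val[, 1:2]
ell   <- draw_ellipse(colMeans(S_tr), cov(S_tr))

plot(rbind(S_tr, S_val, ell), type = "n", xlab = "PC1", ylab = "PC2")  # limits cover both sets and the ellipse
points(S_tr, col = "black")                  # training samples
points(S_val, col = "red", pch = 16)         # redundant samples
lines(ell, col = "blue")                     # 97.5% Mahalanobis ellipse
```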
 
The next step is to over-plot the test samples (in theory bad samples) in another color in a coming post.