28 sept 2012

How to apply discriminant analysis in ISI Scan 4.xx




This video is part of a presentation I'm preparing about the use of discriminant analysis in Win ISI 4 and its application in ISI Scan 4.

22 sept 2012

PLS2 with "R"

I've been working these days on PLS2 calibrations in a chemometric software package called “Unscrambler”, with a data set called “jam”. I asked myself, “Can I develop PLS2 models with R?” I looked in the book “Introduction to Multivariate Statistical Analysis in Chemometrics” and got the answer: “Yes, we can”.
I have other posts on PLS regression, but they cover PLS1, where we have an X matrix (spectra) and fit a regression for one constituent of the Y matrix at a time. What about fitting the regression for all the constituents at the same time, using the whole Y matrix? That is the purpose of PLS2.
PLS2 is recommended when there is a high correlation between the constituents.
library(chemometrics)  # provides the "cereal" data set
library(pls)           # mvr() and RMSEP() come from the pls package
data(cereal)
These data are part of a set used by Varmuza et al. (2008) in other papers.
You can get a description of this data set on its R help page:
Description
For 15 cereals an X and Y data set, measured on the same objects, is available. The X data are 145 infrared spectra, and the Y data are 6 chemical/technical properties (Heating value, C, H, N, Starch, Ash). Also the scaled Y data are included (mean 0, variance 1 for each column). The cereals come from 5 groups B=Barley, M=Maize, R=Rye, T=Triticale, W=Wheat.

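Once loaded, take a look at the data:
dim(cereal$X)    # X: samples (rows) by wavelengths (columns)
dim(cereal$Ysc)  # Ysc: samples (rows) by scaled properties (columns)
We can have a look at the spectra (they are already treated with Savitzky-Golay (SG) and a first derivative).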
wavelengths<-seq(1126,2278,by=8)
matplot(wavelengths,t(cereal$X),lty=1,xlab="wavelengths(nm)",ylab="log(1/R)")
Now let's run PLS2 using “mvr”, with “LOO” (leave-one-out) cross-validation.
cerpls2<-mvr(Ysc~X,data=cereal,method="simpls",validation="LOO")
We can see a summary of the results:
summary(cerpls2)
Now we have to make an important decision: “How many terms should we choose?”
Plots can help us with it:
plot(RMSEP(cerpls2), legendpos = "topright")

We have to settle on a single number of terms for all the constituents, and looking at the plots we can say that 7 is fine. For "starch" fewer terms would be enough, but for the rest 6 or 7 is the right number.
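The same decision can also be read from the numbers behind the plot. What follows is only a rough sketch: `pick_ncomp` is a hypothetical helper (not part of any package), the 5% tolerance is an arbitrary assumption, and the small RMSEP matrix is made up for the example, not taken from the cereal model.

```r
# Hypothetical helper: for each response (row), return the smallest number of
# components whose RMSEP is within `tol` (relative) of that row's minimum.
pick_ncomp <- function(rmsep, tol = 0.05) {
  apply(rmsep, 1, function(r) {
    # columns are assumed to be ordered 0, 1, 2, ... components
    which(r <= min(r) * (1 + tol))[1] - 1
  })
}

# Made-up cross-validated RMSEP values (rows = responses, columns = 0..4 terms)
toy <- rbind(
  starch = c(1.00, 0.40, 0.30, 0.31, 0.33),
  ash    = c(1.00, 0.90, 0.70, 0.50, 0.49)
)
colnames(toy) <- paste0(0:4, "comps")

chosen <- pick_ncomp(toy)
print(chosen)
```

With the model from this post, a matrix like `toy` could probably be extracted with `RMSEP(cerpls2)$val["CV", , ]`, assuming the array is laid out as estimate x response x number of components; check `dimnames()` first.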

14 sept 2012

BLOSSOMS - The Quadratic Equation- It's Hip to Be Squared



This is another nice lecture from MIT professor Gilbert Strang. It is good to watch these videos from time to time to understand what is behind the math treatments in chemometrics. Polynomials are used in some math treatments such as Savitzky-Golay, where the value of the fitted polynomial (2nd degree, 3rd degree, ...) at the middle point of a moving window with an odd number of points is used as the modified absorbance value.
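To make the moving-window idea concrete, here is a minimal sketch in base R. It fits the polynomial explicitly with `lm()` in every window, which is slow but easy to follow; real implementations normally use precomputed convolution coefficients instead. The function name `sg_smooth` and its arguments are my own choice for the illustration.

```r
# Savitzky-Golay smoothing by explicit local polynomial fits.
# `window` must be odd; `degree` is the degree of the polynomial.
sg_smooth <- function(y, window = 5, degree = 2) {
  stopifnot(window %% 2 == 1, degree < window, length(y) >= window)
  half <- (window - 1) / 2
  pos  <- -half:half                    # positions relative to the middle point
  out  <- rep(NA_real_, length(y))      # the ends are left as NA
  for (i in (half + 1):(length(y) - half)) {
    seg <- y[(i - half):(i + half)]
    fit <- lm(seg ~ poly(pos, degree, raw = TRUE))
    out[i] <- unname(coef(fit)[1])      # value of the polynomial at pos = 0
  }
  out
}

x <- 1:11
y <- 2 + 0.5 * x - 0.1 * x^2            # an exact 2nd degree polynomial
s <- sg_smooth(y, window = 5, degree = 2)
```

Because `y` here is itself a 2nd degree polynomial, the smoothed interior points should reproduce it; with noisy spectra the fitted value replaces the noisy middle point.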

Unscrambler (Jam Exercise) - 004

In previous posts I have been practicing with Unscrambler, using some of the demo files (Jam) from the book “Multivariate Data Analysis - in practice” and following the tutorials.
I continue in this post with an important part: comparing the models to be sure which one, PCR or PLS1, is better at predicting the Y parameter “preference”. For this, it is clear that we have to look at the residual variance left by the models, taking into account, of course, the number of terms, over-fitting, ...
If we look at the plot of the Y residual variance for the PCR, we see an increase in the residual variance for the first PC. That is not good... but think about it.
PCA does not take the Y matrix into account, so the first PC can describe some important X structure that is not related to the Y parameter. Once it is extracted, the second PC correlates better with the Y matrix, but still not as well as the first PLS1 term. So this type of plot helps us to understand what is happening.
Let's now look at the PLS1 residual variance plot for Y: we get a much better prediction with the first term, because the Y matrix was part of the calculation in PLS1.
We have to decide on the best number of terms for the model. Software such as Unscrambler can suggest the best option for you, but you can change the number up or down. You have the control, but we have to check more plots and statistics before deciding on the best option.
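The effect described above can be reproduced on a small constructed example. This is a sketch under my own assumptions: the data are artificial sine/cosine patterns (not the jam data), the residual variance is computed on the calibration samples only (no validation), and `pls1_resid` is a hypothetical bare-bones PLS1 written for the illustration.

```r
# Constructed toy data: mutually orthogonal sine/cosine patterns
n <- 20
i <- seq_len(n)
h <- function(k, f) f(2 * pi * k * i / n)
big   <- 5 * h(1, cos)                  # large X structure, unrelated to y
small <- h(1, sin)                      # smaller X structure that drives y
X <- cbind(big + 0.2 * h(2, cos), big - 0.2 * h(2, cos),
           small + 0.1 * h(2, sin), small - 0.1 * h(2, sin))
y <- small + 0.3 * h(2, sin) + 0.1 * h(3, cos)

Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)

# PCR: regress y on the first k PCA scores; fraction of y variance left over
scores <- prcomp(Xc, center = FALSE)$x
pcr_resid <- sapply(1:4, function(k) {
  fit <- lm(yc ~ scores[, 1:k, drop = FALSE])
  sum(resid(fit)^2) / sum(yc^2)
})

# PLS1: each term points in the direction of maximum covariance with y
pls1_resid <- function(X, y, ncomp) {
  total <- sum(y^2)
  out <- numeric(ncomp)
  for (a in seq_len(ncomp)) {
    w  <- crossprod(X, y)
    w  <- w / sqrt(sum(w^2))
    sc <- X %*% w                        # scores for this term
    p  <- crossprod(X, sc) / sum(sc^2)   # X loadings
    q  <- sum(y * sc) / sum(sc^2)        # y loading
    X  <- X - sc %*% t(p)                # deflate X
    y  <- y - as.vector(sc) * q          # deflate y
    out[a] <- sum(y^2) / total
  }
  out
}
pls_resid <- pls1_resid(Xc, yc, ncomp = 2)

print(round(pcr_resid, 3))
print(round(pls_resid, 3))
```

In this toy case the first PC follows the large X structure and leaves the y residual variance almost untouched, while the first PLS1 term already removes most of it.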

6 sept 2012

Unscrambler (Jam Exercise) - 003

In the Jam exercise we have 3 groups of variables:
Preference: 114 representative consumers tasted the 12 jam samples and scored them on a scale from 1 to 9. The value of this variable is the mean score for each sample. This is the profiling of jam quality.
Sensory: trained sensory panelists judged the 12 jam samples, scoring them on 12 variables.
Instrumental: measurements of 6 chemical and instrumental variables. This is the cheapest method.
In the post “Unscrambler (Jam Exercise) - 001” we developed a PCR using Sensory as the “X” matrix and Preference as the “Y” (constituent) matrix.
In the post “Unscrambler (Jam Exercise) - 002” we developed a PLS1 using Sensory as the “X” matrix and Preference as the “Y” (constituent) matrix.
Another alternative could be to use Instrumental as “X” and Preference as “Y”.
Now we are going to develop a PLS2 regression using “Instrumental” as the “X” matrix and “Sensory” as the “Y” matrix.
PLS2 allows several variables in the “Y” matrix at the same time.
Which of the Y variables (expensive sensory method) can be determined from X (cheapest instrumental method)?
When developing the PLS2 regression we obtain this overview plot:

In the upper left plot we see how the first term (PC1) explains most of the variability due to harvest time.
The lower left plot gives us the explained variance for every Y parameter. We don't want to use too many terms, to avoid overfitting, so we look at this plot carefully:
We see that we mainly explain (with PC2) Sweetness, Redness, Colour and Thickness.
The reason for this is seen in the loading plot.
We should add more variables to X, from other instruments or chemical analyses, to see if we can explain some of the other Y variables.
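The explained variance per Y parameter can be sketched in a few lines of base R. This is an illustration under my own assumptions, not necessarily what Unscrambler computes: the weights of each term are taken from the first singular vector of X'Y, the variance is the calibration one (no validation), the data are artificial, and `pls2_explained` and the column names are hypothetical.

```r
# Hypothetical minimal PLS2: cumulative explained variance per Y column
pls2_explained <- function(X, Y, ncomp) {
  X <- scale(X, scale = FALSE)
  Y <- scale(Y, scale = FALSE)
  total <- colSums(Y^2)
  out <- matrix(NA_real_, ncomp, ncol(Y),
                dimnames = list(paste0("comp", seq_len(ncomp)), colnames(Y)))
  for (a in seq_len(ncomp)) {
    w  <- svd(crossprod(X, Y))$u[, 1]    # dominant direction of X'Y
    sc <- X %*% w                        # scores for this term
    p  <- crossprod(X, sc) / sum(sc^2)   # X loadings
    q  <- crossprod(Y, sc) / sum(sc^2)   # Y loadings
    X  <- X - sc %*% t(p)                # deflate X
    Y  <- Y - sc %*% t(q)                # deflate Y
    out[a, ] <- 1 - colSums(Y^2) / total
  }
  out
}

# Artificial data: two Y columns depend on X, the third one does not
n <- 20
i <- seq_len(n)
h <- function(k, f) f(2 * pi * k * i / n)
X <- cbind(x1 = h(1, cos), x2 = h(1, sin), x3 = h(2, cos))
Y <- cbind(y_a     = h(1, cos) + 0.2 * h(3, cos),
           y_b     = h(1, sin) - 0.5 * h(2, cos) + 0.2 * h(3, sin),
           y_noise = h(4, cos))

ev <- pls2_explained(X, Y, ncomp = 2)
print(round(ev, 3))
```

A Y column that stays near zero explained variance, like `y_noise` here, is the kind of variable that the current X block cannot determine.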
 

4 sept 2012

Unscrambler (Jam Exercise) - 002

PCR is an MLR regression, but instead of using the X matrix we use the T (scores) matrix. We know that the X explained variance is normally quite high for the first PC and decreases for the following ones, but in PCR there is no guarantee that the explained variance for Y follows that order. In the case of the post “Unscrambler (Jam Exercise) - 001” it is just 1% for the first PC, 57% for the second and 34% for the third. This does not happen with PLS.
Let's develop a PLS1 regression for the same X (sensory) and Y (preference) as in “Unscrambler (Jam Exercise) - 001”.
We see that the first PLS term explains 91% of the variance, the second 3% and the third 2%.
The first term is strongly influenced by parameters such as Thickness, which is inversely correlated with others that are preferred by the consumers (Redness, Colour, Sweetness and Juiciness). Other parameters, such as Chewiness, Bitterness, Rasp smell and flavor, have no influence on the consumers' preferences.
PLS terms 1 and 2 model quite well the groups for harvest time (H1, H2 and H3), which explain the parameters that are important for the customers. So 2 terms seem enough for the model.
See in this link details from CAMO about this Jam data set: http://www.camo.com/products/unscrambler/trial.swf
 


3 sept 2012

NIR Guidelines to use NIR in the pharmaceutical Industry

You can download a draft of the guideline on the "Use of NIRS by the pharmaceutical Industry", published by the European Medicines Agency.

Unscrambler (Jam Exercise) - 001

Over the next few days I will write some posts about a famous Unscrambler exercise described in the book "Multivariate Data Analysis - in practice", in order to improve my knowledge of this software.
This exercise has raspberry samples from 4 different locations and harvested at 3 different times.
The file names have the form C”a”H”b”, where C indicates the location and "a" takes the value 1 for location 1, 2 for location 2, 3 for location 3 and 4 for location 4.
H indicates the harvest time, and "b" takes the value 1 for the early harvest, 2 for the middle harvest and 3 for the late harvest.
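If the names are later used in R, the two codes can be split out with a regular expression. A small sketch, where `parse_jam_name` is a hypothetical helper and the three names are just examples of the naming scheme:

```r
# Hypothetical helper: split names of the form C<a>H<b> into their two codes
parse_jam_name <- function(x) {
  pattern <- "^C([1-4])H([1-3])$"
  stopifnot(all(grepl(pattern, x)))      # reject names that do not fit the scheme
  data.frame(
    name     = x,
    location = as.integer(sub(pattern, "\\1", x)),
    harvest  = factor(sub(pattern, "\\2", x), levels = c("1", "2", "3"),
                      labels = c("early", "middle", "late")),
    stringsAsFactors = FALSE
  )
}

info <- parse_jam_name(c("C1H1", "C3H2", "C4H3"))
print(info)
```

Having location and harvest as separate columns makes it easy to colour or group the samples in score plots.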
When developing a PCR (X variables = a series of sensory parameters, Y = average preference of 114 consumers for each sample), the scores and loadings are calculated as in PCA.
In the plot of PC1 vs PC2 we clearly see a group for the samples harvested early (H1).
We see the X variance explained by the taste variations: 48% along PC1, 28% along PC2 and 21% along PC3.
The “Y” variable is not well represented by PC1 (only 1%), but the variance explained for “Y” is 57% in PC2 and 34% in PC3.
We see that sweetness has a small loading in the PC1 vs PC2 plot (it is considered not important), but it becomes an important variable along PC3.
We can see correlations between the “X” variables:
Which variable(s) is/are inversely correlated with “thickness”? Redness and colour are inversely correlated with it, which is a characteristic of maturity (late harvest).
Which samples are thicker (harvested early, middle or late), and why? It is clear that the samples harvested early are thicker, and the samples harvested late have a lower value for this parameter.
We can draw a lot of conclusions from these plots if we study them carefully, such as which samples, and from which places, are preferred by the consumers. We see that samples from places 1 and 3 harvested late are preferred for their intense red colour.
See in this link details from CAMO about this Jam data set: http://www.camo.com/products/unscrambler/trial.swf