4 nov. 2013

Step-up & Stepwise Regressions with R

We know that a spectrum has a serial of variables ("x1","x2",....."xm", one at each wavelength), which are the photometric responses for a certain sample. Once we have a certain number of samples "n", we want to develop ‘a model to predict a constituent "y".
Practicing with a table from the book "Introduction to Multivariate Statistical Analysis in Chemometrics" we can set the basis to understand the concepts of "Step-up" and "Stepwise" Regression.
Let´s see the RSQ of each wavelength vs. the "y" variable:
(cor(x1,y1))^2    0.4768583
(cor(x2,y1))^2    0.4112143
(cor(x3,y1))^2    0.02528684
We can see that  variables "x1" and "x2" have some correlation, but the variable "x3" has a very poor correlation with "y1".
So we start the model with just one term (the x1), and we will add a new term, which will improve the model with a high F test value, we can try with "x2" and "x3", and we clearly see that the model will improve adding "x2" (better XY plot for x1_2 than for x1_3).

res1_2<-lm(y1~x1+x2)
#Intercept= 1.3528 x1= 4.4328 x2= 4.1164
Improving the RSQ to:
(cor(x1_2,y))^2    0.8880725

Adding "x3" to the model as a third term does not improve the RSQ, because it has a very low coefficient (practically cero).
x1_2_3<-1.41833 + 4.42317*x1 + 4.10108*x2 -0.03574*x3
(cor(x1_2_3,y1))^2   0.8883945

We can use the function step to find the better combination of variables to develop a model. In the case of just 3 , we can proceed as this:
lm_all<-lm(y~x1+x2+x3)
lm_step<-step(lm_all,direction="both")
summary(lm_step)

The summary give us the best variables to use and the regression coeficients: b0, b1 and b2 ( as we can see, with the same results than for "res1_2")
#(Intercept):1.3528 x1= 4.4328 x2= 4.1164

--------------------------------------------------------------------------------------------------------------------------
Let´s use now the package "Chemometrics", with NIR data:
The spectra are:

We use the stepwise function from the chemometric package to get values as:
\$usedtime
\$models
\$varnames
RSS (Residual Sum of Squares) is an statistic wich decrease as far as we add a new term, so the BIC formula has a penalty every time we add a new term.
BIC is an statistic which is decreasing as far as we improve the model, variables are added or dropped until the BIC can not be reduced more.

The wavelengths (\$varnames) selected are:
"X1115.0" "X1185.0" "X1215.0" "X1385.0" "X1420.0" "X1500.0" "X1565.0" "X1585.0" "X1690.0" "X1715.0" "X1720.0" "X1815.0" "X1995.0" "X2070.0" "X2100.0" "X2195.0"
So we can go ahead with the regression:
y ~ X1115.0 + X1185.0 + X1215.0 + X1385.0 + X1420.0 + X1500.0 +
X1565.0 + X1585.0 + X1690.0 + X1715.0 + X1720.0 + X1815.0 +
X1995.0 + X2070.0 + X2100.0 + X2195.

We have found the 15 wavelengths for the model, but now we want to check how it performs, so we need some data for validation in order to calculate the SEP (standard error of prediction) that we expect for future samples predicted with this model.
We are going to do it using the Cross Validation (CV). As you know, with this type of validation, the total sample set is divided in groups, for example four, one group is used for validation and with the other three a MLR regression is made with the 15 wavelengths (X matrix) and the parameter Glucose (Y matrix). The samples which belong to each group are selected randomly. This process will be repeated for a number of times (in this case 100 times), so it seems to be a well mode to validate our model. This type of Cross Validation is called "k-fold cross validation".
This time I prepare a video, to explain the validation regression procedure.

Recomended book in NIR-Quimiometría to follow tutorials: