12 Mar 2012

NIT: Fatty acids study in R - Part 006

In one of the columns, for constituent C16_0, one sample (57) has a value of zero (we could see this in the histogram). The reason is that the laboratory did not supply this value. The PLS regression will treat the lab value as zero, so we will get a plot like this:

I also observed that sample 219 has a high residual in the regressions for all the constituents, so I decided to remove these two samples from the sample set before continuing to develop the models.
I am just starting with R, so I would appreciate comments on how to do this task in a simpler way.
I created two subsets in order to leave out these two samples (219 and 57):

> fattyac1<-fattyac_msc[1:56,]
> fattyac2<-fattyac_msc[58:218,]
and I combined these two sets again:
> fattyac_msc1<-rbind(fattyac1,fattyac2)
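A simpler way to do the same thing may be negative indexing, which drops rows by position in one step. The sketch below uses a toy data frame (`toy`, a hypothetical stand-in for `fattyac_msc`) and assumes the sample set has 219 rows, so that sample 219 is the last one; it also shows how the row with the missing lab value could be located instead of typed by hand.

```r
# Toy stand-in for fattyac_msc (assumption: 219 rows, one constituent column).
set.seed(42)
toy <- data.frame(C16_0 = runif(219, min = 18, max = 26))
toy$C16_0[57] <- 0   # mimic the value the laboratory did not supply

# Locate samples whose lab value was recorded as zero.
missing_lab <- which(toy$C16_0 == 0)

# Drop those rows plus sample 219 (the high-residual sample) in one step.
drop_rows <- c(missing_lab, 219)
toy_clean <- toy[-drop_rows, , drop = FALSE]

nrow(toy_clean)
```

On the real data the equivalent line would be something like `fattyac_msc1 <- fattyac_msc[-c(57, 219), ]`, which should give the same rows as the two subsets joined with `rbind`.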
Well, I can develop my regression now:

Now we have to decide how many terms to choose. Let's see the validation plot with 7 and 12 components (terms).
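A comparison like this can be sketched with the `pls` package. This is only an illustration under assumptions: the data are simulated (`toy`, `NIR` and the coefficients are made up), the response is called `C16_0` as in the post, and leave-one-out cross-validation is my choice here, not necessarily the validation used for the plots in this post.

```r
library(pls)

# Simulated stand-in data (assumption): 60 samples, 20 "wavelengths".
set.seed(1)
n <- 60
p <- 20
NIR <- matrix(rnorm(n * p), nrow = n, ncol = p)
C16_0 <- 22 + as.numeric(NIR[, 1:3] %*% c(1, 0.5, -0.5)) + rnorm(n, sd = 0.1)
toy <- data.frame(C16_0 = C16_0, NIR = I(NIR))

# Fit a PLS model with up to 12 terms and leave-one-out cross-validation.
fit <- plsr(C16_0 ~ NIR, ncomp = 12, data = toy, validation = "LOO")

# Cross-validated RMSEP for 0 (intercept only) to 12 components.
cv <- drop(RMSEP(fit, estimate = "CV")$val)
cv

# RMSEP curve, then predicted vs. measured plots for 7 and 12 terms.
validationplot(fit, val.type = "RMSEP")
predplot(fit, ncomp = 7, line = TRUE)
predplot(fit, ncomp = 12, line = TRUE)
```

Looking at the cross-validated RMSEP values next to the two prediction plots can help to judge whether the extra terms really improve the model or only fit the calibration samples more closely.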


It is clear that the choice of one model or the other will have a great influence on the predictions. We need a validation set to make a better decision, but I think it will work better with 12 terms.
It will be important, if possible, to find samples with C16:0 values below 18 to add to our database in order to develop a better model.
Another option would be to keep this extreme sample out until we find more like it, but we can also decide to leave it in, in order to extrapolate better in this zone.
It is important not to have unique samples in the model. In this case we have one, and we have to take this into account.

If you want to follow this tutorial, please send me an e-mail and I'll send you the “txt” file as an attachment.
