25 oct 2018

Building Predictive Models in R Using the caret Package

I recommend reading and working through the paper:

Building Predictive Models in R Using the caret Package

Following the tutorial with the mutagen data in R is good practice.
The code is in the paper, but for certain steps we have to do some extra work in R ourselves, like the code in red.

library(caret)
set.seed(1)
in.Train<-createDataPartition(mutagen,p=3/4,list=FALSE)
trainDescr<-descr[in.Train,]              #used for model training
testDescr<-descr[-in.Train,]               #used to evaluate model performance
trainClass<-mutagen[in.Train]           #used for model training
testClass<-mutagen[-in.Train]           #used to evaluate model performance
prop.table(table(mutagen))                #class distribution in the full data set
prop.table(table(trainClass))             #class distribution in the training set
#There were three zero-variance predictors in the training data.
sum(apply(trainDescr, 2, var) == 0)     # 3
variance<-apply(trainDescr, 2, var)
zv<-variance==0
which(zv, arr.ind = TRUE, useNames = TRUE)
#T.F..Br. G.F..Br.    I.097
#155      708         1539
trainDescr<-trainDescr[,-c(155,708,1539 )]  #zero variance descriptors removed
testDescr<-testDescr[,-c(155,708,1539 )]    #zero variance descriptors removed
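The column positions 155, 708 and 1539 come from this particular split, so they may change with a different seed or data set. As a minimal sketch, the same step can be done by column name instead of by hard-coded position. The toy data frames `train` and `test` and the names `zv_names`, `keep`, `train_f` and `test_f` are made up for illustration and are not part of the paper's code.

```r
# Toy stand-ins for trainDescr/testDescr (hypothetical data, not the mutagen set)
train <- data.frame(a = c(1, 2, 3, 4),
                    b = c(5, 5, 5, 5),
                    c = c(0.1, 0.4, 0.2, 0.9),
                    d = c(0, 0, 0, 0))
test  <- data.frame(a = c(5, 6),
                    b = c(5, 7),
                    c = c(0.3, 0.8),
                    d = c(0, 1))

# Variance of each training column; zero means the column is constant
variance <- apply(train, 2, var)
zv_names <- names(variance)[variance == 0]

# Keep the same columns, selected by name, in both sets
keep    <- setdiff(names(train), zv_names)
train_f <- train[, keep, drop = FALSE]
test_f  <- test[, keep, drop = FALSE]

print(zv_names)
print(names(train_f))
```

The columns are chosen from the training set only and then dropped from both sets, which mirrors what the listing above does with the fixed positions.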

#We also remove predictors to make sure that there are no
#between-predictor (absolute) correlations greater than 90%:
ncol(trainDescr)                        #1576
descrCorr<-cor(trainDescr)              #Correlation matrix, 1576 x 1576
highCorr<-findCorrelation(descrCorr,0.90)
#Remove the highly correlated descriptors from the training and test sets
trainDescr<-trainDescr[,-highCorr]
testDescr<-testDescr[,-highCorr]
ncol(trainDescr)                        #650
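One thing to watch with the negative subscript `[,-highCorr]`: in base R, if the index vector is empty (for example, if no correlation were above the cutoff), the negative index selects no columns at all instead of keeping them all. That is not the case with the mutagen data, but a small guard avoids the surprise. This is only a sketch; `drop_cols` is a hypothetical helper, not a caret function, and the matrix `m` is toy data.

```r
# Hypothetical helper: drop columns by position, tolerating an empty index
drop_cols <- function(x, idx) {
  if (length(idx) == 0) return(x)   # nothing to drop, keep everything
  x[, -idx, drop = FALSE]
}

m    <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
none <- integer(0)

# Plain negative indexing with an empty vector returns zero columns
n_plain   <- ncol(m[, -none, drop = FALSE])
# The guarded helper keeps all columns when there is nothing to drop
n_guarded <- ncol(drop_cols(m, none))
# With a real index it behaves like the usual negative subscript
n_dropped <- ncol(drop_cols(m, 2))

print(c(n_plain, n_guarded, n_dropped))
```

With such a helper the last step would read `trainDescr<-drop_cols(trainDescr,highCorr)`, and the same for `testDescr`.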
