I recommend reading and working through the paper:
Building Predictive Models in R Using the caret Package (Max Kuhn, Journal of Statistical Software, 2008)
Following its tutorial with the mutagen data in R is good practice.
The code is in the paper, but some steps take extra work in R, such as the code in red.
library(caret)
set.seed(1)
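#assumes descr (descriptors) and mutagen (class labels) are already loaded in the workspace
in.Train<-createDataPartition(mutagen,p=3/4,list=FALSE) #stratified 75/25 split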
trainDescr<-descr[in.Train,] #used for model training
testDescr<-descr[-in.Train,] #used to evaluate model performance
trainClass<-mutagen[in.Train] #used for model training
testClass<-mutagen[-in.Train] #used to evaluate model performance
prop.table(table(mutagen)) #class distribution in the full data set
prop.table(table(trainClass)) #class distribution in the training set
#There were three zero-variance predictors in the training data.
sum(apply(trainDescr, 2, var) == 0) # 3
variance<-apply(trainDescr, 2, var)
zv<-variance==0
which(zv, arr.ind = TRUE, useNames = TRUE)
#T.F..Br. G.F..Br. I.097
#155 708 1539
trainDescr<-trainDescr[,-c(155,708,1539)] #zero-variance descriptors removed
testDescr<-testDescr[,-c(155,708,1539)] #zero-variance descriptors removed
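The hard-coded positions above are specific to this seed and split. As a minimal sketch of the same step with a logical mask, here is a version on a small made-up data frame. The `train` and `test` objects and their values are illustrative assumptions, not the mutagen data:

```r
# Toy data standing in for trainDescr/testDescr (hypothetical values)
train <- data.frame(a = c(1, 2, 3, 4),
                    b = c(5, 5, 5, 5),        # constant in the training rows
                    c = c(0.1, 0.4, 0.2, 0.9))
test  <- data.frame(a = c(9, 8), b = c(5, 6), c = c(0.3, 0.7))

# Flag columns whose training-set variance is exactly zero
zv <- vapply(train, var, numeric(1)) == 0
names(zv)[zv]   # names of the flagged columns

# Drop the same columns from both sets, decided on the training data only
train_f <- train[, !zv, drop = FALSE]
test_f  <- test[,  !zv, drop = FALSE]
```

The mask is computed once from the training set and reused for the test set, so both keep identical columns. caret also offers `nearZeroVar()`, which may be worth a look when predictors are nearly, but not exactly, constant.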
#We also remove predictors to make sure that there are no
#between-predictor (absolute) correlations greater than 90%:
ncol(trainDescr) #1576
descrCorr<-cor(trainDescr) #correlation matrix, 1576 x 1576
highCorr<-findCorrelation(descrCorr,0.90)
#Remove the highly correlated descriptors from the training and test sets
trainDescr<-trainDescr[,-highCorr]
testDescr<-testDescr[,-highCorr]
ncol(trainDescr) #650
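To see what the correlation filter does on something small enough to inspect, the sketch below runs `findCorrelation()` on simulated data. The `toy` data frame is an assumption made up for illustration. The sketch also guards against an empty index vector, because negative indexing with `integer(0)` selects zero columns rather than all of them:

```r
library(caret)

set.seed(1)
x1  <- rnorm(100)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(100, sd = 0.05),  # nearly a copy of x1
                  x3 = rnorm(100))                  # unrelated noise

corr <- cor(toy)
drop <- findCorrelation(corr, cutoff = 0.90)  # column indices to remove

# Guard: toy[, -integer(0)] would select zero columns, not all of them
kept <- if (length(drop) > 0) toy[, -drop, drop = FALSE] else toy

# Largest remaining absolute pairwise correlation
corr_kept <- cor(kept)
max(abs(corr_kept[upper.tri(corr_kept)]))
```

The final line is a quick sanity check that can also be run on the filtered `trainDescr`: the value should not exceed the chosen cutoff.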