14 oct. 2018

CARET:Splitting Based on the Predictors

I´m practicing with CARET, and the best way is to follow the tutorials in the webpage. This time is the way how we can split the data with Caret:

4.2:Splitting Based on the Predictors

Read and try to understand the concept.
I try to write the code of the plot for the plot  of the figure, and finally more or less I do it:

testing <- scale(BostonHousing[, c("age", "nox")])
set.seed(5)
## A random sample of 5 data points
startSet <- sample(1:dim(testing)[1], 5)
samplePool <- testing[-startSet,]
start <- testing[startSet,]
newSamp <- maxDissim(start, samplePool, n = 20)
newSamp<-samplePool[newSamp,]
rownames(newSamp)<-c(1:20)
plot(samplePool[,1],samplePool[,2],pch=20,

     col="grey",xlim=c(-2.500,1.400),
     ylim=c(-1.600,2.900),xlab="age",ylab="nox")
par(new=TRUE)
plot(start[,1],start[,2],pch="S",

     col="red",xlim=c(-2.500,1.400),
     ylim=c(-1.600,2.900),xlab="",ylab="",cex=1.3, font=2)
par(new=TRUE)
plot(newSamp[,1],newSamp[,2],col="blue",xlim=c(-2.500,1.400),
     ylim=c(-1.600,2.900),xlab="",ylab="")
text(newSamp[,1],newSamp[,2],

     labels=rownames(newSamp),cex=1.3,     
     font=2)

The samples chosen are different because of the random order. See how the distribution of the chosen samples cover the structure of the data.

10 oct. 2018

PCA with Caret

In this plot we teste different types of Principal Components Analysis with different packages. This time I use Caret.

I use the same Tecator Meat data which comes with the package. Spectra is treated with MSC  (Multiple Scatter Correction) and I represent the plane of the scores with the two terms chosen by the PCA processing:

absorp_pca<-preProcess(absorpTrainMSC,
                       method = c("center", "scale","pca"),
                       thresh = 0.95)
PC_scores_train<-predict.preProcess(absorp_pca,absorpTrainMSC)
plot(PC_scores_train[,1],PC_scores_train[,2],col="blue",
     xlim=c(-15,11),ylim = c(-20,11),
     xlab = "PC1",ylab = "PC2")
PC_scores_test<-predict.preProcess(absorp_pca,absorpTestMSC)
par(new=TRUE)
plot(PC_scores_test[,1],PC_scores_test[,2],col="red",
     xlim=c(-15,11),ylim = c(-20,11),
     xlab = "",ylab="")


Now we get the plot of the scores for the training set in blue and for the test set in blue.




9 oct. 2018

Playing with CARET and TECATOR MEAT DATA

###### LOADING TECATOR DATA ###################################
library(caret)
library(pls)
data(tecator)
#' Loading the tecator data we load two matrices:
    #' The spectra matrix "absorp" (raw spectra)
    #' We want to create another matrix with MSC math treatment
    absorpMSC<-msc(absorp)
    #' The constituents matrix "endpoints (Moisture, Fat & Protein)
set.seed(930)
#We can add names to the columns with the wavelengths values.
wavelengths<-as.matrix(seq(850,1048,by=2))
colnames(absorp)<-wavelengths
colnames(endpoints)<- c("Moisture","Fat","Protein")
#' We will model the protein content data and create a data partition
#' leaving 3/4 for the training set and 1/ for the validation set.
#' With the createDataPartition we generate a selection of sample positions
#' in a ramdon order to take after this samples out from the absorp and
#' endpoint matrices. 
###### SPLITTING THE DATA #####################################
trainMeats <- createDataPartition(endpoints[,3], p = 3/4)
#'Now we select the correspondant training and validation matrices
#'with the raw and MSC treated spectra
absorpTrain  <- absorp[trainMeats[[1]], ]
absorpTrainMSC<-as.matrix(absorpMSC[trainMeats[[1]], ])
absorpTest   <- absorp[-trainMeats[[1]], ]
absorpTestMSC   <- as.matrix(absorpMSC[-trainMeats[[1]], ])
#########  RAW SCAN SPECTRA  ##################################################
matplot(wavelengths,t(absorpTrain),type="l",
        xlab="wavelengths",ylab="Absorbance",col="blue")
par(new=TRUE)
matplot(wavelengths,t(absorpTest),type="l",
        xlab="",ylab="",col="green")
#########  MSC SCAN SPECTRA  ##################################################
matplot(wavelengths,t(absorpTrainMSC),type="l",xlab="wavelengths",
        ylab="transmitance",ylim =c(min(absorpTrainMSC)-0.1,
                                    max(absorpTrainMSC)+0.1),
        col="blue")
par(new=TRUE)
matplot(wavelengths,t(absorpTestMSC),type="l",xlab="wavelengths",
        ylab="transmitance",ylim =c(min(absorpTrainMSC)-0.1,
                                    max(absorpTrainMSC)+0.1),
                                    col="green")
#'and from the endpoint matrix for every constituent
moistureTrain <- endpoints[trainMeats[[1]], 1]
fatTrain <- endpoints[trainMeats[[1]], 2]
proteinTrain <- endpoints[trainMeats[[1]], 3]
# The rest of the samples go to the Validation Set
moistureTest <- endpoints[-trainMeats[[1]],1]
fatTest <- endpoints[-trainMeats[[1]], 2]
proteinTest  <- endpoints[-trainMeats[[1]], 3]
#We can combine these two matrices:
  # For Protein
trainDataProt<-cbind(proteinTrain,absorpTrain)         #Protein Raw Training
testDataProt<-cbind(proteinTest,absorpTest)            #Protein Raw Test
trainDataProtMSC<-cbind(proteinTrain,absorpTrainMSC)   #Protein Raw Training
testDataProtMSC<-cbind(proteinTest,absorpTestMSC)      #Protein Raw Test  
  #For Fat
trainDataFat<-cbind(fatTrain,absorpTrain)               #Fat Raw Training
testDataFat<-cbind(fatTest,absorpTest)                  #Fat Raw Test
trainDataFatMSC<-cbind(fatTrain,absorpTrainMSC)         #Fat MSC Training
testDataFatMSC<-cbind(fatTest,absorpTestMSC)            #Fat MSC Test
  #For Moisture
trainDataMoi<-cbind(moistureTrain,absorpTrain)          #Moisture Raw Training
testDataMoi<-cbind(moistureTest,absorpTest)             #Moisture Raw Test
trainDataMoiMSC<-cbind(moistureTrain,absorpTrainMSC)    #Moisture MSC Training
testDataMoiMSC<-cbind(moistureTest,absorpTestMSC)       #Moisture MSC Test
#####  BUILDING THE MODELS ####################################
#####  MODELS FOR MOISTURE
model_moi_raw <- train(moistureTrain~.,data=trainDataMoi, method = "pls",
               scale = TRUE,
               trControl = trainControl("cv", number = 10),
               tuneLength = 20)
model_moi_msc <- train(moistureTrain~.,data=trainDataMoiMSC, method = "pls",
               scale = TRUE,
               trControl = trainControl("cv", number = 10),
               tuneLength = 20)
#####  MODELS FOR FAT
model_fat_raw <- train(fatTrain~.,data=trainDataFat, method = "pls",
                       scale = TRUE,
                       trControl = trainControl("cv", number = 10),
                       tuneLength = 20)
model_fat_msc <- train(fatTrain~.,data=trainDataFatMSC, method = "pls",
                       scale = TRUE,
                       trControl = trainControl("cv", number = 10),
                       tuneLength = 20)
#####  MODELS FOR PROTEIN
model_prot_raw <- train(proteinTrain~.,data=trainDataProt, method = "pls",
                   scale = TRUE,
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 20)
model_prot_msc <- train(proteinTrain~.,data=trainDataProtMSC, method = "pls",
                   scale = TRUE,
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 20)
######  PREDICTIONS ########################################
## PROTEIN PREDICTIONS
pred_prot_test_raw <- predict(model_prot_raw,testDataProt)
pred_prot_test_msc <- predict(model_prot_msc,testDataProtMSC)
## FAT PREDICTIONS
pred_fat_test_raw <- predict(model_fat_raw,testDataFat)
pred_fat_test_msc <- predict(model_fat_msc,testDataFatMSC)
## MOISTURE PREDICTIONS
pred_moi_test_raw <- predict(model_moi_raw,testDataMoi)
pred_moi_test_msc <- predict(model_moi_msc,testDataMoiMSC)
## PREPARING DATA FOR MONITOR FUNCTION
compare<-cbind(moistureTest,pred_moi_test_raw,pred_moi_test_msc,
               fatTest,pred_fat_test_raw,pred_fat_test_msc,
               moistureTest,pred_moi_test_raw,pred_moi_test_msc)
ID<-seq(1,52,by=1)
compare<-cbind(ID,compare)
#### MONITORING AND STATISTICS  #########################
monitor10c26_003(compare[,c(1,2,3)])
monitor10c26_003(compare[,c(1,2,4)])
monitor10c26_003(compare[,c(1,5,6)])
monitor10c26_003(compare[,c(1,5,7)])
monitor10c26_003(compare[,c(1,8,9)])
monitor10c26_003(compare[,c(1,9,10)])
#' For Moisture and Fat there is an improvement using the model
#' with MSC math treatment,
#' For Protein the result are almost the same, but with the
#' raw spectra the is a certain slope and intercept problem,
#' and if corrected there is an improvement in the statistics.


3 oct. 2018

PC Regressions with CARET

I did started to use Caret, and I will continue using it, so I have to try a lot of things in R to become familiar with it.
 
In Caret the are a data set (data=tecator) from a Tecator instrument for meat analysis, working in transmitance and in the range from 850 to 1050 nm with a total of 100 data points.
 
The parameters are Moisture, Fat and Protein. You can play around with this data to become familiar with Caret, so I try to create a quick regression with PCR.
 
Caret let us prepare the Training and Testing Data in a random order and to train the model with several kinds of cross validations. So I wrote some code apart from the help I found in the available Caret Documentation.

data(tecator)
set.seed(930)
colnames(absorp) <- paste("x", 1:ncol(absorp))
## We will model the protein content data
trainMeats <- createDataPartition(endpoints[,3], p = 3/4)
absorpTrain  <- absorp[trainMeats[[1]], ]
proteinTrain <- endpoints[trainMeats[[1]], 3]
absorpTest   <- absorp[-trainMeats[[1]], ]
proteinTest  <- endpoints[-trainMeats[[1]], 3]
trainData<-cbind(proteinTrain,absorpTrain)
testData<-cbind(proteinTest,absorpTest)

model <- train(proteinTrain~.,data=trainData, method = "pcr",
               scale = TRUE,
               trControl = trainControl("cv", number = 10),
               tuneLength = 10)

names(model)
model$bestTune
summary(model$finalModel)
predictions <- predict(model,testData)
plot(predictions,proteinTest)


With this code we get plots and statistics. This is my first step into Caret, where I would like to go really deep into. So I hope to write more posts about this subjet.