30 oct 2018

Confusion Matrix with Caret


This is a useful tool in R in order to evaluate a predictive model for classification. We know the expected value and the predicted on and from that we can get the Confusion Matrix and the useful statistics based by formulas from that matrix.
I reproduce here the code from the post: "How To Estimate Model Accuracy in R Using The Caret Package"  from the blog "Machine Learning Mastery":

# load the libraries
library(caret)
library(klaR)
# load the iris dataset
data(iris)
# define an 80%/20% train/test split of the dataset
split=0.80
trainIndex <- createDataPartition(iris$Species,

                                  p=split,
                                  list=FALSE)
data_train <- iris[ trainIndex,]
data_test <- iris[-trainIndex,]
# train a naive bayes model
model <- NaiveBayes(Species~., data=data_train)
# make predictions
x_test <- data_test[,1:4]
y_test <- data_test[,5]
predictions <- predict(model, x_test)
# summarize results
confusionMatrix(predictions$class, y_test)


Try to understand the results, some samples are well classified and others not. So we must try to find the model where we have the better statistics for the classification. This is a simple example, but why not to try this machine learning algorithms to spectra for classification and use the confusion matrix to get the best model.
The statistics we get running the last line of code are:
> confusionMatrix(predictions$class, y_test)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0          9         1
  virginica       0          1         9

Overall Statistics
                                          
               Accuracy : 0.9333          
                 95% CI : (0.7793, 0.9918)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 8.747e-12       
                                          
                  Kappa : 0.9             
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9000           0.9000
Specificity                 1.0000            0.9500           0.9500
Pos Pred Value              1.0000            0.9000           0.9000
Neg Pred Value              1.0000            0.9500           0.9500
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3000           0.3000
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.9250           0.9250

An easy example to understand the confusion matrix can be with this code:
library(caret)
expected <- factor(c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0))
predicted <- factor(c(1, 0, 0, 1, 0, 0, 1, 1, 1, 0))
results <- confusionMatrix(data=predicted, reference=expected)
print(results)

Where you get:
> print(results)
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 4 1
         1 2 3
                                          
               Accuracy : 0.7             
                 95% CI : (0.3475, 0.9333)
    No Information Rate : 0.6             
    P-Value [Acc > NIR] : 0.3823          
                                          
                  Kappa : 0.4             
 Mcnemar's Test P-Value : 1.0000          
                                          
            Sensitivity : 0.6667          
            Specificity : 0.7500          
         Pos Pred Value : 0.8000          
         Neg Pred Value : 0.6000          
             Prevalence : 0.6000          
         Detection Rate : 0.4000          
   Detection Prevalence : 0.5000          
      Balanced Accuracy : 0.7083   

From the Caret Documentation which are the formulas for these statistics:

1 comentario: