3 Jan 2021

R exercises 3.1

The idea for this blog in 2021 is to write posts about machine learning techniques applied to all kinds of data sets, whether they ship with R packages or come from sources other than NIR or spectroscopy, using different chemometric packages. It is important to learn the basics and the techniques so that we can apply them later to spectroscopy data sets. For some time we will use the caret package, following the exercises in the book "Applied Predictive Modeling". Of course, as soon as I can share more posts about NIR, I will.

One of the data frames available in R (in the mlbench package) is "Glass". From the package help we get the description:

Description
A data frame with 214 observation containing examples of the chemical analysis of 7 different types of glass. The problem is to forecast the type of class on basis of the chemical analysis. The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence (if it is correctly identified!).
A data frame with 214 observations on 10 variables:
[,1] RI refractive index
[,2] Na Sodium
[,3] Mg Magnesium
[,4] Al Aluminum
[,5] Si Silicon
[,6] K Potassium
[,7] Ca Calcium
[,8] Ba Barium
[,9] Fe Iron
[,10] Type Type of glass (class attribute)

The first thing to do is to load the package and the data into our workspace and check its structure:

library(mlbench)
data("Glass")
head(Glass)
str(Glass)
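
As a complement to head() and str(), here is a minimal sketch of a few extra checks that can be useful before any modelling: the dimensions, the number of samples per glass type, and the number of missing values. The names type_counts and n_missing are only illustrative.

```r
library(mlbench)
data("Glass")

# Dimensions: the help page describes 214 observations on 10 variables
dim(Glass)

# Number of samples per glass type (the class we want to predict)
type_counts <- table(Glass$Type)
type_counts

# Total number of missing values in the data frame
n_missing <- sum(is.na(Glass))
n_missing
```

The table of counts shows at a glance whether the classes are balanced, which matters later when we evaluate a classifier.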

Now we can separate the predictor variables from the "Type" variable:

# Response: the glass type (a factor)
Type <- Glass$Type
# Predictors: every column except the 10th one ("Type")
GlassData <- Glass[, -10]
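
Dropping the column by position works here because "Type" is the 10th column. A small sketch of the same split done by column name, which does not depend on the column order, could look like this (glass_response and glass_predictors are illustrative names, not objects used in the rest of the post):

```r
library(mlbench)
data("Glass")

# Flag the response column by name instead of by position
is_response <- names(Glass) == "Type"

glass_response   <- Glass[["Type"]]
glass_predictors <- Glass[, !is_response, drop = FALSE]

# The nine chemical/optical predictors remain
names(glass_predictors)
```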

The exercise in the book suggests: "Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors." So I check the skewness of each predictor, and the histograms where necessary:

library(e1071)
glassSkewValues<-apply(GlassData,2,skewness)
glassSkewValues

  RI    Na    Mg    Al    Si     K    Ca    Ba    Fe 
 1.60  0.45 -1.14  0.89 -0.72  6.46  2.02  3.37  1.73 
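
To make clear what this statistic measures, here is a minimal base R sketch of the moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 3/2). The function name skew_moment is hypothetical. Note that e1071::skewness() applies, by default, a slightly different small-sample correction, so its values will not match this sketch to the last decimal.

```r
# Moment-based sample skewness: m3 / m2^(3/2)
skew_moment <- function(x) {
  x  <- x[!is.na(x)]            # drop missing values
  d  <- x - mean(x)             # deviations from the mean
  m2 <- mean(d^2)               # second central moment
  m3 <- mean(d^3)               # third central moment
  m3 / m2^(3/2)
}

# A symmetric sample gives a skewness of 0
skew_symmetric <- skew_moment(c(1, 2, 3, 4, 5))

# A long tail to the right gives a positive value
skew_right <- skew_moment(c(1, 1, 1, 2, 10))

# A long tail to the left gives a negative value
skew_left <- skew_moment(c(-10, -2, -1, -1, -1))
```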

Values close to 0 suggest a roughly symmetric distribution (the normal distribution is one such case). A positive value is a sign that the data are skewed to the right, and a negative value that they are skewed to the left. So we can check the histograms of "Na" (close to symmetric), "Mg" (skewed to the left) and "K" (strongly skewed to the right):

# 2 x 2 grid of plots (the fourth panel stays empty)
par(mfrow = c(2, 2))
hist(GlassData$Na, col = "blue")
hist(GlassData$Mg, col = "blue")
hist(GlassData$K, col = "blue")
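
The same idea can be extended to all nine predictors at once. This is only a sketch, assuming a 3 x 3 grid is readable on your graphics device; glass_predictors is an illustrative name.

```r
library(mlbench)
data("Glass")

# Keep every column except the class label
glass_predictors <- Glass[, names(Glass) != "Type"]

# Draw one histogram per predictor on a 3 x 3 grid,
# then restore the previous graphical settings
op <- par(mfrow = c(3, 3))
for (v in names(glass_predictors)) {
  hist(glass_predictors[[v]], main = v, xlab = v, col = "blue")
}
par(op)
```

Seeing all the panels together makes it easier to compare the shapes with the skewness values computed before.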


Now let's check the correlations between the predictor variables:

library(corrplot)
correlations<-cor(GlassData)
corrplot(correlations, order = "hclust")

The resulting plot gives us a clear view of the correlations between them.
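
A plot is good for the overall picture, but it can also help to list the strongest pairwise correlations as numbers. The following is a minimal base R sketch, not part of the book exercise; the names cor_pairs and top_pairs are illustrative.

```r
library(mlbench)
data("Glass")

glass_predictors <- Glass[, names(Glass) != "Type"]
correlations <- cor(glass_predictors)

# Each pair appears twice in the matrix, so keep only the upper triangle
idx <- which(upper.tri(correlations), arr.ind = TRUE)

cor_pairs <- data.frame(
  var1 = rownames(correlations)[idx[, "row"]],
  var2 = colnames(correlations)[idx[, "col"]],
  r    = correlations[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
top_pairs <- head(cor_pairs, 5)
top_pairs
```

With nine predictors there are 36 distinct pairs, so the sorted table stays short enough to read in full if needed.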

We will continue with this exercise in the next post.
