7 oct 2021

Modelling complex spectral data (soil) with the resemble package (I)

It is time for a new tutorial working with R and one of its packages: "resemble". We will use soil NIR spectra and follow the explanations given by the authors in this vignette:

Modelling complex spectral data with the resemble package (Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux)

For some time I have been interested in using NIR spectra to develop models, and following this tutorial can help me understand better how to apply several types of regression and see how they perform on a complex matrix like soil.


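Let's create a new R Markdown file (.Rmd) and load the libraries that we will use:

library(tidyverse)   # ggplot2, tidyr (drop_na) and the pipe
library(resemble)
library(prospectr)   # provides the NIRsoil data set
library(magrittr)
library(corrplot)    # needed later for the correlation plot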


Let's load the data "NIRsoil"; we can see that its class is "data.frame". This data frame combines the spectral matrix (NIRsoil$spc), which contains the predictor variables for every spectrum, with the response variables (Nt, Ciso and CEC) and another variable (train) which specifies whether each spectrum belongs to the training set or the test set. In the vignette you can read the authors' details about what each of these response variables represents and their units.

data("NIRsoil")
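
As a quick check of the structure described above, a minimal sketch like the following can be run right after loading the data. It assumes that "NIRsoil" comes from the prospectr package and that the spectral matrix is stored in the column "spc" and the set indicator in "train" (1 = training, 0 = test), as described in the package documentation.

```r
library(prospectr)  # assumption: NIRsoil ships with prospectr

data("NIRsoil")

# Class of the object and its overall size (samples x columns)
obj_class <- class(NIRsoil)
obj_dim   <- dim(NIRsoil)
obj_class
obj_dim

# Column names: the three responses, the set indicator and the spectra
names(NIRsoil)

# Size of the spectral matrix (samples x wavelengths)
spc_dim <- dim(NIRsoil$spc)
spc_dim

# How many samples fall in each set (1 = training, 0 = test)
set_counts <- table(NIRsoil$train)
set_counts
```

Because the spectra are kept as a matrix inside one column of the data frame, dim(NIRsoil) reports only a handful of columns, while dim(NIRsoil$spc) gives the real number of wavelengths.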

Run some code to get details about the number of samples available, the range of each response variable, how many "NA" values you have, the wavelength range of the NIR spectrophotometer used, and any other information that you consider useful before going on to the next steps. For example, to see the distribution of nitrogen:

NIRsoil$Nt %>%
  summary()

NIRsoil %>%
  ggplot(aes(Nt)) +
  geom_histogram()


Min.   1st Qu.  Median   Mean  3rd Qu.   Max.    NA's 
0.200   1.100   1.300   1.766   2.000   8.800     180 


Do the same for the other constituents, replacing the "Nt" response variable with "Ciso" and "CEC".
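
The checks suggested above (ranges, number of "NA" values, wavelength range) can be sketched for the three responses at once. This is only one possible way to do it; it assumes that the column names of NIRsoil$spc are the wavelengths in nm, and the names "responses", "na_counts" and "wavelengths" are introduced here just for illustration.

```r
library(prospectr)  # assumption: NIRsoil ships with prospectr

data("NIRsoil")

# The three response variables described in the text
responses <- c("Nt", "Ciso", "CEC")

# Summary statistics (including the NA count) for each response
lapply(NIRsoil[responses], summary)

# Number of missing values per response
na_counts <- colSums(is.na(NIRsoil[responses]))
na_counts

# Wavelength range, assuming the column names of spc are wavelengths in nm
wavelengths <- as.numeric(colnames(NIRsoil$spc))
range(wavelengths)
```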

It is important to check how the response variables correlate with each other:

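# Keep the three response columns and drop rows with any missing value
response <- NIRsoil[, 1:3] %>%
  drop_na()

# corrplot() comes from the corrplot package
corrplot(cor(response), method = "number")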



As we can see, there is a high correlation between the parameters.
We can check it in an XY plot for "Ciso" and "Nt":

response %>%
  ggplot(aes(x = Ciso, y = Nt)) +
  geom_point()
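
To put a number on what the scatter plot shows, the Pearson correlation between "Ciso" and "Nt" can be computed directly. The sketch below rebuilds the "response" data frame with base R (na.omit instead of drop_na) so that it runs on its own; the name "r_ciso_nt" is just an illustrative choice.

```r
library(prospectr)  # assumption: NIRsoil ships with prospectr

data("NIRsoil")

# Same idea as the "response" data frame above: rows complete for all three
response <- na.omit(NIRsoil[, c("Nt", "Ciso", "CEC")])

# Pearson correlation between Ciso and Nt on those complete rows
r_ciso_nt <- cor(response$Ciso, response$Nt)
r_ciso_nt

# Alternative: use every sample where both Ciso and Nt are available,
# even if CEC is missing (the value can differ slightly from the one above)
cor(NIRsoil$Ciso, NIRsoil$Nt, use = "complete.obs")
```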


This is a good starting point, and we will continue in an upcoming post.
