20 Dec 2022

PCA - NIT Tutorial (part 12)

We have seen the Mahalanobis distance outliers marked in the histograms of the different parameters. In this post we see these same outliers (warning and action) marked in the inter-correlation plots between parameters and in the spectra with their corresponding math treatment.

All these observations, made before developing the calibration models, can be very useful. Click the image to access the Quarto post.




16 Dec 2022

PCA - NIT Tutorial (part 10)

Click the image to access the Quarto blog post.


In this post I talk about the Mahalanobis distance calculation in the principal component space and how we can visualize it graphically, with ellipses in the planes of the PCA score maps or with cubes in 3D, marking as outliers the samples with distances above the threshold we set.
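As a minimal sketch of the idea (not the post's exact code), base R's mahalanobis() gives the squared distances on the score matrix, and a chi-square quantile can serve as the threshold:

pca <- prcomp(absorp_train, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:3]                                    # first three terms
md2 <- mahalanobis(scores, colMeans(scores), cov(scores)) # squared distances
outliers <- which(md2 > qchisq(0.99, df = 3))             # flag the extreme samples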



PCA - NIT Tutorial (part 11)


Is there a relationship between the robust Mahalanobis distances and the concentration of the parameters? Click the image to access the Quarto blog post, which explains it.

3 Nov 2022

PCA - NIT Tutorial (Part 9)

In this post I continue exploring the PCA, looking at the score maps and creating color palettes to see their relationship with the reference values.
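As a hedged sketch of the idea (reusing pca_object and endpoints_train from the tutorial parts further down; the palette and the number of breaks are arbitrary):

pal <- colorRampPalette(c("blue", "green", "red"))(10)
groups <- cut(endpoints_train[, 3], breaks = 10, labels = FALSE)  # protein bins
plot(pca_object$x[, 1], pca_object$x[, 2], col = pal[groups], pch = 16,
     xlab = "PC1", ylab = "PC2", main = "Scores colored by protein")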

Just click the image to go to the post:


28 Oct 2022

R-Ladies STL: Introduction to Quarto with Isabella Velásquez


It was great to attend the Meetup webinar about Quarto with Isabella and R-Ladies Saint Louis. The recording of this session and the GitHub repositories are good material to keep for studying and practicing Quarto.

11 Oct 2022

SOIL SPECTROSCOPY GLOSOLAN WEBINAR


Interesting presentation from Leo Ramirez-Lopez about the use of huge spectral libraries to select the proper samples to predict local samples using genetic algorithms. These tools will be available in his R packages {resemble} and {prospectr}, probably next year.

GitHub Tutorial



Managing - Part 2 (Github and RStudio) - RStudio

This is a really good tutorial about using RStudio and GitHub.

27 Sep 2022

GLOSOLAN Soil Spectroscopy Webinar #11 | Soil sample preparation for MIR...

New Quarto blog for NIR and Chemometrics


I have created a new blog with Quarto, to work in R and RStudio with Near Infrared spectroscopy data. For the moment, a series of posts is coming about Near Infrared Transmittance with meat data. It is a good occasion to start learning from the beginning how to work with chemometrics in R with this type of spectra.


https://jrcuesta.quarto.pub/nir-chemometrics/

20 Sep 2022

NIT Spectroscopic Tutorial with Caret (part 5)

I have been working these days with the tecator data set, trying to find the best possible statistics for the cross validation, and checking afterwards with the test set. I use algorithms to remove the scatter, but for the moment no derivatives to enhance the resolution. I show the statistics for the CV and the validation results for the test set.

Multiplicative Scatter Correction gives the best statistics for protein at the moment, but some combinations of math treatments could still be tried.
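A minimal sketch of the MSC step, assuming prospectr's msc() with its ref_spectrum argument (the test set must be corrected with the training reference spectrum):

library(prospectr)
ref <- colMeans(absorp_train)                      # reference = mean training spectrum
absorp_train_msc <- msc(absorp_train, ref_spectrum = ref)
absorp_test_msc <- msc(absorp_test, ref_spectrum = ref)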


This is the XY plot for the test set with the PLS model, using the train set math-treated with MSC.



16 Sep 2022

NIT Spectroscopic Tutorial with Caret (part 4)

Looking at the spectra of the previous post, it was clear that we must apply a pre-treatment to remove the baseline shifts. One way to do it is to treat every spectrum individually: calculate a linear model (absorbance values vs. wavelengths) and subtract the fitted values from the absorbance values, getting the baseline-corrected spectrum. We repeat the process for every spectrum.

This function is available in the pracma package under the name detrend. If we use it for the spectra pre-processing, together with Caret's "center" and "scale", we change the view from this:

 


To this:

 


This way we can see more clearly the variation in the spectra, which we can try to interpret in some way. With this new pre-treatment we can also check whether the PCA changes, and find outliers more easily.
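A hedged sketch of this step (pracma's detrend() works on the columns of a matrix, so we transpose the spectra matrix, detrend, and transpose back):

library(pracma)
# remove the linear trend from every spectrum (rows of absorp_train)
absorp_train_dt <- t(detrend(t(absorp_train), tt = "linear"))
matplot(seq(850, 1048, by = 2), t(absorp_train_dt), type = "l",
        xlab = "Wavelengths", ylab = "Absorbance", main = "Detrended spectra")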

One of the options for interpretation is the correlation spectrum (the correlation of every wavelength with the parameter of interest). In this case we can see how fat is inversely correlated with protein and moisture, and that there is some positive correlation between protein and moisture.


We can see this if we run the correlation plot for the parameters:
correlation_par <- cor(endpoints)
library(corrplot)
corrplot(correlation_par, method = "ellipse")




See the code in basic R for the correlation spectra:

cor_moi <- cor(endpoints_train[ , 1], 
               train_scaled_fit)
cor_fat <- cor(endpoints_train[ , 2], 
               train_scaled_fit)
cor_prot <- cor(endpoints_train[ , 3], 
                train_scaled_fit)

matplot(seq(850, 1048, by = 2), t(cor_moi),
        xlab = "Wavelengths", ylab = "Correlation", 
        main = "Correlation Spectra", type = "l", 
        ylim = c(-1, 1))

par(new = TRUE)

matplot(seq(850, 1048, by = 2), t(cor_fat),
        xlab = " ", ylab = " ", 
        main = " ", type = "l", col = "red", 
        ylim = c(-1, 1))

par(new = TRUE)

matplot(seq(850, 1048, by = 2), t(cor_prot),
        xlab = " ", ylab = " ", 
        main = " ", type = "l", col = "blue", 
        ylim = c(-1, 1))


legend("topright",                    
       legend = c("Moisture", "Fat", "Protein"),
       col = c("black", "red", "blue"),
       lty = 1)

12 Sep 2022

NIT Spectroscopic Tutorial with Caret (part 3)

To follow the tutorial, you can read the previous posts first.

We used two pretreatments in the previous posts to develop the principal component analysis, but we can add another to remove the skewness of the predictor variables. We could check the skewness by plotting a histogram of the absorptions at every wavelength, but that would mean 100 histograms to look at, so the best way to check it is a boxplot of the spectra (centered and scaled):

boxplot(train_scaled, main = "preProcess(Center + Scale)")

In the plot we can see that there is skewness to the right, and some outliers.

Now we can apply the BoxCox algorithm together with Center and Scale, and look at the boxplot spectra:

train_scaled_2 <- preProcess(absorp_train,
                     method = c("BoxCox", "center", "scale"))
# preProcess() only stores the transformation; apply it with predict()
train_transformed_2 <- predict(train_scaled_2, absorp_train)

boxplot(train_transformed_2, 
        main = "preProcess(BoxCox + Center + Scale)")

The result shows how the skewness is removed. Of course, the PCA calculation will give different score values, but two terms will still explain almost all the variance.

Looking at the spectra (treated either with "Center" and "Scale", or with "BoxCox", "Center" and "Scale") we see a baseline shift that it would be convenient to remove with a pretreatment, so that we can clearly see the variability due to the fat, protein or moisture content.

6 Sep 2022

NIT Spectroscopic Tutorial with Caret (part 2)

The first post of this tutorial brought a nice conversation on Twitter. The idea now is to use Caret with other packages (necessary for some of the preprocessing of the data), but I will work in parallel with the "tidymodels" packages (a good way to learn how to use them).

Thanks to Max Kuhn for sending me a link to some interesting code from James Wade for using the meat data in the tidyverse and tidymodels environment. By the way, the great book "Applied Predictive Modeling" (Max Kuhn & Kjell Johnson) is very useful, and this tutorial started as a possible solution to exercise 6.1 in the book.

One of the problems when working with NIR or NIT spectra is the high collinearity of the predictors. This is because NIR (Near Infrared Reflectance) and NIT (Near Infrared Transmittance) spectra are formed by overtones (first, second, third) and combination bands of the fundamental bands which appear in the MIR (Mid Infrared). Another problem with this type of spectra is the high level of overlapping: NIR or NIT spectra contain a lot of hidden information which needs statistical or mathematical treatments to make it visible in a way that is useful for developing models. This requires the use of preprocessing methods (centering, scaling, resolution improvement, ...), which will be treated in the coming posts.

For the moment we will deal with the spectra without any treatment ("raw spectra"), and we create a correlation matrix to check the collinearity.

correlation <- cor(absorp)
library(corrplot)
corrplot(correlation)




As we can see, all the wavelengths are highly correlated, so we have to reduce the number of variables (predictors) in some way, looking for other variables that are not correlated and which capture most of the variability contained in the spectra matrix without introducing noise. This can be done with principal component analysis (PCA). Depending on the number of samples, we could use as many new variables (we will call them "terms") as predictors, but this will never be the case: we have to decide how many terms to keep in order not to lose information (underfitting) and not to introduce noise (overfitting).


Before running the principal component analysis we divide the whole data set into a train set and a test set.

Creating a training/test split: Normally we use 75% random samples for the training set and the remaining 25% for the test set

set.seed(1234)
#Create partition index
data_split <- createDataPartition(endpoints[ ,1], p = .75)
data_split <- data_split$Resample1

# split data
absorp_train <- absorp[data_split, ]
absorp_test <- absorp[-data_split, ]



The test set is set apart and won't be used until necessary (at the end of the model development). Refuse to use it to make decisions; it must be totally independent.

When running the PCA we have the option to use some pretreatments:

  • Center: the mean spectrum (the mean of every wavelength over the training samples) is subtracted from every spectrum.
  • Scale: every wavelength is divided by its standard deviation over the training samples.

We can see how center and scale affect the spectra:

train_scaled <- scale(absorp_train, center = TRUE, 
                     scale = TRUE )
matplot(seq(850, 1048, by = 2), t(train_scaled),
        xlab = "Wavelengths", ylab = "Absorbance",
        main = "Meat spectra", type = "l")




These pretreatments are applied wavelength by wavelength (column-wise), so at the end of the preprocessing (before developing the PCA) we have a transformed data matrix with the same dimensions (215 × 100).

pca_object <- prcomp(absorp_train, center = TRUE, 
                     scale. = TRUE)
percent_variance <- pca_object$sdev^2/sum(pca_object$sdev^2)*100
plot(percent_variance[1:20])
head(percent_variance)
head(pca_object$x[1:5 , 1:2 ])
#scores with just 2 terms



We can see that just a few terms (two or three) are enough to retain almost all the variance, but this time let's calculate the principal components with Caret. First we have to give column names to the absorp_train matrix:

wavelengths <- seq(850, 1048, by = 2)
colnames(absorp_train) <- wavelengths
colnames(absorp_train)

trans <- preProcess(absorp_train, 
                    method = c("center", "scale", "pca"))
trans #info about the PCA calculation
transformed <- predict(trans, absorp_train)
head(transformed[1:2 , ]) 
#scores with the 2 recommended terms (95% variance explained)

By default the threshold for the explained variance is 95%, but we can increase it if necessary. (In the coming posts we will apply other methods, to see if correcting the skewness of the predictors improves the results.)

Both methods give the same score and loading matrices, which we will interpret in a coming post.

5 Sep 2022

NIT Spectroscopic Tutorial with Caret (part 1)

Over several posts we will practice the use of Caret with spectroscopic data from an Infratec instrument. The first thing to do is to load the library and the data, which contain the spectroscopic and reference values.

#Loading the data
library(caret)
data(tecator)
?caret::tecator #Details about the tecator data

Two tables are loaded, one with the spectroscopic data (absorp), and another with the reference values (endpoints).

dim(endpoints)
dim(absorp)

We can see that we have 215 samples measured at 100 data points in the spectroscopic data, and three parameters with reference values.

The data points go from 850 to 1048 nm, every two nm, so there are 100 data points in total, and the reference values are for moisture, fat and protein. As we can read in the help info, the samples are chopped pure meat measured in transmittance.

We can see first that the structure of the data is numeric

str(absorp)
str(endpoints)

and their class (matrix)

class(absorp)
class(endpoints)

We have a matrix for the absorption values and a matrix for the reference values. Let's see the raw spectra

matplot(seq(850, 1048, by = 2), t(absorp),
        xlab = "Wavelengths", ylab = "Absorbance",
        main = "Meat spectra", type = "l")

As we can see, due to particle size, differences in pathlength and other physical circumstances, there is quite a lot of scatter in the spectra. These physical variables, together with the chemical information, are the data we will have to treat and manage to develop models.

13 Aug 2022

Soviet Space Dogs



Thanks to Duncan Geere for sharing this nice database. It covers a really interesting period before Yuri Gagarin's flight.

These dogs were found in the street and trained to go to space in really very hard conditions.

I used the template to prepare this graph, where we see the occupants (normally two dogs) and how in the sixties several trips succeeded in going into orbit for several days.

Five flights are not included because their altitude is unknown.



6 Aug 2022

#tidytuesday:Eurovision (2022 Final Televote Votes)

This is the clustering for the televotes.
Note the single-member group for Ukraine, and see the case of Serbia.


#tidytuesday:Eurovision (2022 Final Jury Votes)

There is some controversy about the votes in the Eurovision 2022 final, with possible agreements between some countries. Data science can help us see patterns among the countries by checking the clusters, and on that basis investigate further.

I show the cluster dendrogram for the final jury votes of this year, 2022.
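As a hedged sketch of how such a dendrogram can be built (votes_wide is a hypothetical country-by-country matrix of the points given):

d <- dist(scale(votes_wide))             # distances between voting profiles
plot(hclust(d, method = "ward.D2"),
     main = "Eurovision 2022 final - jury votes", xlab = "", sub = "")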




4 Aug 2022

#tidytuesday:Eurovision (Spanish Artist and their positions)


As we see in the plot, Conchita Bautista and Raphael each represented Spain at Eurovision twice. The colour in the plot shows their position (the greener the better).

spanish_artists <- eurovision_1 %>%
    filter(section == "final" | section == "grand-final") %>%
    filter(artist_country == "Spain") %>%
    select(year, artist, song, host_city, host_country, rank)

spanish_artists %>%
    mutate(artist = fct_reorder(artist, year)) %>%
    ggplot(aes(x = year, y = artist)) +
    geom_tile(aes(fill = rank), colour = "red") +
    scale_fill_gradient(low = "green", high = "red", limits = c(1, 26)) +
    scale_x_continuous(breaks = seq(1956, 2022)) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(y = "Spanish artist on Eurovision by year", x = "Year")

3 Aug 2022

#tidytuesday:Eurovision (What happened in 1956?)

Another special year. It was the first edition of the Eurovision Festival, with seven participating countries (the founders) and two artists representing each country with different songs, except Switzerland and Luxembourg, with just one singer for the two songs.

Lys Assia from Switzerland was the winner.

There is a nice explanation on Wikipedia about that edition.

eurovision_1 %>%
    filter(year == 1956) %>%
    count(artist_country, sort = TRUE)
# A tibble: 7 × 2
  artist_country     n
  <chr>          <int>
1 Belgium            2
2 France             2
3 Germany            2
4 Italy              2
5 Luxembourg         2
6 Netherlands        2
7 Switzerland        2

eurovision_1 %>%
    filter(year == 1956) %>%
    count(artist, artist_country, song, winner, sort = TRUE)
# A tibble: 14 × 5
   artist                 artist_country song                            winner     n
   <chr>                  <chr>          <chr>                           <lgl>  <int>
 1 Corry Brokken          Netherlands    Voorgoed Voorbij                FALSE      1
 2 Dany Dauberson         France         Il Est Là                       FALSE      1
 3 Franca Raimondi        Italy          Aprite Le Finestre              FALSE      1
 4 Freddy Quinn           Germany        So Geht Das Jede Nacht          FALSE      1
 5 Fud Leclerc            Belgium        Messieurs Les Noyés De La Seine FALSE      1
 6 Jetty Paerl            Netherlands    De Vogels Van Holland           FALSE      1
 7 Lys Assia              Switzerland    Das Alte Karussell              FALSE      1
 8 Lys Assia              Switzerland    Refrain                         TRUE       1
 9 Mathé Altéry           France         Le Temps Perdu                  FALSE      1
10 Michèle Arnaud         Luxembourg     Les Amants De Minuit            FALSE      1
11 Michèle Arnaud         Luxembourg     Ne Crois Pas                    FALSE      1
12 Mony Marc              Belgium        Le Plus Beau Jour De Ma Vie     FALSE      1
13 Tonina Torielli        Italy          Amami Se Vuoi                   FALSE      1
14 Walter Andreas Schwarz Germany        Im Wartesaal Zum Großen Glück   FALSE      1

All this must be taken into account when working with statistics on the Eurovision database.


2 Aug 2022

#tidytuesday:Eurovision (What happened in 1969?)

It was a special year with four winners:

eurovision_1 %>%
    filter(section == "final" | section == "grand-final", winner == TRUE) %>%
    count(winner, year, host_country, sort = TRUE)


# A tibble: 66 × 4
   winner  year host_country       n
   <lgl>  <dbl> <chr>          <int>
 1 TRUE    1969 Spain              4
 2 TRUE    1956 Switzerland        1
 3 TRUE    1957 Germany            1
 4 TRUE    1958 Netherlands        1
 5 TRUE    1959 France             1
 6 TRUE    1960 United Kingdom     1
 7 TRUE    1961 France             1
 8 TRUE    1962 Luxembourg         1
 9 TRUE    1963 United Kingdom     1
10 TRUE    1964 Denmark            1
# … with 56 more rows

 50 years ago today: Four winners at Eurovision 1969 in Madrid

#tidytuesday: Eurovision

Another question we can ask about the history of the Eurovision Festival is: which country has hosted the festival most often?

Normally the host country is the one which won the previous year, but for different reasons some years it is not like that. That will be the case in 2023, when the host country will be the United Kingdom (2nd place in 2022) instead of Ukraine (1st place in 2022), for obvious reasons.

As we can see in the plot, the UK is the country which has hosted the Eurovision Festival most often.

# distinct() gives the unique host_country/event/year combinations
countries_events <- eurovision_1 %>%
    distinct(host_country, event, year)

countries_events %>%
    count(host_country, sort = TRUE) %>%
    ggplot(aes(x = reorder(host_country, n), y = n)) +
    geom_col() +
    coord_flip()



#tidytuesday: Eurovision

From today there is a new label in the blog for working with the #tidytuesday data.

There are a lot of tutorials for learning how to tidy our data, prepare it to extract information, and develop models (I normally watch the videos of Julia Silge and David Robinson). So even if these posts are not much related to the world of chemometrics, they will help us to have fun with the R language and to gain skills and habits that we will use, when necessary, in our daily work with R.

I selected the Eurovision data in this case, and the first thing I asked myself was to find a plot which shows the evolution of the Eurovision contest.

eurovision_1 %>%
    count(year, section, sort = TRUE) %>%
    ggplot(aes(year, n, fill = section)) +
    geom_col(width = 1, colour = "black") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.1)) +
    scale_x_continuous(breaks = seq(1956, 2022)) +
    theme(axis.text = element_text(size = 7)) +
    coord_flip()


The plot shows the different periods of the Eurovision contest, and how the number of countries increased almost every year. We can see when the system of semi-finals started, and several curious things that we will try to discover under the label tidytuesday:Eurovision.


28 Jul 2022

Convert Nanometers to Wavenumbers

We all know that the wavelength scale (X axis) is different for a dispersive NIR instrument and an FT-NIR. Sometimes, for some reason, we want to convert the wavelength scale of a NIR spectra database to the wavenumber scale, or vice versa.

In this case I apply the conversion in R to the NIRsoil$spc data (from the "prospectr" package), from nanometers:



to wavenumbers (cm-1)


We just have to apply the formula:
wn <- 1/(x*10^-7)
where x is the wavelength in nm of every element in the range of our NIR spectrum, in this case 1100 to 2498 nm with an interval of 2 nm.
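A minimal sketch of this conversion (assuming the column names of NIRsoil$spc hold the wavelengths in nm):

library(prospectr)
data(NIRsoil)
nm <- as.numeric(colnames(NIRsoil$spc))  # 1100 to 2498 nm, every 2 nm
wn <- 1/(nm * 10^-7)                     # wavenumbers in cm-1
matplot(wn, t(NIRsoil$spc), type = "l",
        xlab = "Wavenumber (cm-1)", ylab = "Absorbance")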



12 Jul 2022

NIR RMS between subsamples

One of the advantages of R is that you can develop a function for whatever calculation you need. In this case I wanted to apply the RMS calculation to every row of the difference matrix we saw in the previous post.

We can see the distribution of the RMS values with a histogram.
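A minimal sketch of that function, assuming diff_spectra is the difference matrix from the previous post:

rms <- function(x) sqrt(sum(x^2) / length(x))   # RMS of one difference spectrum
rms_values <- apply(diff_spectra, 1, rms)       # one value per subsample pair
hist(rms_values, main = "RMS between subsamples", xlab = "RMS")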



11 Jul 2022

Difference between subsamples

Once we have sorted and grouped all the samples and subsamples, we can create two matrices, one containing the odd rows (first subsamples) and another containing the even rows (second subsamples). This way we can calculate the spectral differences between subsamples and check the spectra.

This difference spectra matrix is important for calculating the RMS between subsamples, which gives us an idea of how similar the subsamples are to each other: just sum the squared values of every difference spectrum, divide by the number of wavelengths, and take the square root.
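A hedged sketch of the idea, assuming spectra_sorted holds the two subsamples of every sample in consecutive rows:

sub1 <- spectra_sorted[seq(1, nrow(spectra_sorted), by = 2), ]  # first subsamples
sub2 <- spectra_sorted[seq(2, nrow(spectra_sorted), by = 2), ]  # second subsamples
diff_spectra <- sub1 - sub2                                     # difference matrix
matplot(t(diff_spectra), type = "l",
        xlab = "Data point", ylab = "Absorbance difference")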






6 Jul 2022

Soil clay regressions: Looking for the better accuracy (part 3)

We now have four calibrations for clay in soil, developed with a selection of samples from the LUCAS database (Spanish crop soils). We want to see if those calibrations predict, with reasonable accuracy, a new set of 155 soil samples from a Spanish region, acquired on a different instrument and at a different laboratory.

In these cases it is normal to expect a bias or a slope in the predictions, so we can use the model with those adjustments applied until the database is updated and a new, expanded model with the new variability (instrument, laboratory, region, ...) is developed.
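A minimal sketch of such an adjustment (lab and pred are hypothetical vectors holding the laboratory values and the predictions for the new samples):

# regress the lab values on the predictions to estimate bias and slope
fit <- lm(lab ~ pred)
coef(fit)                                            # intercept (bias) and slope
pred_adjusted <- coef(fit)[1] + coef(fit)[2] * pred  # adjusted predictions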

Well, these are the "Lab vs. Predictions" XY plots for this independent data set:

PLS predictions:


Random Forest Predictions

Cubist Predictions


MBL Predictions


As we see, in all cases some adjustment or calibration update is needed. We could try to reprocess everything to find a better configuration which improves these values, but in that case this new set would never be independent again, as it is now.



5 Jul 2022

Soil clay regressions: Looking for the better accuracy (part 2)

Let's select a seed (to fix the training and test sets) and develop the regression with four algorithms (PLS, Random Forest, Cubist and Memory Based Learning).
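A hedged sketch of one of the four fits, using caret as in the NIT tutorial posts (train_x, train_y, test_x and test_y are hypothetical names for the training and test sets):

library(caret)
set.seed(1234)                                   # fix the resampling
ctrl <- trainControl(method = "cv", number = 10)
pls_fit <- train(x = train_x, y = train_y, method = "pls",
                 tuneLength = 20, trControl = ctrl,
                 preProcess = c("center", "scale"))
postResample(predict(pls_fit, test_x), test_y)   # RMSEP, R2, MAE on the test set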

These are the Test Validation XY plots and statistics:

PLS Regression:


Random Forest Regression:


Cubist Regression:


Memory Based Learning Regression


We can see that PLS gives the best RMSEP, but some samples are outside the warning limits, while in the MBL the residual distribution is more stable and there are no samples outside the action limit threshold.

Questions about NIR Modelling (001)

Sometimes I receive very interesting emails from readers, so I create this post to answer the reader and to keep it open for comments, where other readers can add whatever they consider from their own experience.


The choice of the wavelengths corresponding to the studied parameter (is it better to keep the whole scan or to choose a part which represents the targeted parameter? If so, how to do it?)

Normally the whole scan is used, and the PLS algorithm builds latent variables (PLS terms) which represent the spectral variance and its covariance with the studied parameter. Looking at the regression coefficients, and knowing the wavelengths at which the relevant constituents absorb, you can try to interpret the regression and decide if certain wavelength zones could be excluded (flat zero zones, ...). Regression coefficients are very difficult to interpret due to the math treatments applied (especially derivatives).

The other option is to choose a few specific wavelengths, where normally the first one is the one at which the parameter absorbs (for example 1940 nm for water), and then continue adding wavelengths of other constituents that interfere with it, or zones that do not absorb but where scatter is observed. Normally the software helps you with these selections, and you always have the statistics to see if the added wavelengths improve the regression. This type of algorithm is normally called MLR (Multiple Linear Regression).
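A minimal sketch of the MLR idea (nir_df is a hypothetical data frame; 1940 nm is the water band mentioned above, the other two wavelengths are placeholders):

# MLR at a few chosen wavelengths; check the statistics after each addition
mlr_fit <- lm(moisture ~ nm_1940 + nm_2100 + nm_2180, data = nir_df)
summary(mlr_fit)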


How to split the samples between calibration and validation (is there a test to do?)

Split 80% of the samples randomly into the training set and the remaining 20% into the test set. If you have a lot of samples from different years you can use other approaches (the older samples for calibration and the newer ones for validation, ...). Anyway, if the calibration is robust, you should get similar results.
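A minimal sketch in base R (caret's createDataPartition(), used earlier in the blog, is another option); spectra and y are hypothetical names:

set.seed(42)
idx <- sample(nrow(spectra), size = round(0.8 * nrow(spectra)))
train_x <- spectra[idx, ];  train_y <- y[idx]    # 80% for training
test_x <- spectra[-idx, ];  test_y <- y[-idx]    # 20% for testing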


The criteria for choosing the tests to be performed for pre-processing (2nd derivative, SNV, MSC, etc.).

Normally the criterion is to choose the simplest math treatment. For the scatter, if you have a lot of samples and all the possible variability represented, you can try MSC; if not, one of the best options is to combine SNV and Detrend.
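A hedged sketch with prospectr (assumed spectra matrix spectra and wavelength vector nm); note that prospectr's detrend() already applies SNV followed by a polynomial detrend:

library(prospectr)
spectra_snv <- standardNormalVariate(spectra)  # SNV alone
spectra_snv_dt <- detrend(spectra, wav = nm)   # SNV + detrend in one call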

First derivative is difficult to interpret (a maximum in the raw spectrum becomes a zero crossing); second derivative is easier to interpret. The important option is the gap you use (not too long, because you can lose information, and not too short, because you add noise).
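And a hedged sketch of the gap choice with prospectr's gapDer() (assumed matrix spectra; the segment sizes are arbitrary examples):

library(prospectr)
# 2nd derivative with two different segment sizes, to compare noise vs. resolution
spectra_d2_short <- gapDer(spectra, m = 2, w = 1, s = 4)   # short gap: sharper, noisier
spectra_d2_long <- gapDer(spectra, m = 2, w = 1, s = 16)   # long gap: smoother, may lose detail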