21 feb 2022

To consider for the Mahalanobis distance calculation

As we saw in the post "Trying to find high content gypsum samples (part 3)", when we develop the principal component analysis we try to explain as much variance as possible (we set an explained variance limit), and unless we look at all the score maps, we may not have a good idea of what is happening in our data.

The average spectrum represents all the groups, including the samples with gypsum, so we cannot expect the Mahalanobis distance to mark every gypsum sample as an outlier. Probably only the ones with high and very high gypsum content will be marked as outliers. Now that we have made two sample sets ("gypsum content" and "no or low gypsum content"), we can check them separately to understand our data better.
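To make the point about the average concrete, here is a minimal sketch with simulated data (not the LUCAS spectra). It assumes the distance is computed on the first PC scores with `stats::mahalanobis()` and that the cut-off of 3 applies to the square root of its output; the exact scaling behind the plots in this post is not stated, so take both as assumptions. The names `no_set`, `yes_set`, `md_all` and `md_no` are made up for the example.

```r
# Toy data only: 200 "No" samples and 20 "Yes" samples in 5 variables.
set.seed(1)
no_set  <- matrix(rnorm(200 * 5), ncol = 5)
yes_set <- matrix(rnorm(20 * 5, mean = 2), ncol = 5)
X <- rbind(no_set, yes_set)
gypsum <- factor(rep(c("No", "Yes"), times = c(200, 20)))

# PCA on the whole set: the centre is the average of ALL samples,
# so the "Yes" samples pull it towards themselves.
pca <- prcomp(X, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:3]

# stats::mahalanobis() returns the SQUARED distance, hence the sqrt().
md_all <- sqrt(mahalanobis(scores, colMeans(scores), cov(scores)))

# Assumed cut-off of 3 on the distance itself.
print(table(gypsum, outlier = md_all > 3))

# Same scores, but centre and covariance come from the "No" samples only.
no_scores <- scores[gypsum == "No", ]
md_no <- sqrt(mahalanobis(scores, colMeans(no_scores), cov(no_scores)))
print(table(gypsum, outlier = md_no > 3))
```

In a setup like this the "Yes" samples typically sit farther from the centre when that centre is defined by the "No" set alone, which is the reason for studying the two sample sets separately.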

Let's check the Mahalanobis distance plot again, now that we have, by spectra visualization and discrimination by correlation, the two sample sets ("No" and "Yes" gypsum content):


See how some of the gypsum content samples (but not all) are above the Mahalanobis distance cut-off of 3.
Now that the "No" samples are in a new sample set, we can see their spectra:

Some samples look like outliers, but the rest seem to group quite well together. Anyway, let's wait to see the PC score maps:
Now we see different patterns.

Finally, let's see the new Mahalanobis distances:








7 feb 2022

In the Soil Pit - 1 with Professor Ray Weil: Soil Horizons

Who better than Professor Ray Weil to give us soil lessons? Now we have that opportunity on YouTube.
Surely, if we follow these videos, we will look at soil pits with different eyes when we find them on our walks.

Trying to find high content gypsum samples (part 3)

If we go back to the post "PCA with the first derivative", we saw some groupings in the scores (in the maps of PC2 vs. the rest of the PCs). Later, when looking at the second loading, we thought that these could be the samples with gypsum content.

In the last two posts, "Trying to find high content gypsum samples (part 1)" and "Trying to find high content gypsum samples (part 2)", we found those samples by correlation with a gypsum reference spectrum. Now we can create a new variable called "gypsum", a factor that takes the value "Yes" or "No", so we can see more easily in the score maps whether that grouping was due to the gypsum content:

pairs(scores_1df[ ,1:6], col = scores_1df$gypsum)
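The post does not show how the "gypsum" factor was built, so the following is only a sketch of one possible way, assuming the 0.70 correlation threshold from part 2. The data here are random stand-ins that reuse the names `scores_1df` and `corSG1` just for illustration.

```r
set.seed(2)
# Hypothetical stand-ins: 50 samples, 6 PC scores, one correlation each.
scores_1df <- as.data.frame(matrix(rnorm(50 * 6), ncol = 6))
names(scores_1df) <- paste0("PC", 1:6)
corSG1 <- runif(50, min = 0, max = 1)

# Label each sample from its correlation with the gypsum reference
# (0.70 is the threshold used in part 2; assumed here).
scores_1df$gypsum <- factor(ifelse(corSG1 > 0.70, "Yes", "No"),
                            levels = c("No", "Yes"))

# A factor passed to `col` is used through its integer codes,
# so "No" gets colour 1 and "Yes" colour 2 of the current palette.
pairs(scores_1df[, 1:6], col = scores_1df$gypsum)
```

Fixing the level order with `levels = c("No", "Yes")` keeps the colour assignment stable even if one of the two groups happens to be empty.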



Trying to find high content gypsum samples (part 2)

If you have read the previous post (Trying to find high content gypsum samples), now it is time for the fine tuning in a visual way. I set apart the samples with a correlation higher than a certain value (0.70 in this case), and the other samples go to a second sample set. This way I hope to have the samples with high gypsum content in one set and the samples with low or no gypsum content in the other. Finally, I plot them together and have a look.

# Indices of the spectra whose correlation with the gypsum reference exceeds 0.70
cor3SG1 <- which(corSG1 > 0.70)
explore1 <- lucas_spain$spcnir_SG[cor3SG1, ]   # candidate gypsum samples
explore2 <- lucas_spain$spcnir_SG[-cor3SG1, ]  # low or no gypsum content
# The column names hold the wavelengths; convert them to numbers for the x axis
wavelength <- as.numeric(colnames(lucas_spain$spcnir_SG))
matplot(wavelength,
        t(explore1), type = "l", lty = 3,
        xlab = "wavelength", ylab = "Absorbance",
        col = "red", ylim = c(-0.025, 0.025))
# add = TRUE overlays the second set on the same axes
matplot(wavelength,
        t(explore2), type = "l", lty = 3,
        col = "blue", add = TRUE)


We can take a closer view where the differences are clearer:

Now we can study the blue samples (low or no gypsum content) separately, looking for other groupings, and do the same with the red ones (gypsum content).


6 feb 2022

Trying to find high content gypsum samples

How can I find the high content gypsum samples in the LUCAS Spanish database? We have the spectrum of pure gypsum, so we can try to correlate (or measure a distance from) every spectrum of the database with the pure gypsum spectrum and see if we find some threshold or gap that separates these high content gypsum samples.

In this case I try the correlation, because it is the easiest algorithm and I don't have a population of gypsum soil spectra (just one pure reference spectrum).

Let's run the correlation:

corSG1 <- as.numeric(cor(t(lucas_spain$spcnir_SG[,21:546]),
                     mineralRef_nir_2nm_SG1[3, 21:546]))

Now we have a correlation value for every spectrum, so we can plot the histogram:

hist(corSG1, breaks = 1000)

We can see the distribution and the tail on the right, with some grouping over 0.85, so a way forward is to look at the samples with a correlation over 0.8 and fine tune the selection from there.
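For readers without the LUCAS data at hand, the same idea can be tried on simulated spectra. This is only a toy sketch: the wavelength axis, the single band standing in for gypsum, the noise level and the names `wl`, `ref`, `spc` and `cor_toy` are all assumptions made for the example.

```r
set.seed(3)
wl <- seq(1100, 2498, by = 2)        # hypothetical wavelength axis (nm)
# Hypothetical reference: one smooth band standing in for pure gypsum.
ref <- exp(-((wl - 1940) / 40)^2)

# 300 toy spectra: noise plus a share of the reference,
# larger for the last 30 "gypsum-rich" samples.
share <- c(runif(270, 0, 0.2), runif(30, 0.6, 1))
spc <- outer(share, ref) +
  matrix(rnorm(300 * length(wl), sd = 0.05), nrow = 300)

# One correlation per spectrum (rows of spc become columns after t()).
cor_toy <- as.numeric(cor(t(spc), ref))

# Histogram of the correlations, then the candidates above the threshold.
hist(cor_toy, breaks = 50)
candidates <- which(cor_toy > 0.8)
print(length(candidates))
```

Because the correlation ignores scale and offset, it reacts to the shape of the band rather than to its height, which is convenient when only one reference spectrum is available.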