28 jul 2022

Convert Nanometers to Wavenumbers

We all know that the X-axis scale is different for a dispersive NIR instrument (wavelength, in nanometers) and for an FT-NIR instrument (wavenumber, in cm-1). Sometimes we want to convert the wavelength scale of a NIR spectra database to the wavenumber scale, or vice versa.

In this case I use R to convert the NIRsoil$spc data (from the "prospectr" package) from nanometers:



to wavenumbers (cm-1)


We just have to apply the formula:
wn <- 1/(x*10^-7)
Where x is the wavelength in nm of each point in the range of our NIR spectrum, in this case 1100 to 2498 nm at 2 nm intervals.
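As a minimal sketch of this conversion in base R, assuming only the wavelength grid described above (1100 to 2498 nm every 2 nm) rather than the actual NIRsoil data. The names wl_nm, wn_cm and nm_to_wn are illustrative:

```r
# Wavelength axis assumed from the text: 1100 to 2498 nm, 2 nm step
wl_nm <- seq(1100, 2498, by = 2)

# 1 nm = 1e-7 cm, so wavenumber (cm-1) = 1 / (wavelength expressed in cm)
nm_to_wn <- function(x) 1 / (x * 10^-7)
wn_cm <- nm_to_wn(wl_nm)

# The same formula converts wavenumbers back to nanometers
wl_back <- nm_to_wn(wn_cm)

range(wn_cm)
```

Note that the resulting wavenumber axis runs in decreasing order and is no longer evenly spaced, so only the axis labels change here; the spectra themselves are not interpolated.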



12 jul 2022

NIR RMS between subsamples

One of the advantages of R is that you can write your own function and apply it to your calculations. In this case I wanted to apply the RMS calculation to every row of the difference matrix we saw in the previous post.

A histogram shows the distribution of the RMS values:
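A minimal sketch of that idea in base R. The difference matrix here is simulated random noise (the real one is not shown in the post), and the names rms, diff_mat and rms_values are illustrative:

```r
# RMS of one difference spectrum: square, average over wavelengths, square root
rms <- function(d) sqrt(mean(d^2))

# Hypothetical difference matrix: 20 sample pairs x 700 wavelengths
set.seed(1)
diff_mat <- matrix(rnorm(20 * 700, sd = 0.001), nrow = 20, ncol = 700)

# Apply the function to every row (MARGIN = 1)
rms_values <- apply(diff_mat, 1, rms)

# Distribution of the RMS values
hist(rms_values, main = "RMS between subsamples", xlab = "RMS")
```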



11 jul 2022

Difference between subsamples

Once we have sorted and grouped all the samples and subsamples, we can create two matrices: one containing the odd rows (first subsample) and another containing the even rows (second subsample). This way we can calculate the spectral difference between subsamples and inspect the resulting spectra.

This matrix of difference spectra is needed to calculate the RMS between subsamples, which gives us an idea of how similar the subsamples are to each other. Just sum the squared values of each difference spectrum, divide by the number of wavelengths and take the square root.
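The steps above could be sketched in base R as follows. The spectra are simulated and assumed to be already sorted so that each sample's two subsamples sit in consecutive rows; all object names are illustrative:

```r
# Hypothetical sorted spectra: rows 1-2 = sample 1, rows 3-4 = sample 2, ...
set.seed(2)
n_samples <- 10
n_wl <- 700
spectra <- matrix(runif(2 * n_samples * n_wl), nrow = 2 * n_samples)

# Odd rows hold the first subsample, even rows the second one
first_sub  <- spectra[seq(1, nrow(spectra), by = 2), , drop = FALSE]
second_sub <- spectra[seq(2, nrow(spectra), by = 2), , drop = FALSE]

# One difference spectrum per sample
diff_mat <- first_sub - second_sub

# RMS per sample: sum of squares / number of wavelengths, then square root
rms_values <- sqrt(rowSums(diff_mat^2) / ncol(diff_mat))
```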






6 jul 2022

Soil clay regressions: Looking for the better accuracy (part 3)

Now we have four calibrations for clay in soil, developed with a selection of samples from the LUCAS database (Spanish crop soils). We want to see whether those calibrations predict with reasonable accuracy a new set of 155 soil samples from a Spanish region, acquired on a different instrument and at a different laboratory.

In these cases it is normal to expect a bias or a slope in the predictions, so we can use the model with those adjustments applied until the database is updated and a new, expanded model that includes the new variability (instrument, laboratory, region) is developed.
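A rough sketch of what a bias or slope adjustment can look like in base R. The lab values and predictions are simulated (only the set size of 155 comes from the text), and this is one common way to do it, not necessarily the adjustment used for the plots in this post:

```r
# Hypothetical lab values and model predictions for an independent set
set.seed(3)
lab  <- runif(155, min = 5, max = 50)
pred <- 4 + 0.8 * lab + rnorm(155, sd = 2)   # simulated bias and slope

# Bias: mean difference between predictions and lab values
bias <- mean(pred - lab)

# Slope and intercept from regressing lab values on predictions
fit <- lm(lab ~ pred)
intercept <- unname(coef(fit)["(Intercept)"])
slope     <- unname(coef(fit)["pred"])

# Adjusted predictions
pred_bias_adj  <- pred - bias
pred_slope_adj <- intercept + slope * pred

rmsep <- function(obs, est) sqrt(mean((obs - est)^2))
c(raw       = rmsep(lab, pred),
  bias_adj  = rmsep(lab, pred_bias_adj),
  slope_adj = rmsep(lab, pred_slope_adj))
```

Statistics computed on the same samples used to estimate the adjustment will look optimistic, so ideally the adjustment is estimated on one part of the new set and checked on the rest.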

These are the "Lab vs Predictions" XY plots for this independent data set:

PLS predictions:


Random Forest Predictions

Cubist Predictions


MBL Predictions


As we can see, in all cases some adjustment or calibration update is needed. We could reprocess everything to look for a configuration that improves these values, but then this new set would no longer be independent, as it is now.



5 jul 2022

Soil clay regressions: Looking for the better accuracy (part 2)

Let's select a seed (to fix the training and test sets) and develop the regression with four algorithms (PLS, Random Forest, Cubist and Memory Based Learning).

These are the Test Validation XY plots and statistics:

PLS Regression:


Random Forest Regression:


Cubist Regression:


Memory Based Learning Regression


We can see that PLS gives the best RMSEP, but some samples fall outside the action limits, whereas with MBL the residual distribution is more stable and no samples fall outside the action limits.
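To make this kind of comparison concrete, here is a small base R sketch that computes the RMSEP, the bias and the number of residuals outside a limit. The data are simulated, validation_stats is an illustrative helper, and the limit value is a placeholder, since the post does not say how the action limits were set:

```r
# Hypothetical test-set lab values and predictions from two models
set.seed(4)
lab    <- runif(60, 5, 50)
pred_a <- lab + rnorm(60, sd = 2)
pred_b <- lab + rnorm(60, sd = 3)

validation_stats <- function(obs, est, limit) {
  res <- obs - est
  c(rmsep     = sqrt(mean(res^2)),     # root mean square error of prediction
    bias      = mean(res),             # mean residual
    n_outside = sum(abs(res) > limit)) # residuals beyond the chosen limit
}

limit <- 6  # placeholder threshold, in the units of the lab value
rbind(model_a = validation_stats(lab, pred_a, limit),
      model_b = validation_stats(lab, pred_b, limit))
```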

Questions about NIR Modelling (001)

Sometimes I receive very interesting emails from readers, so I have created this post to answer one of them and to leave it open for comments, where readers can add what they think based on their own experience.

 

The choice of the wavelength corresponding to the studied parameter (is it better to keep all the scan or to choose a part which represents the targeted parameter? If any, how to do so?)

Normally the whole scan is used, and the PLS algorithm builds latent variables (PLS terms) that represent the spectral variance and its covariance with the studied parameter. Looking at the regression coefficients, and knowing the wavelengths at which the correlated constituents absorb, you can try to interpret the regression and decide whether certain wavelength zones could be excluded (such as flat, near-zero zones). Regression coefficients are very difficult to interpret because of the math treatments applied (especially derivatives).

Another option is to choose a few specific wavelengths. Normally the first one is the wavelength at which the parameter absorbs (for example, 1940 nm for water), and you continue adding wavelengths of other constituents that interfere with water, or of zones that do not absorb but where scatter is observed. The software normally helps you with these selections, and you always have the statistics to check whether the added wavelengths improve the regression. This type of algorithm is normally called MLR (Multiple Linear Regression).
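A toy illustration of this MLR approach in base R, using simulated spectra where a water band at 1940 nm sits on top of a random baseline offset (scatter). The choice of 1700 nm as a non-absorbing reference zone and all object names are assumptions made only for this example:

```r
set.seed(5)
wl <- seq(1100, 2498, by = 2)
n  <- 50
water   <- runif(n, 5, 15)                  # simulated reference values
scatter <- rnorm(n, sd = 0.05)              # simulated baseline offset per sample
band    <- exp(-(wl - 1940)^2 / (2 * 30^2)) # simulated water band shape
noise   <- matrix(rnorm(n * length(wl), sd = 0.002), nrow = n)
spectra <- 0.05 * outer(water, band) + scatter + noise

dat <- data.frame(
  water = water,
  a1940 = spectra[, which(wl == 1940)],  # wavelength where water absorbs
  a1700 = spectra[, which(wl == 1700)]   # hypothetical scatter-only zone
)

fit1 <- lm(water ~ a1940, data = dat)          # one wavelength
fit2 <- lm(water ~ a1940 + a1700, data = dat)  # add the scatter wavelength

# Residual standard error: check whether the added wavelength helps
c(one_wavelength = summary(fit1)$sigma, two_wavelengths = summary(fit2)$sigma)
```

In this simulation the second wavelength carries only the baseline offset, so adding it lets the regression cancel the scatter and the residual error drops.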

 

How to split the samples between calibration and validation (is there a test to do?)

Randomly assign 80% of the samples to the training set and the remaining 20% to the test set. If you have a lot of samples from different years you can use other approaches (for example, the older samples for calibration and the newer ones for validation). In any case, if the calibration is robust, you should get similar results.
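Both splitting approaches can be sketched in base R like this. The number of samples, the seed and the year vector are hypothetical:

```r
set.seed(123)   # arbitrary seed, fixes the random split
n <- 200        # hypothetical number of samples

# Random 80/20 split
train_idx <- sample(n, size = round(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# Alternative: older samples for calibration, the most recent year for validation
year      <- sample(2018:2022, n, replace = TRUE)  # hypothetical sampling years
train_old <- which(year < max(year))
test_new  <- which(year == max(year))
```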

 

The criteria for choosing the tests to be performed for pre-processing (2nd derivative, SNV, MSC, etc.).

Normally the criterion is to choose the simplest math treatment. For scatter, if you have a lot of samples and all the possible variability is represented, you can try MSC; if not, one of the best options is to combine SNV and Detrend.
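For reference, a minimal base R sketch of SNV followed by Detrend on simulated spectra. Detrend is written here as the residuals of a second-order polynomial fitted against wavelength, which is the usual definition; dedicated packages provide their own implementations, and the function names below are just local helpers:

```r
wl <- seq(1100, 2498, by = 2)
set.seed(6)
n <- 5
band <- exp(-(wl - 1940)^2 / (2 * 40^2))

# Simulated spectra: random offset + random slope + scaled band per sample
spectra <- t(sapply(seq_len(n), function(i) {
  runif(1, 0.1, 0.5) + runif(1, 1e-4, 3e-4) * (wl - 1100) + runif(1, 0.8, 1.2) * band
}))

# SNV: centre and scale each spectrum (row) by its own mean and standard deviation
snv <- function(X) t(apply(X, 1, function(s) (s - mean(s)) / sd(s)))

# Detrend: keep the residuals of a polynomial fit of each spectrum on wavelength
detrend_poly <- function(X, wl, degree = 2) {
  t(apply(X, 1, function(s) residuals(lm(s ~ poly(wl, degree)))))
}

spc_snv    <- snv(spectra)
spc_snv_dt <- detrend_poly(spc_snv, wl)
```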

The first derivative is difficult to interpret (a maximum in the raw spectrum becomes a zero crossing); the second derivative is easier to interpret. The important setting is the gap you use: not too long, because you can lose information, and not too short, because you add noise.
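The zero crossing can be checked with a simple gap difference in base R on a simulated band with its maximum at 1940 nm. This is a plain finite difference used only as an illustration, not the Savitzky-Golay derivative, and the gap value is arbitrary:

```r
wl  <- seq(1100, 2498, by = 2)
y   <- exp(-(wl - 1940)^2 / (2 * 40^2))  # simulated band, maximum at 1940 nm
gap <- 4                                  # gap in data points (8 nm here), even number
n   <- length(wl)

# First gap derivative: y[i + gap] - y[i], assigned to the centre of the gap
d1    <- diff(y, lag = gap)
wl_d1 <- wl[(1 + gap / 2):(n - gap / 2)]

# Second gap derivative: y[i + 2 * gap] - 2 * y[i + gap] + y[i]
d2    <- diff(y, lag = gap, differences = 2)
wl_d2 <- wl[(1 + gap):(n - gap)]

wl_d1[which.min(abs(d1))]  # where the first derivative crosses zero
wl_d2[which.min(d2)]       # where the second derivative has its minimum
```

Both positions coincide with the maximum of the raw band: the first derivative is zero there, while the second derivative shows an inverted peak, which is why it is easier to relate to the original spectrum.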


4 jul 2022

Soil clay regressions: Looking for the better accuracy (part 1)

Good traceability in our databases is important for developing accurate calibrations. We have seen with the LUCAS soil database that we can filter the data by sample origin (Spain in my case) and then by land type (I chose "Croplands" for this example). The samples are split randomly into training and test sets.

After that we must decide whether to use the whole wavelength range (VIS + NIR) or just the NIR. In this case the calibration is for clay, and different tests tell me that the complete range is the best option.
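The filtering and range selection described above might look like this in base R. The data are simulated and the column names (country, land_cover), the country codes, the wavelength grid and the 1100 nm cut-off for the NIR are all assumptions for illustration, not the actual LUCAS field names:

```r
# Hypothetical metadata and spectra
set.seed(7)
meta <- data.frame(
  country    = sample(c("ES", "FR", "PT"), 30, replace = TRUE),
  land_cover = sample(c("Cropland", "Grassland", "Woodland"), 30, replace = TRUE),
  clay       = runif(30, 5, 50)
)
wl <- seq(400, 2498, by = 2)
spectra <- matrix(runif(30 * length(wl)), nrow = 30, dimnames = list(NULL, wl))

# Filter by sample origin and land type
keep     <- meta$country == "ES" & meta$land_cover == "Cropland"
meta_sel <- meta[keep, ]

# Full range (VIS + NIR) versus NIR only
spc_full <- spectra[keep, , drop = FALSE]
spc_nir  <- spc_full[, wl >= 1100, drop = FALSE]
```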

After this, it is time to look for the best math treatment, trying in this case a second Savitzky-Golay (SG) derivative, a first SG derivative and SNV + Detrend scatter correction. The last two options gave me better validation statistics than the second SG derivative when using a PLS regression.

 This is the XY plot for the validation set:


Can we improve the results with another type of regression for these cropland samples? This is what we will see in the upcoming posts.

Another question can be: may I use this database to predict samples from another database? These can be samples from a different area in the same country, taken with a different instrument and analysed in a different laboratory. Checking all this is important to see the robustness of the calibrations.