Sometimes I receive very interesting emails from readers, so I have written this
post to answer one of them and to leave it open for comments, where other readers
can add what they have learned from their own experience.
The choice of the wavelengths corresponding to the studied parameter (is it better
to keep the whole scan or to choose the part which represents the targeted
parameter? If so, how?)
Normally the whole scan is used, and the PLS algorithm builds latent
variables (PLS terms) which represent the spectral variance and its covariance
with the studied parameter. Looking at the regression coefficients, and
knowing the wavelengths at which the constituents absorb, you can try to
interpret the regression and decide if certain wavelength zones could be
excluded (such as flat zones where the coefficients stay close to zero, ...).
Regression coefficients are very difficult to interpret due to the math
treatments applied (especially derivatives).
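
As a rough illustration of looking at the regression coefficients, here is a minimal sketch of a PLS1 model (NIPALS algorithm) written with NumPy. The function names, the `fraction=0.05` threshold and the idea of listing "low weight" wavelengths automatically are my own assumptions for the example, not a rule from any particular chemometric package; in practice you would plot the coefficients and judge the zones by eye.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Regression coefficients of a PLS1 model (NIPALS), one per wavelength."""
    Xk = X - X.mean(axis=0)              # mean-center the spectra
    yk = y - y.mean()                    # mean-center the reference values
    n_wl = X.shape[1]
    W = np.zeros((n_wl, n_components))   # weights
    P = np.zeros((n_wl, n_components))   # spectral loadings
    q = np.zeros(n_components)           # loadings of the parameter
    for a in range(n_components):
        w = Xk.T @ yk                    # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                       # scores of this PLS term
        tt = t @ t
        p = Xk.T @ t / tt
        q[a] = yk @ t / tt
        Xk = Xk - np.outer(t, p)         # remove what this term explains
        yk = yk - q[a] * t
        W[:, a], P[:, a] = w, p
    return W @ np.linalg.solve(P.T @ W, q)

def low_weight_wavelengths(wavelengths, coefs, fraction=0.05):
    """Wavelengths whose |coefficient| is below `fraction` of the largest one.

    The threshold is an arbitrary assumption, only meant as a starting point.
    """
    threshold = fraction * np.abs(coefs).max()
    return wavelengths[np.abs(coefs) < threshold]
```

The coefficients apply to mean-centered spectra, so a prediction would be `(X - X_mean) @ coefs + y_mean`. Zones returned by `low_weight_wavelengths` are only candidates for exclusion; the calibration should be recomputed and validated without them before deciding.
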
Another option is to choose a few specific wavelengths.
Normally the first one is the wavelength at which the parameter absorbs (example:
1940 nm for water), and you continue adding wavelengths of other constituents
that interfere with the water, or of zones that do not absorb but where scatter is
observed. Normally the software helps you with these selections, and you always
have the statistics to see if the added wavelengths improve the regression.
This type of algorithm is normally called MLR (Multiple Linear Regression).
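
The step-by-step selection described above can be sketched as a simple forward search. This is only an illustration under my own assumptions (the function name `forward_select`, the use of the standard error of calibration as the statistic, and stopping when it no longer improves); commercial software uses its own criteria and usually shows more statistics.

```python
import numpy as np

def forward_select(X, y, wavelengths, start_wl, max_terms=3):
    """Greedy MLR: start at `start_wl`, then keep adding the wavelength
    that lowers the standard error of calibration the most."""
    def standard_error(cols):
        # Design matrix: intercept plus the absorbances at the chosen wavelengths
        A = np.column_stack([np.ones(len(y)), X[:, cols]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        dof = len(y) - A.shape[1]        # assumes more samples than terms
        return np.sqrt(resid @ resid / dof)

    # First term: the wavelength closest to where the parameter absorbs
    chosen = [int(np.argmin(np.abs(wavelengths - start_wl)))]
    history = [standard_error(chosen)]
    while len(chosen) < max_terms:
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        errors = [standard_error(chosen + [j]) for j in candidates]
        best = int(np.argmin(errors))
        if errors[best] >= history[-1]:  # no improvement: stop adding terms
            break
        chosen.append(candidates[best])
        history.append(errors[best])
    return wavelengths[chosen], history
```

For water you would call it with `start_wl=1940`; `history` lets you see how much each added wavelength improved the regression.
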
How to split the samples between calibration and validation (is there a test to
do?)
Randomly assign 80% of the samples to the Training Set
and the remaining 20% to the Test Set. If you have a lot of samples from
different years you can use other approaches (the older samples for
calibration and the new ones for validation, ...). In any case, if the calibration
is robust, you should get similar results.
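
Both ways of splitting can be sketched in a few lines of plain Python. The function names, the fixed `seed` and the `(sample_id, year)` layout are assumptions made for the example.

```python
import random

def split_samples(sample_ids, test_fraction=0.2, seed=0):
    """Random split: returns (training_ids, test_ids)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)     # fixed seed so the split is repeatable
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]

def split_by_year(samples, first_validation_year):
    """Older samples for calibration, newer ones for validation.

    `samples` is an iterable of (sample_id, year) pairs.
    """
    samples = list(samples)
    train = [s for s, year in samples if year < first_validation_year]
    test = [s for s, year in samples if year >= first_validation_year]
    return train, test
```

With 100 samples, `split_samples(range(100))` gives 80 identifiers for the Training Set and 20 for the Test Set.
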
The criteria for choosing the pre-processing treatments to
apply (2nd derivative, SNV, MSC, etc.).
Normally the criterion is to choose the simplest math
treatment. For scatter, if you have a lot of samples and all the possible
variability is represented you can try MSC; if not, one of the best options is to
combine SNV and Detrend.
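
To make the SNV and Detrend combination concrete, here is a minimal NumPy sketch. It assumes the spectra are stored one per row, and it uses a second-degree polynomial for the detrend, which is the usual choice but still an assumption here; the function names are mine.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def detrend(spectra, wavelengths, degree=2):
    """Subtract from each spectrum a polynomial fitted across the wavelengths."""
    x = wavelengths - wavelengths.mean()        # centered axis for a stabler fit
    coefs = np.polyfit(x, spectra.T, degree)    # one column of coefficients per spectrum
    trend = np.vander(x, degree + 1) @ coefs    # fitted baseline, shape (n_wl, n_samples)
    return spectra - trend.T
```

A typical call would be `detrend(snv(spectra), wavelengths)`. Note that each spectrum is corrected using only its own values, which is why this combination does not need a large, representative sample set the way MSC does.
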
The first derivative is difficult to interpret (a maximum
in the raw spectrum becomes a zero crossing); the second derivative is easier
to interpret, because a peak in the raw spectrum becomes a negative peak at the
same wavelength. An important option is the gap you use (not too long because you
can lose information, and not too short because you add noise).
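
The effect of the gap is easy to try with a simple gap second derivative. This sketch is a plain finite difference over `gap` data points, with no segment smoothing, so it is a simplification of what the commercial packages do; the function name is an assumption.

```python
import numpy as np

def gap_second_derivative(spectrum, gap):
    """Second difference over `gap` data points.

    The value for point i is s[i - gap] - 2 * s[i] + s[i + gap]; the `gap`
    points at each end, where the window does not fit, are dropped.
    """
    s = np.asarray(spectrum, dtype=float)
    return s[2 * gap:] - 2.0 * s[gap:-gap] + s[:-2 * gap]
```

Element `k` of the result belongs to point `k + gap` of the original spectrum. Running it with several gaps on the same spectrum shows the trade-off: a small gap keeps narrow bands but amplifies the noise, a large gap gives smoother curves but merges neighbouring bands.
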