Variable Selection in N-PLS

UPV authors
Year
Conference: Variable Selection in N-PLS

Abstract

Variable selection is a paramount objective in many fields, such as -omics and image analysis. This dissertation presents two case studies: the first concerns the inclusion of the LASSO method in the N-PLS algorithm, so that any of the modes related to the variables can be shrunk; the second concerns reducing the number of wavelengths of a hyperspectral camera so that 3 to 5 filters can be used in a modified commercial camera. When large amounts of data are available, including too few relevant parameters may bias the estimates, whereas including too many may increase the variance unnecessarily. One way of dealing with such data, when N-way arrays are available and some variable is to be predicted, is to apply N-PLS, which segregates the parameters, producing a more parsimonious model. To further select the relevant variables and obtain simpler models that improve interpretability and/or predictive ability, different selection techniques can be applied. In this work, we introduce LASSO into N-PLS. LASSO shrinks the coefficients of the model, setting some of them exactly to zero and thus performing variable selection at the same time. We compare the results of LASSO-N-PLS with those obtained by building random null distributions of VIPs (variable importance in projection) and weights and subsequently computing their statistical significance. The method showed a good ability to reliably select the important variables contained in large -omics data sets. The second case deals with the detection of rotten oranges in Valencian fruit exports. These exports reached €2.7 billion in 2014. It is therefore of paramount importance to ensure that no rotten oranges remain in the warehouses, since they can infect other fruit and spoil hundreds of kilograms. Here we use hyperspectral images to discriminate between rotten and healthy oranges.
An N-PLS model is built from a set of LCTF hyperspectral images of oranges of different varieties. Some oranges were infected with a fungus and others were infiltrated with water only. For each variety and wavelength, a set of features is extracted, thus defining a three-way X data set per variety: orange samples × features × wavelengths. After splitting the data into calibration and external test sets, an N-way Partial Least Squares Discriminant Analysis (N-PLS-DA) model is fitted for each variety, and the very few wavelengths offering the best correct classification rates in the validation set are selected.
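
As a rough illustration of how an L1 (LASSO-type) penalty can set the weights of one mode exactly to zero, the sketch below computes the weights of a single N-PLS component for a three-way array (samples × features × wavelengths) and soft-thresholds the wavelength-mode weights. It is not the LASSO-N-PLS algorithm of this work: the function names, the synthetic data, and the `lam_k` tuning parameter are assumptions made for the example, and soft-thresholding stands in for a full LASSO fit.

```python
import numpy as np


def soft_threshold(v, lam):
    """Shrink entries toward zero by lam; entries with |v| <= lam become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def sparse_npls_weights(X, y, lam_k=0.5):
    """Weights of one N-PLS component with an L1-type penalty on the third mode.

    X: array (samples, features, wavelengths); y: array (samples,).
    lam_k: hypothetical tuning knob in [0, 1), the threshold expressed as a
           fraction of the largest absolute wavelength weight.
    """
    # Z[j, k] = sum_i y[i] * X[i, j, k], the covariance-like matrix of y with X
    Z = np.einsum("i,ijk->jk", y, X)
    # Unpenalised N-PLS weights: leading left/right singular vectors of Z
    U, _, Vt = np.linalg.svd(Z, full_matrices=False)
    w_j, w_k = U[:, 0], Vt[0, :]
    # L1-type step: soft-threshold the wavelength weights, then renormalise
    w_k = soft_threshold(w_k, lam_k * np.max(np.abs(w_k)))
    norm = np.linalg.norm(w_k)
    if norm > 0:
        w_k = w_k / norm
    return w_j, w_k


# Synthetic data (assumption): only wavelengths 3 and 12 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6, 20))
y = X[:, 0, 3] + X[:, 0, 12] + 0.1 * rng.normal(size=400)

# Centre across the sample mode before computing the weights
Xc = X - X.mean(axis=0)
yc = y - y.mean()

w_j, w_k = sparse_npls_weights(Xc, yc, lam_k=0.5)
selected = np.flatnonzero(w_k)  # indices of wavelengths with non-zero weight
print("selected wavelength indices:", selected)
```

In practice the amount of shrinkage would be chosen by validation, for example by keeping the value that yields the best correct classification rate with only a handful of wavelengths retained.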