A note on knowledge discovery and machine learning in digital soil mapping
Wadoux, Alexandre M.J.C. ; Samuel-Rosa, Alessandro ; Poggio, Laura ; Mulder, Vera Leatitia - \ 2020
European Journal of Soil Science 71 (2020)2. - ISSN 1351-0754 - p. 133 - 136.
mapping - pedometrics - random forest - soil science - variable selection
In digital soil mapping, machine learning (ML) techniques are being used to infer a relationship between a soil property and the covariates. The information derived from this process is often translated into pedological knowledge. This mechanism is referred to as knowledge discovery. This study shows that knowledge discovery based on ML must be treated with caution. We show how pseudo-covariates can be used to accurately predict soil organic carbon in a hypothetical case study. We demonstrate that ML methods can find relevant patterns even when the covariates are meaningless and not related to soil-forming factors and processes. We argue that pattern recognition for prediction should not be equated with knowledge discovery. Knowledge discovery requires more than the recognition of patterns and successful prediction. It requires the pre-selection and preprocessing of pedologically relevant environmental covariates and the posterior interpretation and evaluation of the recognized patterns. We argue that important ML covariates could serve the purpose of providing elements to postulate hypotheses about soil processes that, once validated through experiments, could result in new pedological knowledge. Highlights: We discuss the rationale of knowledge discovery based on the most important machine learning covariates. We use pseudo-covariates to predict topsoil organic carbon with random forest. Soil organic carbon was accurately predicted in a hypothetical case study. Pattern recognition by random forest should not be equated with knowledge discovery.
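The pseudo-covariate effect can be illustrated with a small, self-contained sketch (a toy construction with a nearest-neighbour learner, not the authors' random forest experiment): covariates that are arbitrary smooth functions of location carry no pedological meaning, yet they still predict a spatially structured soil property well, because similarity in covariate space mirrors spatial proximity.

```python
import math

# Locations along a transect and a spatially structured "soil property".
n = 200
x = [i / n for i in range(n)]
y = [math.sin(2 * math.pi * xi + 1.0) for xi in x]  # hypothetical SOC pattern

# Pseudo-covariates: arbitrary smooth functions of location, with no
# causal relation to soil-forming factors or processes.
covs = [[math.cos(2 * math.pi * xi), math.sin(2 * math.pi * xi),
         math.cos(4 * math.pi * xi)] for xi in x]

def nn_predict(i):
    """Leave-one-out 1-nearest-neighbour prediction in covariate space."""
    best_j, best_d = None, float("inf")
    for j in range(n):
        if j == i:
            continue
        d = sum((a - b) ** 2 for a, b in zip(covs[i], covs[j]))
        if d < best_d:
            best_j, best_d = j, d
    return y[best_j]

pred = [nn_predict(i) for i in range(n)]

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u)
                    * sum((b - mv) ** 2 for b in v))
    return num / den

r = pearson(y, pred)
print(f"correlation between observed and predicted: {r:.3f}")
```

Because the pseudo-covariates vary smoothly over space, the nearest neighbour in covariate space is almost always a spatial neighbour, so the "meaningless" covariates predict the property accurately: pattern recognition, but no knowledge discovery.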
Response variable selection in principal response curves using permutation testing
Vendrig, Nadia J. ; Hemerik, Lia ; Braak, Cajo J.F. Ter - \ 2017
Aquatic Ecology 51 (2017)1. - ISSN 1386-2588 - p. 131 - 143.
longitudinal data - multivariate analysis - multivariate time series - permutation testing - Principal response curves - variable selection
Principal response curves analysis (PRC) is widely applied to experimental multivariate longitudinal data for the study of time-dependent treatment effects on the multiple outcomes or response variables (RVs). Often, not all of the RVs included in such a study are affected by the treatment and RV-selection can be used to identify those RVs and so give a better estimate of the principal response. We propose four backward selection approaches, based on permutation testing, that differ in whether coefficient size is used or not in ranking the RVs. These methods are expected to give a more robust result than the use of a straightforward cut-off value for coefficient size. Performance of all methods is demonstrated in a simulation study using realistic data. The permutation testing approach that uses information on coefficient size of RVs speeds up the algorithm without affecting its performance. This most successful permutation testing approach removes roughly 95 % of the RVs that are unaffected by the treatment irrespective of the characteristics of the data set and, in the simulations, correctly identifies up to 97 % of RVs affected by the treatment.
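The permutation-testing building block of such RV selection can be sketched for a single response variable (hypothetical data; the paper's actual procedures add backward selection and coefficient-size ranking on top of this):

```python
import random

random.seed(42)

# Hypothetical abundances of one response variable: 10 control and
# 10 treated mesocosms, with a clear treatment effect.
control = [random.gauss(10.0, 1.0) for _ in range(10)]
treated = [random.gauss(15.0, 1.0) for _ in range(10)]

def perm_test(a, b, n_perm=999):
    """Two-sample permutation test on the difference in means."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    # Count the observed labelling itself, as is conventional.
    return (hits + 1) / (n_perm + 1)

p = perm_test(control, treated)
print(f"permutation p-value: {p:.3f}")
```

An RV-selection scheme of this flavour computes such a p-value for each response variable and removes RVs that show no treatment signal; ranking RVs by coefficient size first, as in the paper's fastest variant, reduces how many of these tests have to be run.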
Do more detailed environmental covariates deliver more accurate soil maps?
Samuel Rosa, A. ; Heuvelink, G.B.M. ; Vasques, G.M. ; Anjos, L.H.C. - \ 2015
Geoderma 243-244 (2015). - ISSN 0016-7061 - p. 214 - 227.
Digital soil mapping - Linear mixed model - auxiliary information - variable selection - model accuracy - soil mapping cost
In this study we evaluated whether investing in more spatially detailed environmental covariates improves the accuracy of digital soil maps. We used a case study from Southern Brazil to map clay content (CLAY), organic carbon content (SOC), and effective cation exchange capacity (ECEC) of the topsoil for a ~ 2000 ha area located on the edge of the plateau of the Paraná Sedimentary Basin. Five covariates, each with two levels of spatial detail were used: area-class soil maps, digital elevation models (DEM), geologic maps, land use maps, and satellite images. Thirty-two multiple linear regression models were calibrated for each soil property using all spatial detail combinations of the covariates. For each combination, stepwise regression was used to select predictor variables incorporated in the model. Model evaluation was done using the adjusted R-square of the regression. The baseline model, calibrated with the less detailed version of each covariate, and the best performing model were used to calibrate two linear mixed models for each soil property. Model parameters were estimated using restricted maximum likelihood. Spatial prediction was performed using the empirical best linear unbiased predictor. Validation of baseline and best performing linear multiple regression and linear mixed models was done using cross-validation. Results show that for CLAY the prediction accuracy did not considerably improve by using more detailed covariates. The amount of variance explained increased only ~ 2 percentage points (pp), less than that obtained by including the kriging step, which explained 4 pp. On the other hand, prediction of SOC and ECEC improved by ~ 13 pp when the baseline model was replaced by the best performing model. Overall, the increase in prediction performance was modest and may not outweigh the extra costs of using more detailed covariates. 
It may be more efficient to spend extra resources on collecting more soil observations, or increasing the detail of only those covariates that have the strongest improvement effect. In our case study, the latter would only work for SOC and ECEC, by investing in a more detailed land use map and possibly also a more detailed geologic map and DEM.
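The model-comparison logic can be made concrete with the adjusted R² used for evaluation, which penalizes each extra predictor (illustrative numbers only, not the study's results):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# A hypothetical baseline model versus one using a more detailed covariate:
# a 2-percentage-point gain in raw R2 at the cost of one extra predictor.
base = adjusted_r2(0.50, n=100, p=5)
detailed = adjusted_r2(0.52, n=100, p=6)
print(f"baseline adj. R2: {base:.3f}, detailed adj. R2: {detailed:.3f}")
```

Whether a gain of this size justifies the acquisition cost of the more detailed covariate is exactly the trade-off the study quantifies.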
A comparison of principal component regression and genomic REML for genomic prediction across populations
Dadousis, C. ; Veerkamp, R.F. ; Heringstad, B. ; Pszczola, M.J. ; Calus, M.P.L. - \ 2014
Genetics, Selection, Evolution 46 (2014). - ISSN 0999-193X - 14 p.
breeding values - multi-breed - short communication - variable selection - wide association - genetic value - dairy-cattle - data sets - accuracy - information
Background Genomic prediction faces two main statistical problems: multicollinearity and n ≪ p (many fewer observations than predictor variables). Principal component (PC) analysis is a multivariate statistical method that is often used to address these problems. The objective of this study was to compare the performance of PC regression (PCR) for genomic prediction with that of a commonly used REML model with a genomic relationship matrix (GREML) and to investigate the full potential of PCR for genomic prediction. Methods The PCR model used either a common or a semi-supervised approach, where PC were selected based either on their eigenvalues (i.e. proportion of variance explained by SNP (single nucleotide polymorphism) genotypes) or on their association with phenotypic variance in the reference population (i.e. the regression sum of squares contribution). Cross-validation within the reference population was used to select the optimum PCR model that minimizes mean squared error. Pre-corrected average daily milk, fat and protein yields of 1609 first lactation Holstein heifers, from Ireland, UK, the Netherlands and Sweden, which were genotyped with 50k SNPs, were analysed. Each testing subset included animals from only one country, or from only one selection line for the UK. Results In general, accuracies of GREML and PCR were similar but GREML slightly outperformed PCR. Inclusion of genotyping information of validation animals into model training (semi-supervised PCR) did not result in more accurate genomic predictions. The highest achievable PCR accuracies were obtained across a wide range of numbers of PC fitted in the regression (from one to more than 1000), across test populations and traits. Using cross-validation within the reference population to derive the number of PC yielded substantially lower accuracies than the highest achievable accuracies obtained across all possible numbers of PC.
Conclusions On average, PCR performed only slightly less well than GREML. When the optimal number of PC was determined based on realized accuracy in the testing population, PCR showed a higher potential in terms of achievable accuracy that was not capitalized when PC selection was based on cross-validation. A standard approach for selecting the optimal set of PC in PCR remains a challenge.
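A minimal principal component regression sketch (simulated data standing in for SNP genotypes, with one dominant latent direction; not the study's GREML comparison):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "genotype" matrix with many more predictors than observations
# (n << p), dominated by one strong latent direction.
n, p = 100, 500
u = rng.standard_normal(n)               # latent structure across animals
v = rng.standard_normal(p)               # loadings across markers
X = np.outer(u, v) + 0.1 * rng.standard_normal((n, p))
y = u + 0.3 * rng.standard_normal(n)     # phenotype driven by the latent factor

# PCR: project onto the leading principal components, then regress.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 1                                     # number of PCs retained
T = U[:, :k] * s[:k]                      # PC scores
beta, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
fitted = T @ beta + y.mean()

r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R2 with {k} principal component(s): {r2:.3f}")
```

Choosing k is the crux: as the study shows, cross-validation within the reference population may land far from the number of PCs that maximizes realized accuracy in the target population.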
Whole-genome regression and prediction methods applied to plant and animal breeding
de los Campos, G. ; Hickey, J.M. ; Pong-Wong, R. ; Daetwyler, H.D. ; Calus, M.P.L. - \ 2013
Genetics 193 (2013)2. - ISSN 0016-6731 - p. 327 - 345.
marker-assisted selection - quantitative trait locus - genetic-relationship information - single nucleotide polymorphisms - linear unbiased prediction - dense molecular markers - dairy-cattle - variable selection - reference population - beef-cattle
Genomic-enabled prediction is becoming increasingly important in animal and plant breeding, and is also receiving attention in human genetics. Deriving accurate predictions of complex traits requires implementing whole-genome regression (WGR) models where phenotypes are regressed on thousands of markers concurrently. Following the groundbreaking contribution of Meuwissen et al. (2001), several methods have been proposed and evaluated, and genome-enabled selection (GS) is being implemented in several plant and animal breeding programs. The list of methods is long, and the relationships between the available methods have not been fully addressed. In this article we provide an overview of available methods for implementing parametric WGR models, discuss selected topics that emerge in the application of these methods and present a general discussion of lessons learnt from simulation and empirical data analysis in the last decade.
Selection properties of Type II maximum likelihood (empirical bayes) linear models with individual variance components for predictors
Jamil, T. ; Braak, C.J.F. ter - \ 2012
Pattern Recognition Letters 33 (2012)9. - ISSN 0167-8655 - p. 1205 - 1212.
gene-expression data - variable selection - elastic net - regression - regularization - shrinkage - chemometrics - networks - genome - lasso
Maximum likelihood (ML) in the linear model overfits when the number of predictors (M) exceeds the number of objects (N). One possible solution is the relevance vector machine (RVM), a form of automatic relevance determination that has gained popularity in the pattern recognition and machine learning community through the well-known textbook by Bishop (2006). RVM assigns individual precisions to the weights of predictors, which are then estimated by maximizing the marginal likelihood (type II ML or empirical Bayes). We investigated the selection properties of RVM both analytically and by experiments in a regression setting. We show analytically that RVM selects predictors when the absolute z-ratio (|least squares estimate|/standard error) exceeds 1 in the case of orthogonal predictors and, for M = 2, that this still holds true for correlated predictors when the other z-ratio is large. RVM selects the stronger of two highly correlated predictors. In experiments with real and simulated data, RVM is outcompeted by other popular regularization methods (LASSO and/or PLS) in terms of prediction performance. We conclude that type II ML is not the general answer in high-dimensional prediction problems. In extensions of RVM to obtain stronger selection, improper priors (based on the inverse gamma family) have been assigned to the inverse precisions (variances), with parameters estimated by penalized marginal likelihood. We critically assess this approach and suggest a proper variance prior related to the Beta distribution, which gives similar selection and shrinkage properties and allows a fully Bayesian treatment.
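The z-ratio condition can be checked numerically for the single orthonormal predictor case (a self-contained sketch of the type II ML objective, assuming noise variance 1 and a unit-norm predictor; not the paper's full derivation): the marginal likelihood has an interior maximum in the weight precision alpha only when z² > 1; otherwise it keeps rising as alpha grows, which prunes the predictor.

```python
import math

def log_evidence(alpha, z):
    """Type II ML objective (up to a constant) for y = x*w + e with one
    orthonormal predictor, noise variance 1, and prior w ~ N(0, 1/alpha).
    Here z is the least-squares estimate divided by its standard error."""
    return 0.5 * z ** 2 / (alpha + 1.0) - 0.5 * math.log(1.0 + 1.0 / alpha)

alphas = [10 ** (k / 2) for k in range(-4, 9)]  # grid from 0.01 to 10^4

def argmax_alpha(z):
    values = [log_evidence(a, z) for a in alphas]
    return values.index(max(values))

i_strong = argmax_alpha(z=3.0)  # |z| > 1: evidence peaks at a finite alpha
i_weak = argmax_alpha(z=0.5)    # |z| < 1: evidence keeps rising with alpha
print(i_strong, i_weak, len(alphas) - 1)
```

For |z| > 1 the maximizing alpha is interior (the predictor is kept, with shrinkage toward zero), while for |z| < 1 the maximum sits at the end of the grid, i.e. alpha effectively goes to infinity and the weight is pruned.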
Canonical correlation analysis of multiple sensory directed metabolomics data blocks reveals corresponding parts between data blocks
Doeswijk, T.G. ; Hageman, J.A. ; Westerhuis, J.A. ; Tikunov, Y.M. ; Bovy, A.G. ; Eeuwijk, F.A. van - \ 2011
Chemometrics and Intelligent Laboratory Systems 107 (2011)2. - ISSN 0169-7439 - p. 371 - 376.
variable selection - component analysis - multiblock - quality - fusion - models - pls
Multiple analytical platforms are frequently used in metabolomics studies. The resulting multiple data blocks contain, in general, similar parts of information which can be disclosed by chemometric methods. The metabolites of interest, however, are usually just a minor part of the complete data block and are related to a response of interest such as quality traits. Concatenation of data matrices is frequently used to simultaneously analyze multiple data blocks. Two main problems may occur with this approach: (1) the number of variables becomes very large in relation to the number of observations, which may deteriorate model performance, and (2) scaling issues between the data blocks need to be resolved. Therefore, a method is proposed that circumvents direct concatenation of two data matrices but does uncover the shared and distinct parts of the data sets in relation to quality traits. The relevant part of the data blocks with respect to the quality trait of interest is revealed by partial least squares regression on each of the data blocks. The score vectors of both models that are predictive for the quality trait are then used in a canonical correlation analysis. Highly correlating score vectors indicate parts of the data blocks that are closely related. By inspecting the relevant loading vectors, the metabolites of interest are revealed.
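The two-step idea — one-component PLS scores per block, then correlating the score vectors — can be sketched as follows (simulated blocks sharing a quality-related latent factor; all names and noise levels are hypothetical, and with single score vectors the canonical correlation reduces to a Pearson correlation):

```python
import math
import random

random.seed(1)

# A shared latent factor drives a quality trait and part of both data blocks.
n = 60
latent = [random.gauss(0, 1) for _ in range(n)]
quality = latent[:]  # quality trait of interest

def make_block(n_vars, noise):
    return [[latent[i] + random.gauss(0, noise) for _ in range(n_vars)]
            for i in range(n)]

block1 = make_block(5, 0.5)   # e.g. LC-MS metabolite intensities
block2 = make_block(8, 0.5)   # e.g. GC-MS metabolite intensities

def center(col):
    m = sum(col) / len(col)
    return [v - m for v in col]

def pls_score(X, y):
    """First PLS component score: t = X w, with w proportional to X'y."""
    cols = [center([row[j] for row in X]) for j in range(len(X[0]))]
    yc = center(y)
    w = [sum(c[i] * yc[i] for i in range(len(yc))) for c in cols]
    norm = math.sqrt(sum(wj ** 2 for wj in w))
    w = [wj / norm for wj in w]
    return [sum(w[j] * cols[j][i] for j in range(len(cols)))
            for i in range(len(y))]

def pearson(u, v):
    u, v = center(u), center(v)
    return (sum(a * b for a, b in zip(u, v))
            / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v)))

t1 = pls_score(block1, quality)
t2 = pls_score(block2, quality)
corr = pearson(t1, t2)
print(f"correlation of first PLS score vectors: {corr:.3f}")
```

A high correlation between the score vectors flags the shared, quality-related part of the two blocks; inspecting the corresponding weight (loading) vectors then points to the metabolites of interest.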
Data-processing strategies for metabolomics studies
Hendriks, M.M.W.B. ; Eeuwijk, F.A. van; Jellema, R.H. ; Westerhuis, J.A. ; Reijmers, T.H. ; Hoefsloot, H.C.J. ; Smilde, A.K. - \ 2011
TrAC : Trends in Analytical Chemistry 30 (2011)10. - ISSN 0165-9936 - p. 1685 - 1698.
principal component analysis - mass-spectrometry - variable selection - optimal-design - models - identification - metabolites - networks - tool - nmr
Metabolomics studies aim at a better understanding of biochemical processes by studying relations between metabolites and between metabolites and other types of information (e.g., sensory and phenotypic features). The objectives of these studies are diverse, but the types of data generated and the methods for extracting information from the data and analysing the data are similar. Besides instrumental analysis tools, various data-analysis tools are needed to extract this relevant information. The entire data-processing workflow is complex and has many steps. For a comprehensive overview, we cover the entire workflow of metabolomics studies, starting from experimental design and sample-size determination to tools that can aid in biological interpretation. We include illustrative examples and discuss the problems that have to be dealt with in data analysis in metabolomics. We also discuss where the challenges are for developing new methods and tailor-made quantitative strategies.
Phenotypic selection on leaf ecophysiological traits in Helianthus
Donovan, L.A. ; Ludwig, F. ; Rosenthal, D.R. ; Rieseberg, L.H. ; Dudley, S.A. - \ 2009
New Phytologist 183 (2009)3. - ISSN 0028-646X - p. 868 - 879.
carbon-isotope discrimination - water-use efficiency - plant physiological traits - natural-selection - impatiens-capensis - genetic-variation - polygonum-arenastrum - differing selection - variable selection - functional traits
Habitats that differ in soil resource availability are expected to differ for selection on resource-related plant traits. Here, we examined spatial and temporal variation in phenotypic selection on leaf ecophysiological traits for 10 Helianthus populations, including two species of hybrid origin, Helianthus anomalus and Helianthus deserticola, and artificial hybrids of their ancestral parents. Leaf traits assessed were leaf size, succulence, nitrogen (N) concentration and water-use efficiency (WUE). Biomass and leaf traits of artificial hybrids indicate that the actively moving dune habitat of H. anomalus was more growth limiting, with lower N availability but higher relative water availability than the stabilized dune habitat of H. deserticola. Habitats differed for direct selection on leaf N and WUE, but not size or succulence, for the artificial hybrids. However, within the H. anomalus habitat, direct selection on WUE also differed among populations. Across years, direct selection on leaf traits did not differ. Leaf N was the only trait for which direct selection differed between habitats but not within the H. anomalus habitat, suggesting that nutrient limitation is an important selective force driving adaptation of H. anomalus to the active dune habitat.
Automated procedure for candidate compound selection in GCMS metabolomics based on prediction of Kovats retention index
Mihaleva, V.V. ; Verhoeven, H.A. ; Vos, C.H. de; Hall, R.D. ; Ham, R.C.H.J. van - \ 2009
Bioinformatics 25 (2009)6. - ISSN 1367-4803 - p. 787 - 794.
volatile organic-compounds - variable selection - genetic algorithms - mass-spectrometry - descriptors - strategy - isomers - variety - cancer - time
Motivation: Matching both the retention index (RI) and the mass spectrum of an unknown compound against a mass spectral reference library provides strong evidence for a correct identification of that compound. Data on retention indices are, however, available for only a small fraction of the compounds in such libraries. We propose a quantitative structure–retention index model that enables the ranking and filtering of putative identifications of compounds for which the predicted RI falls outside a predefined window. Results: We constructed multiple linear regression and support vector regression (SVR) models using a set of descriptors obtained with a genetic algorithm as variable selection method. The SVR model is a significant improvement over previous models built for structurally diverse compounds as it covers a large range (360 to 4100) of RI values and gives better prediction of isomer compounds. The hit list reduction varied from 41% to 60% and depended on the size of the original hit list. Large hit lists were reduced to a greater extent compared to small hit lists.
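The filtering step can be sketched independently of the regression model itself (hypothetical compound names and RI values; the window width is an assumed parameter):

```python
# Candidate identifications from a spectral library search; each carries
# an RI predicted from its structure by the regression model.
candidates = [
    ("hexanal", 801),      # (name, predicted retention index)
    ("2-hexanone", 788),
    ("limonene", 1030),
    ("nonanal", 1104),
]

def filter_hits(candidates, measured_ri, window=50):
    """Discard candidates whose predicted RI falls outside the window
    around the RI measured for the unknown peak."""
    return [(name, ri) for name, ri in candidates
            if abs(ri - measured_ri) <= window]

# Suppose the unknown peak eluted at a measured RI of 800.
kept = filter_hits(candidates, measured_ri=800)
print([name for name, _ in kept])
```

Here half the hit list is removed, in the spirit of the 41% to 60% reductions the authors report; the structurally implausible candidates never reach manual review.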