Abstract
Background: The development of accurate classification models depends on the methods used to identify the most relevant variables. The aim of this article is to evaluate variable selection methods for identifying important variables when predicting a binary response with nonlinear statistical models. Our goals in model selection are to produce stable models that do not overfit, are interpretable, generate accurate predictions, and have minimum bias. This work was motivated by data on clinical and laboratory features of Helicobacter pylori infections obtained from 60 individuals enrolled in a prospective observational study.

Results: We carried out a comprehensive performance comparison of several nonlinear classification models on the H. pylori data set. We compared variable selection results from Multivariate Adaptive Regression Splines (MARS), logistic regression with regularization, Generalized Additive Models (GAMs), and Bayesian variable selection in GAMs. We found that the MARS approach has the highest predictive power because the nonlinearity assumptions for the candidate predictors are strongly satisfied, a finding demonstrated via deviance chi-square testing procedures in GAMs.

Conclusions: Our results suggest that the physiological free amino acids citrulline, histidine, lysine and arginine are the major features for predicting H. pylori peptic ulcer disease on the basis of amino acid profiling.
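As a rough illustration of how regularization can act as variable selection in logistic regression, the sketch below fits an L1-penalized (lasso) logistic model by proximal gradient descent. The abstract does not say which penalty or software was used, so the L1 penalty, the function names (`fit_l1_logistic`, `soft_threshold`), the tuning values, and the tiny synthetic data set are all assumptions made for illustration; none of this is the study's H. pylori data or its actual procedure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(X, y, lam=0.1, step=0.1, n_iter=2000):
    """Proximal gradient descent for logistic loss plus lam * sum(|w_j|).

    The intercept is left unpenalized. lam, step and n_iter are
    illustrative defaults, not values taken from the article.
    """
    n, p = len(X), len(X[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals: predicted probability minus observed 0/1 label.
        residuals = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row))) - target
            for row, target in zip(X, y)
        ]
        grad_b = sum(residuals) / n
        grad_w = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                  for j in range(p)]
        intercept -= step * grad_b
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        weights = [soft_threshold(w - step * g, step * lam)
                   for w, g in zip(weights, grad_w)]
    return weights, intercept

# Synthetic toy data (hypothetical): column 0 determines the label,
# column 1 is balanced noise that carries no information about it.
X = [[x1, x2] for x1 in (-2.0, -1.0, 1.0, 2.0) for x2 in (-1.0, 1.0)]
y = [1 if row[0] > 0 else 0 for row in X]

weights, intercept = fit_l1_logistic(X, y)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("coefficients:", weights)
print("selected columns:", selected)
```

Under these assumptions the penalty holds the coefficient of the uninformative column at exactly zero while the informative column receives a positive weight, which is the sense in which a regularized fit "selects" variables. Nonlinear methods such as MARS or GAMs select among basis functions or smooth terms rather than raw linear coefficients, so this sketch covers only the simplest of the compared approaches.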
Original language | English (US) |
---|---|
Pages (from-to) | 95-101 |
Number of pages | 7 |
Journal | Journal of Proteomics and Bioinformatics |
Volume | 7 |
Issue number | 4 |
State | Published - 2014 |
Keywords
- Amino acid analysis
- Classification
- Helicobacter pylori
- Peptic ulcer disease
- Variable selection
ASJC Scopus subject areas
- Biochemistry
- Molecular Biology
- Computer Science Applications
- Cell Biology