A structured approach to predictive modeling of a two-class problem using multidimensional data sets

Heidi Spratt, Hyunsu Ju, Allan R. Brasier

Research output: Contribution to journal, Article

12 Citations (Scopus)

Abstract

Biological experiments in the post-genome era can generate a staggering amount of complex data that challenges experimentalists to extract meaningful information. Increasingly, the success of an appropriately controlled experiment relies on a robust data analysis pipeline. In this paper, we present a structured approach to the analysis of multidimensional data that relies on a close, two-way communication between the bioinformatician and experimentalist. A sequential approach employing data exploration (visualization, graphical and analytical study), pre-processing, feature reduction and supervised classification using machine learning is presented. This standardized approach is illustrated by an example from a proteomic data analysis that has been used to predict the risk of infectious disease outcome. Strategies for model selection and post hoc model diagnostics are presented and applied to the case illustration. We discuss some of the practical lessons we have learned applying supervised classification to multidimensional data sets, one of which is the importance of feature reduction in achieving optimal modeling performance.
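The sequence named in the abstract (pre-processing, feature reduction, supervised classification) can be pictured with a small sketch. This is not the authors' pipeline: the z-score scaling, the class-mean-difference feature score, the nearest-centroid classifier, the toy data, and all function names are assumptions chosen only to keep the illustration self-contained.

```python
from statistics import mean, stdev

def standardize(rows):
    """Pre-processing (assumed): z-score each feature column."""
    cols = list(zip(*rows))
    stats = [(mean(c), stdev(c)) for c in cols]
    return [[(v - m) / s if s else 0.0 for v, (m, s) in zip(row, stats)]
            for row in rows]

def top_features(rows, labels, k):
    """Feature reduction (assumed): keep the k features whose
    class means differ most after scaling."""
    scores = []
    for j in range(len(rows[0])):
        a = [r[j] for r, y in zip(rows, labels) if y == 0]
        b = [r[j] for r, y in zip(rows, labels) if y == 1]
        scores.append((abs(mean(a) - mean(b)), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])

def fit_centroids(rows, labels, features):
    """Supervised classification (assumed): one centroid per class."""
    centroids = {}
    for cls in (0, 1):
        members = [[r[j] for j in features]
                   for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [mean(col) for col in zip(*members)]
    return centroids

def predict(row, centroids, features):
    """Assign the class whose centroid is nearest (squared Euclidean)."""
    point = [row[j] for j in features]
    def dist(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(centroids, key=lambda cls: dist(centroids[cls]))

# Hypothetical toy data: 6 samples, 3 features, two classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 4.8, 0.9],
    [0.9, 5.1, 0.4],
    [3.0, 5.0, 0.8],
    [3.2, 4.9, 0.3],
    [2.9, 5.2, 0.5],
]
y = [0, 0, 0, 1, 1, 1]

Z = standardize(X)
selected = top_features(Z, y, k=1)
model = fit_centroids(Z, y, selected)
# Predictions on the training samples only; this is an optimistic check,
# not an estimate of performance on unseen data.
predictions = [predict(r, model, selected) for r in Z]
print(selected, predictions)
```

In a real analysis the scaling and feature selection would be fitted inside each cross-validation fold, so that held-out samples do not influence which features are kept.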

Original language: English (US)
Pages (from-to): 73-85
Number of pages: 13
Journal: Methods
Volume: 61
Issue number: 1
DOI: 10.1016/j.ymeth.2013.01.002
State: Published, May 15, 2013

Fingerprint

Learning systems
Visualization
Pipelines
Genes
Experiments
Proteomics
Communicable Diseases
Communication
Processing
Genome
Datasets
Machine Learning

Keywords

  • Classification
  • Data exploration
  • Data mining
  • Machine learning
  • Supervised learning

ASJC Scopus subject areas

  • Molecular Biology
  • Biochemistry, Genetics and Molecular Biology (all)

Cite this

A structured approach to predictive modeling of a two-class problem using multidimensional data sets. / Spratt, Heidi; Ju, Hyunsu; Brasier, Allan R.

In: Methods, Vol. 61, No. 1, 15.05.2013, p. 73-85.


@article{e2c1eefaf62748eb990a8f8d08e31b58,
title = "A structured approach to predictive modeling of a two-class problem using multidimensional data sets",
abstract = "Biological experiments in the post-genome era can generate a staggering amount of complex data that challenges experimentalists to extract meaningful information. Increasingly, the success of an appropriately controlled experiment relies on a robust data analysis pipeline. In this paper, we present a structured approach to the analysis of multidimensional data that relies on a close, two-way communication between the bioinformatician and experimentalist. A sequential approach employing data exploration (visualization, graphical and analytical study), pre-processing, feature reduction and supervised classification using machine learning is presented. This standardized approach is illustrated by an example from a proteomic data analysis that has been used to predict the risk of infectious disease outcome. Strategies for model selection and post hoc model diagnostics are presented and applied to the case illustration. We discuss some of the practical lessons we have learned applying supervised classification to multidimensional data sets, one of which is the importance of feature reduction in achieving optimal modeling performance.",
keywords = "Classification, Data exploration, Data mining, Machine learning, Supervised learning",
author = "Spratt, Heidi and Ju, Hyunsu and Brasier, {Allan R.}",
year = "2013",
month = "5",
day = "15",
doi = "10.1016/j.ymeth.2013.01.002",
language = "English (US)",
volume = "61",
pages = "73--85",
journal = "Methods",
issn = "1046-2023",
publisher = "Academic Press Inc.",
number = "1"
}

TY - JOUR
T1 - A structured approach to predictive modeling of a two-class problem using multidimensional data sets
AU - Spratt, Heidi
AU - Ju, Hyunsu
AU - Brasier, Allan R.
PY - 2013/5/15
Y1 - 2013/5/15
N2 - Biological experiments in the post-genome era can generate a staggering amount of complex data that challenges experimentalists to extract meaningful information. Increasingly, the success of an appropriately controlled experiment relies on a robust data analysis pipeline. In this paper, we present a structured approach to the analysis of multidimensional data that relies on a close, two-way communication between the bioinformatician and experimentalist. A sequential approach employing data exploration (visualization, graphical and analytical study), pre-processing, feature reduction and supervised classification using machine learning is presented. This standardized approach is illustrated by an example from a proteomic data analysis that has been used to predict the risk of infectious disease outcome. Strategies for model selection and post hoc model diagnostics are presented and applied to the case illustration. We discuss some of the practical lessons we have learned applying supervised classification to multidimensional data sets, one of which is the importance of feature reduction in achieving optimal modeling performance.


KW - Classification
KW - Data exploration
KW - Data mining
KW - Machine learning
KW - Supervised learning
UR - http://www.scopus.com/inward/record.url?scp=84878866507&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84878866507&partnerID=8YFLogxK
U2 - 10.1016/j.ymeth.2013.01.002
DO - 10.1016/j.ymeth.2013.01.002
M3 - Article
C2 - 23321025
AN - SCOPUS:84878866507
VL - 61
SP - 73
EP - 85
JO - Methods
JF - Methods
SN - 1046-2023
IS - 1
ER -