Statistical Models for Protein Validation Using Tandem Mass Spectral Data and Protein Amino Acid Sequence Databases

Rovshan Sadygov, Hongbin Liu, John R. Yates

Research output: Contribution to journalArticle

112 Citations (Scopus)

Abstract

The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. Recently we have presented a probabilistic model for peptide identification that uses hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. For this we formulate the protein identification problem in terms of two independent models, two-hypothesis binomial and multinomial models, which use the hypergeometric probabilities and cross-correlation scores, respectively. Each database search result is assumed to be a probabilistic event. The Bernoulli event has two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from relative length of the protein in the database (the null hypothesis) or the hypergeometric probability scores of the protein's peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (number of database matches to its peptides) given the size of the data set (number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate p-value probabilities of each score bin. The probabilities are then normalized with respect to score bins to generate normalized probabilities of all score bins. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from results of database search using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy level of the approach. The shortcomings of the models are related to the cases when protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT_PROBE.

Original languageEnglish (US)
Pages (from-to)1664-1671
Number of pages8
JournalAnalytical Chemistry
Volume76
Issue number6
DOIs
StatePublished - Mar 15 2004
Externally publishedYes

Fingerprint

Amino Acids
Proteins
Peptides
Bins
Identification (control systems)
Statistical Models
Maximum likelihood
Statistics
Ions

ASJC Scopus subject areas

  • Analytical Chemistry

Cite this

Statistical Models for Protein Validation Using Tandem Mass Spectral Data and Protein Amino Acid Sequence Databases. / Sadygov, Rovshan; Liu, Hongbin; Yates, John R.

In: Analytical Chemistry, Vol. 76, No. 6, 15.03.2004, p. 1664-1671.

Research output: Contribution to journalArticle

@article{8e2b98d1867a49c5a516074fe821e23f,
title = "Statistical Models for Protein Validation Using Tandem Mass Spectral Data and Protein Amino Acid Sequence Databases",
abstract = "The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. Recently we have presented a probabilistic model for peptide identification that uses hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. For this we formulate the protein identification problem in terms of two independent models, two-hypothesis binomial and multinomial models, which use the hypergeometric probabilities and cross-correlation scores, respectively. Each database search result is assumed to be a probabilistic event. The Bernoulli event has two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from relative length of the protein in the database (the null hypothesis) or the hypergeometric probability scores of the protein's peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (number of database matches to its peptides) given the size of the data set (number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate p-value probabilities of each score bin. The probabilities are then normalized with respect to score bins to generate normalized probabilities of all score bins. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from results of database search using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy level of the approach. The shortcomings of the models are related to the cases when protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT_PROBE.",
author = "Rovshan Sadygov and Hongbin Liu and Yates, {John R.}",
year = "2004",
month = "3",
day = "15",
doi = "10.1021/ac035112y",
language = "English (US)",
volume = "76",
pages = "1664--1671",
journal = "Analytical Chemistry",
issn = "0003-2700",
publisher = "American Chemical Society",
number = "6",

}

TY - JOUR

T1 - Statistical Models for Protein Validation Using Tandem Mass Spectral Data and Protein Amino Acid Sequence Databases

AU - Sadygov, Rovshan

AU - Liu, Hongbin

AU - Yates, John R.

PY - 2004/3/15

Y1 - 2004/3/15

N2 - The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. Recently we have presented a probabilistic model for peptide identification that uses hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. For this we formulate the protein identification problem in terms of two independent models, two-hypothesis binomial and multinomial models, which use the hypergeometric probabilities and cross-correlation scores, respectively. Each database search result is assumed to be a probabilistic event. The Bernoulli event has two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from relative length of the protein in the database (the null hypothesis) or the hypergeometric probability scores of the protein's peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (number of database matches to its peptides) given the size of the data set (number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate p-value probabilities of each score bin. The probabilities are then normalized with respect to score bins to generate normalized probabilities of all score bins. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from results of database search using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy level of the approach. The shortcomings of the models are related to the cases when protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT_PROBE.

AB - The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. Recently we have presented a probabilistic model for peptide identification that uses hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. For this we formulate the protein identification problem in terms of two independent models, two-hypothesis binomial and multinomial models, which use the hypergeometric probabilities and cross-correlation scores, respectively. Each database search result is assumed to be a probabilistic event. The Bernoulli event has two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from relative length of the protein in the database (the null hypothesis) or the hypergeometric probability scores of the protein's peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (number of database matches to its peptides) given the size of the data set (number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate p-value probabilities of each score bin. The probabilities are then normalized with respect to score bins to generate normalized probabilities of all score bins. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from results of database search using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy level of the approach. The shortcomings of the models are related to the cases when protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT_PROBE.

UR - http://www.scopus.com/inward/record.url?scp=1542617945&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=1542617945&partnerID=8YFLogxK

U2 - 10.1021/ac035112y

DO - 10.1021/ac035112y

M3 - Article

VL - 76

SP - 1664

EP - 1671

JO - Analytical Chemistry

JF - Analytical Chemistry

SN - 0003-2700

IS - 6

ER -