Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset

Ming Guang Shi, Jun Feng Xia, Xue Ling Li, De Shuang Huang

Research output: Contribution to journal (Article)

58 Citations (Scopus)

Abstract

Identifying protein-protein interactions (PPIs) is critical for understanding the cellular function of proteins and the machinery of a proteome. PPI data derived from high-throughput technologies are often incomplete and noisy. It is therefore important to develop both computational methods and high-quality interaction datasets for predicting PPIs. A sequence-based method is proposed that combines correlation coefficient (CC) transformation with a support vector machine (SVM). CC transformation not only accounts for the neighboring effect within a protein sequence but also describes the level of CC between two protein sequences. A gold standard positives (interacting) dataset, MIPS Core, and a gold standard negatives (non-interacting) dataset, GO-NEG, of the yeast Saccharomyces cerevisiae were mined to evaluate the method objectively and to attenuate bias. The SVM model combined with CC transformation yielded the best performance, with an accuracy of 87.94% on the gold standard positives and gold standard negatives datasets. The MATLAB source code and the datasets are available on request from smgsmg@mail.ustc.edu.cn.
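
The abstract does not spell out the CC transformation, so the following is only a minimal sketch of one way such a feature could be built, not the authors' implementation (their code is MATLAB and available on request). It assumes each residue is mapped to a single numeric property (here the Kyte-Doolittle hydropathy scale, chosen for illustration), that the "neighboring effect" is captured by the Pearson correlation between the property profile and a lagged copy of itself, and that a protein pair is represented by concatenating the two per-protein vectors. The names `HYDROPATHY`, `cc_features`, `pair_features`, and the `max_lag` parameter are hypothetical, and the sketch does not cover the correlation between the two sequences that the abstract also mentions.

```python
from math import sqrt

# Kyte-Doolittle hydropathy values; using this particular scale is an
# assumption made for illustration, not a descriptor named in the abstract.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / sqrt(var_x * var_y)


def cc_features(seq, max_lag=3):
    """Correlate the property profile with itself shifted by 1..max_lag.

    The result has max_lag entries regardless of sequence length, which is
    what lets proteins of different lengths share one feature space.
    """
    profile = [HYDROPATHY[aa] for aa in seq]
    return [pearson(profile[:-lag], profile[lag:])
            for lag in range(1, max_lag + 1)]


def pair_features(seq_a, seq_b, max_lag=3):
    """Hypothetical pair encoding: concatenate both per-protein vectors."""
    return cc_features(seq_a, max_lag) + cc_features(seq_b, max_lag)


# Two short made-up peptide fragments, not real yeast proteins.
vector = pair_features("MKTAYIAKQR", "GSHMLEDPVA")
print(len(vector))
```

Under these assumptions, vectors for pairs drawn from the positives and negatives datasets, labeled interacting or non-interacting, would form the training input for an SVM classifier.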

Original language: English (US)
Pages (from-to): 891-899
Number of pages: 9
Journal: Amino Acids
Volume: 38
Issue number: 3
DOI: 10.1007/s00726-009-0295-y
State: Published - Mar 2010
Externally published: Yes

Keywords

  • Correlation coefficient
  • Gold standard negatives dataset
  • Gold standard positives dataset
  • Protein sequence
  • Protein-protein interactions
  • Support vector machine

ASJC Scopus subject areas

  • Biochemistry
  • Clinical Biochemistry
  • Organic Chemistry

Cite this

Shi, Ming Guang; Xia, Jun Feng; Li, Xue Ling; Huang, De Shuang. Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset. In: Amino Acids, Vol. 38, No. 3, 03.2010, p. 891-899.

@article{702bc872b6504bc485977f2f363cd26c,
title = "Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset",
keywords = "Correlation coefficient, Gold standard negatives dataset, Gold standard positives dataset, Protein sequence, Protein-protein interactions, Support vector machine",
author = "Shi, {Ming Guang} and Xia, {Jun Feng} and Li, {Xue Ling} and Huang, {De Shuang}",
year = "2010",
month = "3",
doi = "10.1007/s00726-009-0295-y",
language = "English (US)",
volume = "38",
pages = "891--899",
journal = "Amino Acids",
issn = "0939-4451",
publisher = "Springer Wien",
number = "3",
}

TY - JOUR
T1 - Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset
AU - Shi, Ming Guang
AU - Xia, Jun Feng
AU - Li, Xue Ling
AU - Huang, De Shuang
PY - 2010/3
Y1 - 2010/3
KW - Correlation coefficient
KW - Gold standard negatives dataset
KW - Gold standard positives dataset
KW - Protein sequence
KW - Protein-protein interactions
KW - Support vector machine
UR - http://www.scopus.com/inward/record.url?scp=77951667947&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77951667947&partnerID=8YFLogxK
U2 - 10.1007/s00726-009-0295-y
DO - 10.1007/s00726-009-0295-y
M3 - Article
C2 - 19387790
AN - SCOPUS:77951667947
VL - 38
SP - 891
EP - 899
JO - Amino Acids
JF - Amino Acids
SN - 0939-4451
IS - 3
ER -