TY - JOUR
T1 - Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset
AU - Shi, Ming Guang
AU - Xia, Jun Feng
AU - Li, Xue Ling
AU - Huang, De Shuang
N1 - Funding Information: This work was supported by the grants of the National Science Foundation of China, Nos. 60472111 and 30570368, the grant from the National Basic Research Program of China (973 Program), No. 2007CB311002, the grants from the National High Technology Research and Development Program of China (863 Program), Nos. 2007AA01Z167 and 2006AA02Z309, the grant of Oversea Outstanding Scholars Fund of CAS, No. 2005-1-18, HFUT, No. 070403F and the Knowledge Innovation Program of the Chinese Academy of Sciences (0823A16121).
PY - 2010/3
Y1 - 2010/3
N2 - Identifying protein-protein interactions (PPIs) is critical for understanding the cellular functions of proteins and the machinery of a proteome. PPI data derived from high-throughput technologies are often incomplete and noisy. Therefore, it is important to develop computational methods and high-quality interaction datasets for predicting PPIs. A sequence-based method is proposed that combines correlation coefficient (CC) transformation with a support vector machine (SVM). The CC transformation not only accounts for the neighboring effect within a protein sequence but also describes the level of CC between two protein sequences. A gold standard positives (interacting) dataset, MIPS Core, and a gold standard negatives (non-interacting) dataset, GO-NEG, of the yeast Saccharomyces cerevisiae were mined to evaluate the method objectively and attenuate bias. The SVM model combined with CC transformation yielded the best performance, with an accuracy of 87.94% on the gold standard positives and gold standard negatives datasets. The MATLAB source code and the datasets are available on request from [email protected].
AB - Identifying protein-protein interactions (PPIs) is critical for understanding the cellular functions of proteins and the machinery of a proteome. PPI data derived from high-throughput technologies are often incomplete and noisy. Therefore, it is important to develop computational methods and high-quality interaction datasets for predicting PPIs. A sequence-based method is proposed that combines correlation coefficient (CC) transformation with a support vector machine (SVM). The CC transformation not only accounts for the neighboring effect within a protein sequence but also describes the level of CC between two protein sequences. A gold standard positives (interacting) dataset, MIPS Core, and a gold standard negatives (non-interacting) dataset, GO-NEG, of the yeast Saccharomyces cerevisiae were mined to evaluate the method objectively and attenuate bias. The SVM model combined with CC transformation yielded the best performance, with an accuracy of 87.94% on the gold standard positives and gold standard negatives datasets. The MATLAB source code and the datasets are available on request from [email protected].
KW - Correlation coefficient
KW - Gold standard negatives dataset
KW - Gold standard positives dataset
KW - Protein sequence
KW - Protein-protein interactions
KW - Support vector machine
UR - http://www.scopus.com/inward/record.url?scp=77951667947&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77951667947&partnerID=8YFLogxK
U2 - 10.1007/s00726-009-0295-y
DO - 10.1007/s00726-009-0295-y
M3 - Article
C2 - 19387790
AN - SCOPUS:77951667947
SN - 0939-4451
VL - 38
SP - 891
EP - 899
JO - Amino Acids
JF - Amino Acids
IS - 3
ER -