TY - JOUR
T1 - How independent are the appearances of n-mers in different genomes?
AU - Fofanov, Yuriy
AU - Luo, Yi
AU - Katili, Charles
AU - Wang, Jim
AU - Belosludtsev, Yuri
AU - Powdrill, Thomas
AU - Belapurkar, Chetan
AU - Fofanov, Viacheslav
AU - Li, Tong Bin
AU - Chumakov, Sergey
AU - Pettitt, B. Montgomery
N1 - Funding Information:
The authors thank Prof. M. Hogan for interesting conversations. S.C., B.M.P. and Y.F. thank TLCC for partial funding of this work. T.-B.L. was supported by a training fellowship from the W.M. Keck Foundation to the Gulf Coast Consortia through the Keck Center for Computational and Structural Biology. B.M.P. and Y.F. thank the NIH for partial support of this work and NPACI for computational support. S.C. is grateful to the University of Houston Computer Science Department for hospitality.
PY - 2004/10/12
Y1 - 2004/10/12
N2 - Motivation: Analysis of statistical properties of DNA sequences is important for evolutionary biology as well as for DNA probe and PCR technologies. These technologies, in turn, can be used for organism identification, which implies applications in the diagnosis of infectious diseases, environmental studies, etc. Results: We present results of the correlation analysis of distributions of the presence/absence of short nucleotide subsequences of different lengths ('n-mers', n = 5 - 20) in more than 1500 microbial and virus genomes, together with five genomes of multicellular organisms (including human). We determine whether a given n-mer is present or absent in a given genome (frequency of presence), rather than the more commonly calculated number of appearances of n-mers in one or more genomes (frequency of appearance). For organisms that are not close relatives of each other, the presence/absence of different 7-20mers in their genomes is not correlated. For close biological relatives, some correlation of the presence of n-mers in this range appears, but is not as strong as expected. The suppressed correlations among the n-mers present in different genomes lead to the possibility of using random sets of n-mers (with appropriately chosen n) to discriminate genomes of different organisms, and possibly individual genomes of the same species, including human, with a low probability of error.
AB - Motivation: Analysis of statistical properties of DNA sequences is important for evolutionary biology as well as for DNA probe and PCR technologies. These technologies, in turn, can be used for organism identification, which implies applications in the diagnosis of infectious diseases, environmental studies, etc. Results: We present results of the correlation analysis of distributions of the presence/absence of short nucleotide subsequences of different lengths ('n-mers', n = 5 - 20) in more than 1500 microbial and virus genomes, together with five genomes of multicellular organisms (including human). We determine whether a given n-mer is present or absent in a given genome (frequency of presence), rather than the more commonly calculated number of appearances of n-mers in one or more genomes (frequency of appearance). For organisms that are not close relatives of each other, the presence/absence of different 7-20mers in their genomes is not correlated. For close biological relatives, some correlation of the presence of n-mers in this range appears, but is not as strong as expected. The suppressed correlations among the n-mers present in different genomes lead to the possibility of using random sets of n-mers (with appropriately chosen n) to discriminate genomes of different organisms, and possibly individual genomes of the same species, including human, with a low probability of error.
UR - http://www.scopus.com/inward/record.url?scp=7244254459&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=7244254459&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/bth266
DO - 10.1093/bioinformatics/bth266
M3 - Article
C2 - 15087315
AN - SCOPUS:7244254459
SN - 1367-4803
VL - 20
SP - 2421
EP - 2428
JO - Bioinformatics
JF - Bioinformatics
IS - 15
ER -