How independent are the appearances of n-mers in different genomes?

Yuriy Fofanov, Yi Luo, Charles Katili, Jim Wang, Yuri Belosludtsev, Thomas Powdrill, Chetan Belapurkar, Viacheslav Fofanov, Tong Bin Li, Sergey Chumakov, Bernard Pettitt

Research output: Contribution to journal › Article

41 Citations (Scopus)

Abstract

Motivation: Analysis of the statistical properties of DNA sequences is important for evolutionary biology as well as for DNA probe and PCR technologies. These technologies, in turn, can be used for organism identification, with applications in the diagnosis of infectious diseases, environmental studies, etc.

Results: We present a correlation analysis of the distributions of the presence/absence of short nucleotide subsequences of different lengths ('n-mers', n = 5-20) in more than 1500 microbial and virus genomes, together with five genomes of multicellular organisms (including human). We determine whether a given n-mer is present or absent in a given genome (frequency of presence), rather than the more commonly calculated number of appearances of n-mers in one or more genomes (frequency of appearance). For organisms that are not close relatives of each other, the presence/absence of different 7-20-mers in their genomes is not correlated. For close biological relatives, some correlation of the presence of n-mers in this range appears, but it is not as strong as expected. The suppressed correlations among the n-mers present in different genomes lead to the possibility of using random sets of n-mers (with appropriately chosen n) to discriminate genomes of different organisms, and possibly individual genomes of the same species (including human), with a low probability of error.
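The difference between frequency of presence and frequency of appearance, and the idea of correlating presence across two genomes, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the authors' procedure: the function names, the single-strand treatment (no reverse complement), the restriction to the A/C/G/T alphabet, the use of the Pearson (phi) coefficient as the correlation measure, and the random toy "genomes".

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"  # assumption: sequences contain only these four letters


def nmer_counts(seq, n):
    """Frequency of appearance: how many times each n-mer occurs in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def nmer_presence(seq, n):
    """Frequency of presence: the set of n-mers that occur at least once."""
    return set(nmer_counts(seq, n))


def presence_correlation(present_a, present_b, n):
    """Pearson (phi) correlation of the 0/1 presence indicators of two
    genomes, taken over all 4**n possible n-mers."""
    total = 4 ** n
    pa = len(present_a) / total                 # fraction present in A
    pb = len(present_b) / total                 # fraction present in B
    pab = len(present_a & present_b) / total    # fraction present in both
    denom = math.sqrt(pa * (1 - pa) * pb * (1 - pb))
    if denom == 0:
        # Undefined when a genome contains none or all of the n-mers.
        return float("nan")
    return (pab - pa * pb) / denom


# Toy data: two independent random sequences standing in for unrelated genomes.
rng = random.Random(0)
genome_a = "".join(rng.choice(ALPHABET) for _ in range(5000))
genome_b = "".join(rng.choice(ALPHABET) for _ in range(5000))

n = 7
set_a = nmer_presence(genome_a, n)
set_b = nmer_presence(genome_b, n)

r_ab = presence_correlation(set_a, set_b, n)  # unrelated sequences
r_aa = presence_correlation(set_a, set_a, n)  # a sequence against itself

print(f"correlation, unrelated sequences: {r_ab:.4f}")
print(f"correlation, identical sequences: {r_aa:.4f}")
```

In this sketch a repeated n-mer raises the count returned by `nmer_counts` but contributes only once to the set returned by `nmer_presence`, which is the distinction the abstract draws. For real genomes one would read sequences from files and decide how to handle ambiguous bases and the complementary strand; those choices are not specified in the abstract.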

Original language: English (US)
Pages (from-to): 2421-2428
Number of pages: 8
Journal: Bioinformatics
Volume: 20
Issue number: 15
DOI: https://doi.org/10.1093/bioinformatics/bth266
State: Published - Oct 12 2004
Externally published: Yes

Fingerprint

Genome
Genes
Microbial Genome
Technology
DNA Probes
Random Sets
DNA sequences
Correlation Analysis
Infectious Diseases
Nucleotides
Communicable Diseases
Subsequence
Viruses
DNA Sequence
Statistical property
Virus
Biology
DNA
Probe
Polymerase Chain Reaction

ASJC Scopus subject areas

  • Clinical Biochemistry
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

How independent are the appearances of n-mers in different genomes? / Fofanov, Yuriy; Luo, Yi; Katili, Charles; Wang, Jim; Belosludtsev, Yuri; Powdrill, Thomas; Belapurkar, Chetan; Fofanov, Viacheslav; Li, Tong Bin; Chumakov, Sergey; Pettitt, Bernard.

In: Bioinformatics, Vol. 20, No. 15, 12.10.2004, p. 2421-2428.

Fofanov, Y, Luo, Y, Katili, C, Wang, J, Belosludtsev, Y, Powdrill, T, Belapurkar, C, Fofanov, V, Li, TB, Chumakov, S & Pettitt, B 2004, 'How independent are the appearances of n-mers in different genomes?', Bioinformatics, vol. 20, no. 15, pp. 2421-2428. https://doi.org/10.1093/bioinformatics/bth266
Fofanov Y, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T et al. How independent are the appearances of n-mers in different genomes? Bioinformatics. 2004 Oct 12;20(15):2421-2428. https://doi.org/10.1093/bioinformatics/bth266
Fofanov, Yuriy; Luo, Yi; Katili, Charles; Wang, Jim; Belosludtsev, Yuri; Powdrill, Thomas; Belapurkar, Chetan; Fofanov, Viacheslav; Li, Tong Bin; Chumakov, Sergey; Pettitt, Bernard. / How independent are the appearances of n-mers in different genomes? In: Bioinformatics. 2004; Vol. 20, No. 15. pp. 2421-2428.
@article{c55e377cfdf14ea1ae8602aeeff3e5b6,
title = "How independent are the appearances of n-mers in different genomes?",
abstract = "Motivation: Analysis of the statistical properties of DNA sequences is important for evolutionary biology as well as for DNA probe and PCR technologies. These technologies, in turn, can be used for organism identification, with applications in the diagnosis of infectious diseases, environmental studies, etc. Results: We present a correlation analysis of the distributions of the presence/absence of short nucleotide subsequences of different lengths ('n-mers', n = 5-20) in more than 1500 microbial and virus genomes, together with five genomes of multicellular organisms (including human). We determine whether a given n-mer is present or absent in a given genome (frequency of presence), rather than the more commonly calculated number of appearances of n-mers in one or more genomes (frequency of appearance). For organisms that are not close relatives of each other, the presence/absence of different 7-20-mers in their genomes is not correlated. For close biological relatives, some correlation of the presence of n-mers in this range appears, but it is not as strong as expected. The suppressed correlations among the n-mers present in different genomes lead to the possibility of using random sets of n-mers (with appropriately chosen n) to discriminate genomes of different organisms, and possibly individual genomes of the same species (including human), with a low probability of error.",
author = "Yuriy Fofanov and Yi Luo and Charles Katili and Jim Wang and Yuri Belosludtsev and Thomas Powdrill and Chetan Belapurkar and Viacheslav Fofanov and Li, {Tong Bin} and Sergey Chumakov and Bernard Pettitt",
year = "2004",
month = "10",
day = "12",
doi = "10.1093/bioinformatics/bth266",
language = "English (US)",
volume = "20",
pages = "2421--2428",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "15",
}

TY - JOUR

T1 - How independent are the appearances of n-mers in different genomes?

AU - Fofanov, Yuriy

AU - Luo, Yi

AU - Katili, Charles

AU - Wang, Jim

AU - Belosludtsev, Yuri

AU - Powdrill, Thomas

AU - Belapurkar, Chetan

AU - Fofanov, Viacheslav

AU - Li, Tong Bin

AU - Chumakov, Sergey

AU - Pettitt, Bernard

PY - 2004/10/12

Y1 - 2004/10/12

AB - Motivation: Analysis of the statistical properties of DNA sequences is important for evolutionary biology as well as for DNA probe and PCR technologies. These technologies, in turn, can be used for organism identification, with applications in the diagnosis of infectious diseases, environmental studies, etc. Results: We present a correlation analysis of the distributions of the presence/absence of short nucleotide subsequences of different lengths ('n-mers', n = 5-20) in more than 1500 microbial and virus genomes, together with five genomes of multicellular organisms (including human). We determine whether a given n-mer is present or absent in a given genome (frequency of presence), rather than the more commonly calculated number of appearances of n-mers in one or more genomes (frequency of appearance). For organisms that are not close relatives of each other, the presence/absence of different 7-20-mers in their genomes is not correlated. For close biological relatives, some correlation of the presence of n-mers in this range appears, but it is not as strong as expected. The suppressed correlations among the n-mers present in different genomes lead to the possibility of using random sets of n-mers (with appropriately chosen n) to discriminate genomes of different organisms, and possibly individual genomes of the same species (including human), with a low probability of error.

UR - http://www.scopus.com/inward/record.url?scp=7244254459&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=7244254459&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/bth266

DO - 10.1093/bioinformatics/bth266

M3 - Article

VL - 20

SP - 2421

EP - 2428

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 15

ER -