Statistical properties of short subsequences in microbial genomes and their link to pathogen identification and evolution

Meizhuo Zhang, Catherine Putonti, Sergei Chumakov, Adhish Gupta, George E. Fox, Dan Graur, Yuriy Fofanov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Numerous sequencing projects have produced partial and complete microbial genomes. The volume of data far exceeds what can be analyzed by hand and therefore requires computational methods. A significant amount of work has focused on how statistical characteristics vary along microbial genomic sequences, e.g. codon bias, G+C content, and the frequencies of short subsequences (n-mers). These studies led to two observations: (1) regions whose statistical properties (e.g. codon bias) differ from the rest of the genomic sequence correlate with evolutionarily significant regions, e.g. regions of horizontal gene transfer; and (2) because no two microbial genomes look statistically identical, statistical properties can be used to distinguish between genomic sequences. Recently, we conducted an extensive analysis of the presence/absence of n-mers in many microbial genomes as well as several viral and eukaryotic genomes. This analysis revealed that the presence of n-mers in all genomes considered (for the range of n in which the condition M << 4^n holds, where M is the genome length) can be treated as a nearly random and independent process. We therefore hypothesize that relatively small sets of randomly picked n-mers can be used to differentiate between microorganisms. We also analyzed the frequency of appearance of all 8- to 12-mers present in each of the 200+ publicly available microbial genomes. For nearly all of the genomes under consideration, we observed that some n-mers are present much more frequently than expected: from 50 to over a thousand copies. Upon closer inspection of these sequences, we found several cases in which an overrepresented n-mer is biased towards either coding or non-coding regions. Although the evolutionary reason for the conservation of such sequences remains unclear, in some cases it is plausible that a clear bias for non-coding regions reflects a role in the DNA uptake/recombination process, membership in insertion sequences, or function as transcription factor recognition sites. Our analysis of the frequency of appearance of 6-mers in each microbial genome revealed regions that display unusual statistical properties with respect to the rest of their own genome. After inspecting the genes contained within these regions, we believe that they are likely to have been acquired through horizontal gene transfer.
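The n-mer statistics described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the single-strand counting, the Poisson approximation for the random model, and the copy-number threshold are all assumptions made for illustration. Under a uniform random model, each of the 4^n possible n-mers is expected about M / 4^n times in a genome of length M, so when M << 4^n most n-mers are absent and any n-mer with many copies stands out.

```python
import math
import random
from collections import Counter

ALPHABET = frozenset("ACGT")


def count_nmers(seq: str, n: int) -> Counter:
    """Count every overlapping n-mer on one strand, skipping words that
    contain anything other than A, C, G, T (e.g. ambiguity code N)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if ALPHABET.issuperset(word):
            counts[word] += 1
    return counts


def expected_present_fraction(positions: int, n: int) -> float:
    """Fraction of the 4**n possible n-mers expected to occur at least once
    among `positions` word positions of a uniformly random sequence.
    Uses a Poisson approximation, which is an assumption of this sketch."""
    return 1.0 - math.exp(-positions / 4 ** n)


def overrepresented(counts: Counter, min_copies: int) -> dict:
    """Return the n-mers seen at least `min_copies` times.
    The threshold is illustrative, not a value taken from the paper."""
    return {word: c for word, c in counts.items() if c >= min_copies}


if __name__ == "__main__":
    random.seed(0)
    n, genome_length = 10, 20_000  # toy values chosen so that M << 4**n
    genome = "".join(random.choice("ACGT") for _ in range(genome_length))
    counts = count_nmers(genome, n)
    observed = len(counts) / 4 ** n
    expected = expected_present_fraction(genome_length - n + 1, n)
    print(f"observed fraction of 10-mers present: {observed:.4f}")
    print(f"expected under the random model:      {expected:.4f}")
```

On a real genome, the same counts could be compared against the random-model expectation to flag n-mers with tens to thousands of copies, which is the kind of overrepresentation the abstract reports.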

Original language: English (US)
Title of host publication: AIP Conference Proceedings
Pages: 13-18
Number of pages: 6
Volume: 854
DOIs: https://doi.org/10.1063/1.2356390
State: Published - 2006
Externally published: Yes
Event: 9th Mexican Symposium on Medical Physics - Guadalajara, Jalisco, Mexico
Duration: Mar 18 2006 to Mar 23 2006

Other

Other: 9th Mexican Symposium on Medical Physics
Country: Mexico
City: Guadalajara, Jalisco
Period: 3/18/06 to 3/23/06

Fingerprint

pathogens
genome
genes
inspection
sequencing
random processes
microorganisms
conservation
insertion
coding
deoxyribonucleic acid

Keywords

  • Pathogen identification
  • Short subsequences
  • Statistical properties

ASJC Scopus subject areas

  • Physics and Astronomy(all)

Cite this

Zhang, M., Putonti, C., Chumakov, S., Gupta, A., Fox, G. E., Graur, D., & Fofanov, Y. (2006). Statistical properties of short subsequences in microbial genomes and their link to pathogen identification and evolution. In AIP Conference Proceedings (Vol. 854, pp. 13-18) https://doi.org/10.1063/1.2356390

@inproceedings{d7a562c358d444ffa47acd729fef44ae,
title = "Statistical properties of short subsequences in microbial genomes and their link to pathogen identification and evolution",
keywords = "Pathogen identification, Short subsequences, Statistical properties",
author = "Meizhuo Zhang and Catherine Putonti and Sergei Chumakov and Adhish Gupta and Fox, {George E.} and Dan Graur and Yuriy Fofanov",
year = "2006",
doi = "10.1063/1.2356390",
language = "English (US)",
volume = "854",
pages = "13--18",
booktitle = "AIP Conference Proceedings",

}

TY - GEN

T1 - Statistical properties of short subsequences in microbial genomes and their link to pathogen identification and evolution

AU - Zhang, Meizhuo

AU - Putonti, Catherine

AU - Chumakov, Sergei

AU - Gupta, Adhish

AU - Fox, George E.

AU - Graur, Dan

AU - Fofanov, Yuriy

PY - 2006

Y1 - 2006

KW - Pathogen identification

KW - Short subsequences

KW - Statistical properties

UR - http://www.scopus.com/inward/record.url?scp=33846531154&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33846531154&partnerID=8YFLogxK

U2 - 10.1063/1.2356390

DO - 10.1063/1.2356390

M3 - Conference contribution

VL - 854

SP - 13

EP - 18

BT - AIP Conference Proceedings

ER -