Effect of the mutation rate and background size on the quality of pathogen identification

Chris Reed, Viacheslav Fofanov, Catherine Putonti, Sergei Chumakov, Tom Slezak, Yuriy Fofanov

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Motivation: Genomic-based methods have significant potential for fast and accurate identification of organisms, or even genes of interest, in complex environmental samples (air, water, soil, food, etc.), especially when the target organism cannot be isolated for a variety of reasons. Despite this potential, the presence of unknown, variable and usually large quantities of background DNA can cause interference that results in false positive outcomes. Results: To estimate how the genomic diversity of the background (the total length of all the different genomes present in the background), the target length and the target mutation rate affect the probability of misidentification, we introduce a mathematical definition for the quality of an individual signature in the presence of a background, based on the signature's length and the number of mismatches needed to transform it into the closest subsequence present in the background. This definition, in conjunction with a probabilistic framework, allows one to predict the minimal signature length required to identify the target in the presence of backgrounds of different sizes, as well as the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data examples. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicts that no genomic signature will be able to identify the target if its mutation rate is > 5%.
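
The mismatch count that the abstract's quality definition relies on can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this is only an assumption-laden illustration: it treats "mismatches" as substitutions only (Hamming distance), scans a single strand of a single background string, and uses a hypothetical function name, `min_mismatches`, that does not come from the paper.

```python
def min_mismatches(signature: str, background: str) -> int:
    """Return the smallest number of substitutions needed to turn
    `signature` into some equal-length window of `background`.

    Assumption: mismatches are counted as Hamming distance; insertions,
    deletions and reverse complements are not considered.
    """
    n = len(signature)
    if n == 0 or len(background) < n:
        raise ValueError("signature must be non-empty and no longer than background")

    best = n  # the worst case is that every position differs
    for start in range(len(background) - n + 1):
        window = background[start:start + n]
        # Count positions where the signature and the window disagree.
        distance = sum(a != b for a, b in zip(signature, window))
        if distance < best:
            best = distance
            if best == 0:  # an exact copy exists, so no smaller value is possible
                break
    return best


# Hypothetical toy sequences, chosen only for illustration.
print(min_mismatches("ACGT", "TTACGATT"))
print(min_mismatches("ACGT", "GGACGTGG"))
```

Under this reading, a larger value means the signature is further from anything in the background, which is the sense in which a longer signature or a smaller background would be expected to help. A brute-force scan like this is only practical for short toy inputs.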

Original language: English (US)
Pages (from-to): 2665-2671
Number of pages: 7
Journal: Bioinformatics
Volume: 23
Issue number: 20
DOIs: 10.1093/bioinformatics/btm420
State: Published - Oct 15 2007
Externally published: Yes

Fingerprint

Pathogens
Mutation Rate
Mutation
Genes
Target
Signature
Genomics
Genome Size
Identification (control systems)
DNA
Soil
Air
Genome
Soils
Food
Water
Background
Subsequence
False Positive
Isolation

ASJC Scopus subject areas

  • Clinical Biochemistry
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

Effect of the mutation rate and background size on the quality of pathogen identification. / Reed, Chris; Fofanov, Viacheslav; Putonti, Catherine; Chumakov, Sergei; Slezak, Tom; Fofanov, Yuriy.

In: Bioinformatics, Vol. 23, No. 20, 15.10.2007, p. 2665-2671.

Reed, Chris ; Fofanov, Viacheslav ; Putonti, Catherine ; Chumakov, Sergei ; Slezak, Tom ; Fofanov, Yuriy. / Effect of the mutation rate and background size on the quality of pathogen identification. In: Bioinformatics. 2007 ; Vol. 23, No. 20. pp. 2665-2671.
@article{e7fe72c8519b4003bf07ee84516e5a2e,
title = "Effect of the mutation rate and background size on the quality of pathogen identification",
abstract = "Motivation: Genomic-based methods have significant potential for fast and accurate identification of organisms or even genes of interest in complex environmental samples (air, water, soil, food, etc.), especially when isolation of the target organism cannot be performed by a variety of reasons. Despite this potential, the presence of the unknown, variable and usually large quantities of background DNA can cause interference resulting in false positive outcomes. Results: In order to estimate how the genomic diversity of the background (total length of all of the different genomes present in the background), target length and target mutation rate affect the probability of misidentifications, we introduce a mathematical definition for the quality of an individual signature in the presence of a background based on its length and number of mismatches needed to transform the signature into the closest subsequence present in the background. This definition, in conjunction with a probabilistic framework, allows one to predict the minimal signature length required to identify the target in the presence of different sizes of backgrounds and the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data examples. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicted that any genomic signatures will be unable to identify target if its mutation rate is > 5{\%}.",
author = "Chris Reed and Viacheslav Fofanov and Catherine Putonti and Sergei Chumakov and Tom Slezak and Yuriy Fofanov",
year = "2007",
month = "10",
day = "15",
doi = "10.1093/bioinformatics/btm420",
language = "English (US)",
volume = "23",
pages = "2665--2671",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "20",
}

TY - JOUR

T1 - Effect of the mutation rate and background size on the quality of pathogen identification

AU - Reed, Chris

AU - Fofanov, Viacheslav

AU - Putonti, Catherine

AU - Chumakov, Sergei

AU - Slezak, Tom

AU - Fofanov, Yuriy

PY - 2007/10/15

Y1 - 2007/10/15

N2 - Motivation: Genomic-based methods have significant potential for fast and accurate identification of organisms or even genes of interest in complex environmental samples (air, water, soil, food, etc.), especially when isolation of the target organism cannot be performed by a variety of reasons. Despite this potential, the presence of the unknown, variable and usually large quantities of background DNA can cause interference resulting in false positive outcomes. Results: In order to estimate how the genomic diversity of the background (total length of all of the different genomes present in the background), target length and target mutation rate affect the probability of misidentifications, we introduce a mathematical definition for the quality of an individual signature in the presence of a background based on its length and number of mismatches needed to transform the signature into the closest subsequence present in the background. This definition, in conjunction with a probabilistic framework, allows one to predict the minimal signature length required to identify the target in the presence of different sizes of backgrounds and the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data examples. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicted that any genomic signatures will be unable to identify target if its mutation rate is > 5%.

AB - Motivation: Genomic-based methods have significant potential for fast and accurate identification of organisms or even genes of interest in complex environmental samples (air, water, soil, food, etc.), especially when isolation of the target organism cannot be performed by a variety of reasons. Despite this potential, the presence of the unknown, variable and usually large quantities of background DNA can cause interference resulting in false positive outcomes. Results: In order to estimate how the genomic diversity of the background (total length of all of the different genomes present in the background), target length and target mutation rate affect the probability of misidentifications, we introduce a mathematical definition for the quality of an individual signature in the presence of a background based on its length and number of mismatches needed to transform the signature into the closest subsequence present in the background. This definition, in conjunction with a probabilistic framework, allows one to predict the minimal signature length required to identify the target in the presence of different sizes of backgrounds and the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data examples. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicted that any genomic signatures will be unable to identify target if its mutation rate is > 5%.

UR - http://www.scopus.com/inward/record.url?scp=35748932362&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35748932362&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btm420

DO - 10.1093/bioinformatics/btm420

M3 - Article

VL - 23

SP - 2665

EP - 2671

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 20

ER -