The Gene Ontology (GO) project provides a structured vocabulary of biological terms that researchers use to standardize references to biological entities. Genes may be annotated with GO terms to indicate their roles or localizations in the cell. GO has been used in conjunction with high-throughput experimental methods, such as microarrays. In this setting, the goal is to determine whether sets of genes identified by the high-throughput experiment are enriched for GO terms: do certain terms annotate more genes in the identified set than one would expect by chance? Enriched terms are taken as a potential summary of the cellular function of the identified genes and may provide clues leading to new directions for investigation. Current methods for determining whether sets of genes are GO-enriched have well-known shortcomings; in particular, many do not take the hierarchical structure of the ontology into account. We address this drawback by introducing a new statistical test (TreeHugger) based on a novel per-gene scoring scheme for GO terms. Given a set of genes and a specified subset of those genes, our method determines enrichment of GO terms in the subset, taking into account the structure of the ontology and ascribing a lower weight to terms that do not themselves directly annotate the given genes. Tests on simulated and real data indicate that our method is a conservative test for enrichment. Applying TreeHugger to a biological example shows that it also reduces the redundancy that arises when standard enrichment tests give high scores to indirect annotations.
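For orientation, the question "do certain terms annotate more genes in the identified set than expected?" is commonly answered, term by term, with a one-sided hypergeometric test. The sketch below illustrates that conventional baseline only; it is not TreeHugger, whose per-gene scoring scheme is not specified in this abstract. The function name, parameter names, and the toy counts are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, subset, annotated_in_subset):
    """One-sided hypergeometric p-value for a single GO term.

    population          -- total number of genes considered
    annotated           -- genes in the population annotated with the term
    subset              -- size of the identified gene subset
    annotated_in_subset -- genes in the subset annotated with the term

    Returns P(X >= annotated_in_subset), where X is the number of
    annotated genes in a subset drawn uniformly without replacement.
    """
    if not (0 <= annotated <= population and 0 <= subset <= population):
        raise ValueError("counts must satisfy 0 <= annotated, subset <= population")

    lower = max(annotated_in_subset, 0)
    upper = min(annotated, subset)  # X cannot exceed either count

    # Number of subsets containing exactly k annotated genes, summed over the tail.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, subset - k)
        for k in range(lower, upper + 1)
    )
    return tail / comb(population, subset)


# Hypothetical counts: 1000 genes, 40 carry the term, 50 genes identified,
# 8 of those carry the term.
p_value = hypergeom_enrichment_p(1000, 40, 50, 8)
```

Because this baseline scores each term independently, a parent term inherits the annotations of its descendants and tends to be reported alongside them, which is the redundancy the abstract says a hierarchy-aware test is meant to reduce.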
- Data analysis
ASJC Scopus subject areas
- Information Systems
- Computer Science Applications
- Management Science and Operations Research