NPACT: Better understanding the protein coding potential of genomic sequences

14 Aug 2015


Luciano Brocchieri, Genetics Institute faculty member, and postdoctoral student Steve Oden have developed a computational program to help generate more accurate representations of the protein coding potential of genomic sequences.

Brocchieri is also an assistant professor in the department of molecular genetics and microbiology.

In June, they published a paper in Bioinformatics about their program, the N-Profile Analysis Computational Tool (NPACT). The paper is titled “Quantitative frame analysis and the annotation of GC-rich (and other) prokaryotic genomes. An application to Anaeromyxobacter dehalogenans.” In the paper, which can be accessed here, the program is applied to the analysis of the GC-rich bacterium Anaeromyxobacter dehalogenans.

“We developed a new method to identify the genes based on the analysis of the genome sequence and also a web-based interface which allows users to apply these methods,” Brocchieri said, “and to visualize the position of potential newly identified genes in relation to genome sequence features and previous annotations of the genome. This visual comparison greatly facilitates identifying interesting sequence features that are not accounted for in the previous annotation.”

Researchers can access the program here.

The bacterium they analyzed is potentially significant with regard to bioremediation. Brocchieri and Oden performed the analysis with the intention of identifying the genetic features that would explain differences in efficiency and specificity in contaminant transformation between different strains of this bacterium.

These characteristics would potentially enable scientists to introduce the bacterium into toxic environments, where it could sequester or transform contaminants, such as halogenated organic compounds and uranium, into less toxic forms.

To use the program, a researcher submits a genomic sequence in the form of a text document, with the option of including a pre-existing annotation. The program then searches for regions that have a periodicity in nucleotide usage typical of protein-coding genes.
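To make the idea of a three-base periodicity concrete, here is a minimal Python sketch. It is not NPACT's actual statistic or code: it simply measures GC content separately at each of the three codon positions within sliding windows and flags windows where those values differ strongly. The function names, window size, step and threshold are all illustrative assumptions.

```python
def gc_by_codon_position(seq, frame=0):
    """Return the GC fraction at codon positions 1, 2 and 3 of seq, read in the given frame."""
    seq = seq.upper()
    gc_counts = [0, 0, 0]
    totals = [0, 0, 0]
    for i in range(frame, len(seq)):
        pos = (i - frame) % 3
        base = seq[i]
        if base in "ACGT":  # skip ambiguous bases such as N
            totals[pos] += 1
            if base in "GC":
                gc_counts[pos] += 1
    return [c / t if t else 0.0 for c, t in zip(gc_counts, totals)]


def periodicity_contrast(seq):
    """Spread between the highest and lowest position-specific GC fraction.

    A value near 0 means no three-base periodicity in GC usage;
    larger values suggest a codon-like pattern. (Illustrative measure only.)
    """
    fractions = gc_by_codon_position(seq)
    return max(fractions) - min(fractions)


def scan_windows(seq, window=300, step=150, threshold=0.3):
    """Return (start, end, contrast) for windows whose contrast reaches the threshold.

    window, step and threshold are arbitrary values chosen for illustration.
    """
    hits = []
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        chunk = seq[start:start + window]
        contrast = periodicity_contrast(chunk)
        if contrast >= threshold:
            hits.append((start, start + len(chunk), contrast))
    return hits
```

Because the measure uses only the composition of the sequence itself, it needs no prior model of what genes look like in a particular genome, which is the sense in which such a search can be called agnostic.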

The most powerful gene prediction programs rely on some knowledge of the composition of genes in a genome to build a template as a guide. This program does not.

“Our approach differs from other popular gene-prediction methods in that it’s agnostic,” Oden said. “It doesn’t need the information other programs need on how the genes look like in a particular genome. It just searches for significant periodicities based on general compositional properties of the sequence.”

The program returns a variety of results in the form of text documents, including lists showing where the program has located genes that were missed in the original genome annotation, errors such as misidentified genes or truncated genes from original data, and collections of previously predicted genes that do not exhibit the expected periodicity.

However, one of its most significant features is its graphical representation of the results.

Researchers can use NPACT to browse the entire genome, matching previous annotations to periodicity features. The program directs their attention specifically to those features that indicate the presence of missed or misannotated genes, enabling them to identify potential errors and omissions in previous annotations.

NPACT also provides nucleotide sequences and protein translations for any identified gene, making it possible to perform further analyses, such as checking for evolutionary conservation of the proposed genes.
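As an illustration of what a protein translation of a predicted gene involves, the following Python sketch translates a nucleotide sequence using the standard codon assignments. It is not taken from NPACT; the function names are hypothetical, and the sketch ignores details such as alternative start codons in bacteria.

```python
from itertools import product

# Standard codon table, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, for genes annotated on the opposite strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate seq codon by codon; '*' marks a stop, 'X' an unrecognized codon.

    Trailing bases that do not form a complete codon are ignored.
    """
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)
```

A translated sequence of this kind is what one would submit to a protein similarity search to check whether a proposed gene is conserved in other organisms.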

When Brocchieri and Oden performed an analysis of two Anaeromyxobacter dehalogenans strains, they found gaps in the original analysis of one of them.

“Of the two strains, this one was missing these pathways in the annotation,” Oden said. “But this program unambiguously found them in those gaps. These two strains are more similar than the [original] annotation would suggest.”