Annotation of Bacterial Genomes
New Methods of Computational Annotation of Bacterial Genomes: Development and Application
Tech Area / Field
- BIO-CGM/Cytology, Genetics and Molecular Biology/Biotechnology
8 Project completed
Senior Project Manager
Institute of Strategic Stability, Russia, Moscow
- Institute of Problems of Information Transmission, Russia, Moscow
- University of British Columbia / The Center for Molecular Medicine and Therapeutics, Canada, BC, Vancouver
Project summary
The development of industrial-scale sequencing of complete genomes has led to exponential growth of available data and made it necessary to develop methods for computer analysis of genome data. This problem has been set as a first priority in many countries, including Russia and the USA (the “Genomes for Life” program of the Department of Energy). Current methods of computer annotation of genomes are based on analysis of protein homologies and draw on other methods of analysis, including positional clusters, phylogenetic patterns and metabolic reconstruction. Further development of experimental techniques is leading to growth of data in other areas as well, for instance in analysis of gene expression using microarrays.
One of the important areas of genome analysis is prediction of signals regulating gene expression. This problem is not only of fundamental interest for understanding the basic processes within a bacterial cell, but also has applications in other areas of computer genomics (e.g. genome annotation). It has practical implications as well, e.g. in pharmacology and biotechnology.
From the purely scientific point of view, it is important to understand not only what a cell produces (the answer to this question is given by metabolic reconstruction), but also when and under what conditions specific metabolic pathways and physiological systems are switched on and off. Identification of regulatory signals is a necessary and natural follow-up to any experiment involving mass analysis of gene expression. Moreover, without such analysis the results of these experiments are of limited interest, as there are no other ways to assign regulators to individual subsystems and to describe regulatory cascades.
In many cases analysis of regulation leads to predictions that could not be made by other methods. In particular, it allows one to predict the specificity of transporters and regulatory proteins from multigene families and to identify new enzymes missing in reconstructed pathways. Analysis of regulation is important for the creation of hyperproducing strains and of strains with new properties through introduction of foreign genes, e.g. for bioremediation. Finally, many regulatory systems are important in medicine as potential targets for drugs of both narrow and wide specificity.
Thus, two types of problems arise: (1) identification of regulatory signals in noisy data about co-regulation of groups of genes and (2) search for new regulatory sites in a genome, that is, for genes that are new members of co-regulated groups. It should be noted that there exist two major types of regulatory signals: transcriptional operators with which regulatory proteins interact, and transcriptional and translational signals that involve formation of RNA secondary structure. This project aims at research in both areas, development of effective algorithms for identification of regulatory signals, and application of these algorithms to analysis of specific regulatory systems important in practice.
The problem of identification of regulatory signals is a well-known one. We have developed and implemented an effective algorithm for signal identification, and its testing on simulated and real data and benchmarking with other existing programs demonstrated its practical applicability. We plan to perform mass analysis of regulatory signals in two important groups of bacteria: gamma-proteobacteria (including enterobacteria, pasteurellas, vibrios) and gram-positive bacteria from the bacillus/clostridium group. Orthologous genes will be considered, for which conservation of regulatory signals can be assumed, and then common signals in upstream regions of these genes will be identified.
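The second problem type described above, searching a genome for new members of a co-regulated group, is commonly approached by scoring upstream regions against a profile built from known sites. The following is a minimal sketch of that general idea using a log-odds position weight matrix; it is not the project's algorithm, and the function names, the uniform background frequency and the pseudocount are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites.

    Assumes a uniform background base frequency (hypothetical default).
    """
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(length):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; return a list of (start, score)."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits


def best_hit(sequence, pwm):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


# Toy example: made-up sites and a made-up upstream region.
known_sites = ["TGTGA", "TGTGA", "TGTGC", "TGTGA"]
upstream = "AACCTGTGAGGC"
print(best_hit(upstream, build_pwm(known_sites)))
```

In a comparative setting, one would apply such a scan to the upstream regions of orthologous genes in several genomes and retain candidates that score well in most of them.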
We plan to perform detailed comparative analysis of bacterial regulatory systems important from the practical point of view. Our group has pioneered this approach, and the majority of research papers in this area are still published by the participants of this project. Many predictions have been confirmed experimentally, both by our collaborators and independently. We plan to analyze systems for iron and zinc utilization (pathogenesis), quorum sensing (stress and pathogenesis), response to heat and hyperoxide shock (stress), and heavy metal tolerance (stress). Research in the latter two areas will be done in collaboration with the Lawrence Berkeley National Laboratory (USA) in the framework of the “Genomes for Life” program (Department of Energy of the USA). The regulatory systems to be analyzed are important for bioremediation.
Methods for analysis of RNA regulatory signals are much less developed. Standard programs searching for regulatory sites are not suitable for analysis of secondary structure patterns. To our knowledge, no programs for identification of conserved regulatory structures have been published so far. We plan to implement two approaches to identification of such structures. Firstly, we will utilize the fact that regulatory structures are often conserved in related genomes and thus can be found by alignment of intergenic regions of orthologous genes. A preliminary version of this algorithm has been implemented already and successfully tested on a model system (tRNAs). Secondly, we will use the common features of most regulatory RNA structures, namely their ability to assume alternative conformations. This algorithm also has been implemented and applied to a number of known systems (attenuators of amino acid operons in gamma-proteobacteria), which resulted in identification of several novel attenuators. We plan to apply both algorithms to analysis of orthologous genes of bacteria from the taxonomic groups mentioned above.
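The second approach relies on the ability of regulatory RNA structures to assume alternative conformations. As a rough illustration only (not the implemented algorithm), the sketch below finds perfect Watson-Crick hairpins of a fixed stem length and then reports pairs of hairpins that compete for the same segment, as the two alternative hairpins of an attenuator do. The stem length, loop range and function names are assumptions; a realistic tool would allow G-U pairs, mismatches and energy-based scoring.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def find_hairpins(rna, stem=4, min_loop=3, max_loop=8):
    """Return hairpins as (left_start, left_end, right_start, right_end).

    Only perfect Watson-Crick stems of exactly `stem` base pairs are found;
    intervals are half-open.
    """
    hairpins = []
    n = len(rna)
    for i in range(n - 2 * stem - min_loop + 1):
        left = rna[i:i + stem]
        # The right arm must be the reverse complement of the left arm.
        target = "".join(COMPLEMENT[base] for base in reversed(left))
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem > n:
                break
            if rna[j:j + stem] == target:
                hairpins.append((i, i + stem, j, j + stem))
    return hairpins


def alternative_pairs(hairpins):
    """Pairs (a, b) where the right arm of a overlaps the left arm of b.

    Such hairpins cannot form at the same time, so they are candidate
    alternative conformations.
    """
    pairs = []
    for a in hairpins:
        for b in hairpins:
            if a[0] < b[0] and a[2] < b[1] and b[0] < a[3]:
                pairs.append((a, b))
    return pairs


# Toy sequence: the middle segment GUCC can pair with either flanking GGAC.
rna = "GGACAAAAGUCCAAAAGGAC"
hairpins = find_hairpins(rna)
print(hairpins)
print(alternative_pairs(hairpins))
```

In the toy sequence the middle segment can pair with either flanking segment but not with both at once, which is the kind of mutually exclusive arrangement the search is meant to detect.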
We have successfully applied the comparative analysis of RNA-based regulation to several metabolic systems. This resulted in identification of numerous new members of these systems, mainly transporters, but also several enzymes. Completely new systems of regulation of vitamin metabolism (riboflavin and thiamin) were predicted. For the first time it was predicted that formation of alternative secondary structures is directly regulated by binding of small molecules. Recently these predictions were confirmed experimentally. We plan to continue research in this area and consider metabolic systems of purine and pyrimidine biosynthesis, transport and biosynthesis of cobalt and cobalamin (vitamin B12), and metabolism of amino acids in gram-positive bacteria.
Another important area of computer genomics of prokaryotes is analysis of evolution at different time scales. At the microevolution level, it is possible to analyze hypervariable sites in closely related genomes, e.g. different strains, as such sites are characteristic of surface proteins involved in pathogenesis. Thus identification of hypervariable epitopes is important for assessing the reliability of diagnostic kits and vaccines, as well as for the development of new-generation synthetic vaccines. It should be noted that surface proteins are often specific to taxonomic groups and thus are poorly amenable to standard comparative analysis.
We have extended the Nei method for identification of hypervariable sites and successfully applied it to analysis of viral genomes. We plan to apply it to analysis of persistent pathogens such as chlamydiae, mycobacteria, streptococci, staphylococci and malarial plasmodia, to identify genes encoding surface proteins, and to find in these genes hypervariable residues that are displayed on the surface and serve as immune epitopes.
The last two problems considered in this project are related to macroevolution of bacterial genomes. The first problem is reconciliation of trees constructed for individual genes (proteins) and construction of the common species tree taking into account both errors in construction of individual trees and fundamental processes such as gene loss and duplication. The problem of tree reconciliation is known, but it has no satisfactory solution. We developed, implemented and tested a new algorithm for this problem. We plan to fine-tune the algorithm’s parameters by analysis of many gene trees, and then use the algorithm for the second problem, identification of genes whose evolutionary history involved horizontal transfer.
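For readers unfamiliar with tree reconciliation, the sketch below shows the classical mapping of a gene tree onto a species tree, in which each gene-tree node is assigned to the smallest species-tree clade containing its species and is counted as a duplication when it maps to the same clade as one of its children. This is the textbook construction, given only as background; it is not the new algorithm mentioned above. Trees are written as nested tuples with gene-tree leaves labeled by species name, which is an assumption made for illustration, and every species in the gene tree is assumed to occur in the species tree.

```python
def leaf_set(tree):
    """Set of leaf labels under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaf_set(left) | leaf_set(right)


def species_clusters(species_tree):
    """All clades of the species tree (including single leaves) as frozensets."""
    if isinstance(species_tree, str):
        return [frozenset([species_tree])]
    left, right = species_tree
    return (species_clusters(left) + species_clusters(right)
            + [leaf_set(species_tree)])


def lca(clusters, species):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in clusters if species <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Number of gene-tree nodes inferred as duplications by the mapping."""
    clusters = species_clusters(species_tree)

    def walk(node):
        # Returns (clade this node maps to, duplications in its subtree).
        if isinstance(node, str):
            return frozenset([node]), 0
        left_map, left_dups = walk(node[0])
        right_map, right_dups = walk(node[1])
        here = lca(clusters, left_map | right_map)
        # A node mapping to the same clade as a child implies a duplication.
        is_dup = 1 if here in (left_map, right_map) else 0
        return here, left_dups + right_dups + is_dup

    return walk(gene_tree)[1]


species_tree = (("A", "B"), "C")
print(count_duplications((("A", "B"), "C"), species_tree))  # congruent tree
print(count_duplications((("A", "C"), "B"), species_tree))  # conflicting tree
```

A gene tree that agrees with the species tree needs no duplications, while a conflicting topology can only be explained under this model by a duplication followed by losses; such unexplained cost is what makes discordant genes stand out.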
The importance of the latter phenomenon for bacterial genomics was understood only a few years ago as a result of analysis of multiple complete genomes. Massive horizontal transfer seems to explain the thermophily of some bacteria (Thermotoga, Aquifex, Thermoanaerobacter), and also the metabolic plasticity of archaea, for instance, Methanosarcina. Horizontal transfer plays an important, if not decisive, role in such phenomena as virulence and drug resistance of pathogenic bacteria, and also in the ability of some bacteria to metabolize toxic substances (such as phenols and components of crude oil) and to reduce heavy metals, which is important for bioremediation.
Currently, identification of genes suspected to be subject to horizontal transfer is done manually. We plan to create a program for automated identification of such genes based on the following idea: horizontally transferred genes are those genes that most strongly interfere with mapping of the gene tree for a given family to the species tree. Preliminary results show the applicability and reliability of this approach. We plan to perform a comprehensive analysis of orthologous gene families from the COG database (National Center for Biotechnology Information, National Library of Medicine, NIH, USA) and then, in collaboration with colleagues from NCBI, analyze the obtained lists of suspect genes in detail. In addition to understanding genome evolution, this analysis will contribute to the functional annotation of genomes. Thus we plan to merge the results of this study with the output of other genomic projects, in particular the mass analysis of regulatory signals described above. Indeed, in many cases horizontally transferred genes are co-regulated (for instance, this is the case with various drug resistance and heavy metal tolerance systems). Therefore, by combining the developed techniques, we should be able to study the history of the corresponding regulatory systems, which is important in practice both for avoiding the expansion of drug-resistant strains and for the creation of new strains for bioremediation.
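The idea of ranking genes by how strongly they interfere with the gene-tree-to-species-tree mapping can be illustrated with a leave-one-out sketch. Here the interference of a gene is measured as the drop in the number of gene-tree clades absent from the species tree when that gene is removed from both trees. This clade-mismatch count is a simple stand-in chosen for brevity, not the measure used in the project, and the sketch assumes one gene per species, trees as nested tuples, and hypothetical function names.

```python
def clades(tree):
    """Return (leaf set of the tree, list of clades of all internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    left_leaves, left_clades = clades(tree[0])
    right_leaves, right_clades = clades(tree[1])
    here = left_leaves | right_leaves
    return here, left_clades + right_clades + [here]


def prune(tree, leaf):
    """Remove one leaf and collapse its parent; None if nothing remains."""
    if isinstance(tree, str):
        return None if tree == leaf else tree
    left, right = prune(tree[0], leaf), prune(tree[1], leaf)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)


def discordance(gene_tree, species_tree):
    """Count gene-tree clades that are not clades of the species tree."""
    _, gene_clades = clades(gene_tree)
    species_clades = set(clades(species_tree)[1])
    return sum(1 for clade in gene_clades if clade not in species_clades)


def rank_suspects(gene_tree, species_tree):
    """Rank leaves by how much their removal reduces the discordance."""
    base = discordance(gene_tree, species_tree)
    leaves, _ = clades(gene_tree)
    scores = {
        leaf: base - discordance(prune(gene_tree, leaf),
                                 prune(species_tree, leaf))
        for leaf in leaves
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy case: the gene from species E groups with A instead of with D.
species_tree = ((("A", "B"), "C"), ("D", "E"))
gene_tree = (((("A", "E"), "B"), "C"), "D")
print(rank_suspects(gene_tree, species_tree))
```

In the toy case the misplaced gene from species E receives the highest score, because removing it makes the two trees agree. Real gene families with paralogs would need a reconciliation cost that accounts for duplications and losses rather than this simple count.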