Comparative Genomics of Bacteria


Comparative Genomics and Metagenomics: Models, Algorithms and Large Scale Analysis; Nanotechnologies for Selective Transport

Tech Area / Field

  • BIO-CGM/Cytology, Genetics and Molecular Biology/Biotechnology

8 Project completed

Registration date

Completion date

Senior Project Manager
Melnikov V G

Leading Institute
Institute of Strategic Stability, Russia, Moscow

Supporting institutes

  • Russian Academy of Sciences / Institute of Problems of Information Transmission, Russia, Moscow


  • INRA-UEPSD Bat 405, France, Jouy en Josas

  • Forschungszentrum für Umwelt und Gesundheit GmbH (GSF) / Institute for Bioinformatics, Germany, Neuherberg

Project summary

The number of completely sequenced bacterial genomes approaches one thousand and continues to grow. In addition, high-throughput methods such as expression analysis with oligonucleotide arrays, the study of protein-DNA interactions using ChIP-on-chip technology, and mass spectrometry-based methods for the analysis of proteomes generate large amounts of diverse functional data.

This creates a challenge and an opportunity for biologists and, in particular, bioinformaticians. The challenge is to annotate these genomes, to extract biological knowledge from nucleotide sequences and noisy transcriptomic, proteomic and metabolomic data, and, eventually, to develop methods for characterizing properties of organisms from their genomes. The opportunity is that comparative genomic methods allow one to perform functional annotation and initial metabolic reconstruction given multiple genome sequences, whereas emerging systems biology approaches use the functional data to create models of metabolic and regulatory subsystems. Scientifically, the most exciting components of such studies are probably, first, the reconstruction of the evolutionary events that formed the extant genomes and regulatory and metabolic systems; and second, the simultaneous analysis of all major functional systems (transcription and its regulation, translation, protein-protein interactions, metabolic reactions), allowing for a non-reductionist description of living organisms.

Our previous, successfully completed project ISTC 2766 was aimed at developing algorithms for large-scale analysis of regulatory sites, detailed analysis of selected regulatory systems, models of regulation by attenuation, and modeling of protein family and species evolution. The results were published in 91 articles; see the list below. These results showed that our methods and models can be extended to cope with the data explosion and to utilize new types of data.

In particular, we plan to analyze metagenomes. Under one understanding of this term, a collection of random sequence fragments from environmental samples, we will develop methods for characterizing the metabolic and regulatory potential of such samples. Under the second, we will compare genome complements of closely related strains (“supragenomes”), aiming at a description of functional cores (genes present in all genomes) and periphery (genes that are strain-specific and thus determine unique properties of strains such as pathogenicity, virulence, and metabolic capabilities).
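The core/periphery distinction defined above can be illustrated with plain set operations. The following is a minimal sketch, assuming each strain has already been reduced to a set of gene-family identifiers (the function name `core_and_periphery`, the strain names and the gene labels are all hypothetical, and real analyses would first need an orthology assignment step that is not shown here).

```python
from typing import Dict, Set, Tuple


def core_and_periphery(
    strain_genes: Dict[str, Set[str]],
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Split gene families into a shared core and a strain-specific periphery.

    core      : families present in every strain
    periphery : for each strain, families found in that strain only
    """
    if not strain_genes:
        return set(), {}

    # Core: intersection of all gene sets.
    core = set.intersection(*strain_genes.values())

    # Periphery: genes of a strain that occur in no other strain.
    periphery = {}
    for strain, genes in strain_genes.items():
        others = set().union(
            *(g for s, g in strain_genes.items() if s != strain)
        )
        periphery[strain] = genes - others
    return core, periphery


# Hypothetical toy input: three strains, five gene families.
strains = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g2", "g5"},
}
core, periphery = core_and_periphery(strains)
print(sorted(core))  # ['g1', 'g2']
print({s: sorted(g) for s, g in periphery.items()})
```

Genes shared by some but not all strains fall into neither category under this strict reading of the definitions; whether to report them separately is a design choice left open here.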

We will continue comparative analysis of regulatory systems, paying special attention to complex regulatory systems involving multiple regulators (respiration, nitrogen utilization, etc.). We will attempt not only to describe the present state of these systems, but also to reconstruct their evolution. We will link comparative genomic studies to the analysis of expression arrays with the aim of distinguishing between direct transcriptional regulation (regulons), regulation via cascades of transcription factors (modulons), and other, indirect modes of regulation (stimulons).

On the other hand, we will continue to develop tools for the large-scale analysis of regulation. To this end, we will continue our collaboration with the Lawrence Berkeley National Laboratory (USA) in the RegTransBase project. At present, this database contains data on about 650 regulators participating in about 6500 regulatory events in 155 prokaryotes. We will continue compiling these data and use them to generate recognition profiles (currently more than 100). Further, we will develop a set of Internet tools for the comparative analysis of regulation using these profiles, and then a semi-automated procedure for generating such profiles (specific for taxa and metabolic/functional subsystems). Such tools will allow individuals, labs and commercial units to pipeline the identification of regions involved in gene regulation in a range of studies, without the need for consultation or outsourcing.
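To make the notion of a recognition profile concrete, the sketch below builds a log-odds position weight matrix from a few aligned binding sites and scans a sequence for candidate sites. This is a generic textbook construction, not necessarily the procedure used in the project or in RegTransBase; the uniform background, the pseudocount of 1, the threshold, the function names and the toy sequences are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

ALPHABET = "ACGT"  # input is assumed to be upper-case DNA without ambiguity codes


def build_profile(
    sites: List[str], pseudocount: float = 1.0, background: float = 0.25
) -> List[Dict[str, float]]:
    """Build a log-odds profile (one dict per position) from aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    profile = []
    for pos in range(width):
        column = [s[pos] for s in sites]
        profile.append({
            base: math.log2(
                ((column.count(base) + pseudocount) / total) / background
            )
            for base in ALPHABET
        })
    return profile


def score(profile: List[Dict[str, float]], window: str) -> float:
    """Sum the positional log-odds weights for one candidate site."""
    return sum(col[base] for col, base in zip(profile, window))


def scan(
    profile: List[Dict[str, float]], sequence: str, threshold: float
) -> List[Tuple[int, str, float]]:
    """Return (start, site, score) for every window scoring >= threshold."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        s = score(profile, window)
        if s >= threshold:
            hits.append((start, window, s))
    return hits


# Hypothetical training sites and upstream region.
sites = ["TGTGA", "TGTGA", "TGTGA", "TGAGA"]
profile = build_profile(sites)
hits = scan(profile, "AAATGTGACCC", threshold=5.0)
print(hits)
```

In practice the threshold would be calibrated per profile, and only one strand is scanned here; a fuller tool would also scan the reverse complement.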

We will attempt to describe the basic evolutionary events shaping the regulons (regulon expansion/contraction; duplication of regulators and regulated genes; horizontal transfer; changes in regulator specificity towards co-factors; co-evolution of regulators and their DNA binding motifs) and to estimate their prevalence. It is well known that the fraction of transcription factors in a genome grows linearly with the genome size. We will study the distribution of paralogous transcription factors in genomes in order to explain this observation (preliminary observations seem to indicate that it is mainly due to explosions in the size of a few transcription factor families rather than the introduction of new families or a uniform increase in all families). In addition, we will systematically analyze correlated changes in the amino acid sequences of the transcription-factor DNA-binding domains and the DNA motifs bound by these factors, and identify the branches of the factor phylogenetic tree along which the motifs have changed.

We will continue to study RNA-based regulatory systems. In particular, we will complete the characterization of the T-box system regulating amino acid metabolism (mainly in Firmicutes). The preliminary results show that the evolution of this regulon is characterized by frequent lineage-specific expansions, duplications and changes in specificity. We will also consider other systems, in particular, classic attenuation of selected amino acids in actinobacteria, Rho-mediated translation regulation of amino acids (e.g., cysteine) in actinobacteria, translation regulation with the originally discovered LEU-element in actinobacteria, the T-box system of amino acid metabolism regulation in actinobacteria and cyanobacteria, riboswitch regulatory systems in various bacteria, as well as regulation in chloroplasts, in particular, RNA editing and translation delay of genes involved in photosynthesis.

In collaboration with Rutgers University (USA) we will study regulation in phages. This is a virtually uncharted area of bioinformatics, whereas experimental studies have been restricted to a very small number of traditional subjects (lambda, T4, T5, T7). We will also study bacterial defense mechanisms. This is a well-researched area in experimental molecular biology and bacteriology, but bioinformatics analyses of such systems, and especially of their regulation, are scarce.

We will complete and extend modeling of regulation systems of amino acid related substrate biosynthesis.


In collaboration with colleagues from INRA (France) we will perform functional annotation of approximately 2Mb of metagenomic data from the human gut bacterial community (preliminary analysis demonstrated that it is dominated by Bacteroides spp.). This study is important for understanding the metabolic capabilities of human gut microflora and, eventually, its influence on human health.

In collaboration with LBNL (USA) we will develop a system for automated identification of candidate binding sites of known transcriptional regulators as well as new potential regulatory motifs. This tool will be available to the community via the Internet, both for automated annotation of newly sequenced genomes and for in-depth analysis of particular genomes and metabolic and regulatory systems.

We will describe the evolution of several complex regulatory systems, including changes in regulons, re-wiring of regulatory cascades, and interactions with indirect regulatory systems. In particular, we will analyze the regulation of respiration (NarL/NarP, FNR, ArcA transcription factors), compare the results of the genomic analysis with the expression data from various bacteria, deduce the regulatory cascades, their evolution and their influence on gene expression, and thus characterize this important physiological system. Similarly, we will characterize the regulation of nitrogen assimilation in plant-symbiont alpha-proteobacteria (Rhizobiaceae and related taxa). The practical relevance of this study lies in the fact that nitrogen availability for plants is one of the main limitations in agriculture. Finally, we will describe the evolutionary history of the T-box regulons and, in the course of this work, identify new genes involved in amino acid metabolism.

We will describe co-evolution of transcription factors and their binding motifs in several families of transcription factors. The initial data for this will be generated by systematic comparative genomic analysis of genes regulated by transcription factors from several families (LacI, FNR/CRP, FUR, Rrf2 etc.). This will be an important step towards understanding of the rules of protein-DNA interactions. We will describe the evolutionary dynamics of transcription factor families, analyze genome-specific expansions and link these expansions to the physiological and metabolic properties of bacteria.

We will describe the regulation of early and late genes in several recently sequenced phages of Thermus thermophilus and Escherichia coli. These results will be supplied to a collaborating experimental group for verification.

The comparative genomic analysis of bacterial defense mechanisms (toxin-antitoxin, restriction-modification, microcin) will result in identification of new systems and description of their regulation. Modeling of these systems will lead to better understanding of bacterial immunity, which is not only important per se, but also has implications for ecology of bacterial communities, including pathogenic ones, and for drug development.

We will characterize the genome “periphery” (strain-specific genes) in several groups of strains and close species (“supergenomes”), including groups containing both pathogens and commensals (E. coli, B. cereus/anthracis). This should lead to better characterization of the imprint of the pathogenic lifestyle on the metabolic capabilities of strains. We will also apply the same analysis to other strains, in particular, those involved in dairy production (Streptococcus thermophilus) and nitrogen fixers forming symbioses with plants. We will test the hypothesis that supergenome cores are more conserved than the periphery, both in terms of presence/absence and in terms of evolutionary rate. See the list of our 2004-2007 papers below.