Area 5: Molecular Structure and Function

Phylogenetic analyses have become critical for the analysis of comparative sequence databases, and for using these databases to make structural and functional predictions for macromolecules.  Perhaps the best example of this line of research is the set of structural and functional predictions that have been made through comparative analysis of ribosomal RNA (rRNA).  Ribosomal RNA genes are arguably the most studied sequences for the analysis of phylogenetic relationships (see reviews by Woese, 1987; Hillis and Dixon, 1991).  Our research in this area (Gutell, Ellington, Jansen, Hillis) addresses three principal questions: 1) How can we best infer secondary and tertiary structures of RNA molecules?  2) How can we use the structural and phylogenetic inferences about RNA molecules to predict function?  3) How can we use structural properties of RNA molecules to infer phylogenetic trees with greater accuracy?  We have already made significant headway on the first question by using known phylogenetic histories together with statistically based techniques, and we are beginning to obtain results on the other two questions.  The research is inherently computational, and there is much room for further computational development.

There are currently more than 10,000 complete (or nearly complete) small subunit rRNA sequences and 1,300 complete (or nearly complete) large subunit rRNA sequences.  Gutell has used this large collection extensively for comparative structure analysis to determine and refine the structural models of rRNA.  The same sequences have been aligned on the basis of sequence and structural similarity; the small subunit rRNA data matrix represents the largest set of aligned genes that spans the Tree of Life.  Moreover, the number of rRNA sequences continues to increase: on average, approximately 50 new complete rRNA sequences become available in GenBank every week.  The Gutell lab therefore continues to identify, enter, and align new rRNA sequences, adding them to its large collection of rRNA sequences and alignments.

The Gutell lab has been analyzing ribosomal RNA from a comparative perspective for the past 20 years.  During this time, Gutell has developed a Comparative RNA Web site (http://www.rna.icmb.utexas.edu/) that reports and facilitates detailed analyses of the rRNA genes.  The aligned rRNA genes offer an enormous opportunity: they form a large phylogenetic data set sampled from throughout life, and successful phylogenetic analysis of this data set is important for extending structural and functional predictions of rRNA.

The number of possible secondary structure models, composed of standard secondary structure helices with three or more consecutive G:C, A:U, and G:U basepairs, is extremely large.  For example, the approximate number of secondary structure models for 16S rRNA, a molecule with approximately 1500 nucleotides, is 4.3 x 10^393.  Given our limited knowledge of the principles of RNA structure, we are currently unable to predict the best secondary structure model with confidence from first principles.  For the same reason, we are also unable to predict tertiary interactions accurately.  Fortunately, Woese and others realized (Fox and Woese, 1975) that all RNA molecules within the same family that perform the same function (e.g., tRNA, 16S rRNA) will form the same general two- and three-dimensional structure.  This apparently simple concept allows covariation analysis to predict secondary and tertiary base-pairings for several RNA molecules, including the 16S (and 16S-like) and 23S (and 23S-like) rRNAs, from any species for which the primary sequence is known (Woese et al., 1983).  With the recent publication of crystal structures for the rRNAs of Thermus thermophilus (Wimberly et al., 2000) and Haloarcula marismortui (Ban et al., 2000), we have evaluated the accuracy of these predicted structures.  Approximately 98% of the predicted 16S and 23S rRNA basepairs are indeed present in the crystal structures.

In addition, comparative analysis has been used to independently determine several structural motifs, such as tetraloop hairpin loops, E and E-like loops, AA and AG base-pairings at the ends of helices, and frequent adenosines in unpaired positions (Gutell et al., 2000).  The comparative structure analysis that identified these relationships rests on the structure-based alignment of the individual rRNA sequences and on the development and implementation of covariation and other comparative algorithms.
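
To make the idea of covariation analysis concrete, the following is a minimal sketch in Python, not the algorithm actually used by the Gutell lab.  It assumes that covariation between two alignment columns is scored by mutual information, which is one common measure among several; the function names, the threshold of 1.0 bit, and the four-sequence toy alignment are all hypothetical and chosen only for illustration.  Columns that change in a coordinated way across sequences (here, the first and last positions, which always form a Watson-Crick pair) receive a high score, while invariant columns score zero.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def covarying_pairs(alignment, threshold=1.0):
    """Return (i, j, score) for column pairs scoring at or above the threshold.

    `alignment` is a list of equal-length aligned sequences; column
    indices are 0-based.  The threshold is an arbitrary illustrative value.
    """
    columns = list(zip(*alignment))
    hits = []
    for i, j in combinations(range(len(columns)), 2):
        score = mutual_information(columns[i], columns[j])
        if score >= threshold:
            hits.append((i, j, score))
    return sorted(hits, key=lambda hit: -hit[2])


# Hypothetical toy alignment: positions 0 and 8 vary together
# (G:C, C:G, A:U, U:A) while every other position is invariant.
toy_alignment = [
    "GACAAAGUC",
    "CACAAAGUG",
    "AACAAAGUU",
    "UACAAAGUA",
]

print(covarying_pairs(toy_alignment))  # -> [(0, 8, 2.0)]
```

In a real analysis the alignment would contain thousands of sequences, and the raw score would need to be corrected for shared phylogenetic history, since closely related sequences are not independent observations.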

The high accuracy of the predicted base-pairings in the rRNA structures validates the comparative analysis paradigm and supports the accuracy of the structure-based alignments.  In addition, the predicted structures for all of the rRNAs, including elements of rRNA sequences that have no homologous counterpart in the solved crystal structures, are likely to be accurate.  The net result is that we have excellent structural predictions for each of the rRNA sequences in our data collection.
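
The accuracy figure quoted above is, in essence, the fraction of comparatively predicted base pairs that are also observed in a crystal structure.  The sketch below shows one way such a fraction could be computed, assuming both structures are available as lists of paired positions under a shared numbering; the function name and the pair lists are hypothetical illustrations and are not data from the crystal structures.

```python
def fraction_confirmed(predicted_pairs, crystal_pairs):
    """Fraction of predicted base pairs that also occur in the crystal structure.

    Each pair is a tuple of two nucleotide positions.  Pairs are treated
    as unordered, so (1, 20) and (20, 1) count as the same base pair.
    """
    predicted = {tuple(sorted(pair)) for pair in predicted_pairs}
    observed = {tuple(sorted(pair)) for pair in crystal_pairs}
    if not predicted:
        raise ValueError("no predicted base pairs to evaluate")
    return len(predicted & observed) / len(predicted)


# Hypothetical positions, for illustration only.
predicted = [(1, 20), (2, 19), (3, 18), (4, 17), (30, 45)]
crystal = [(20, 1), (19, 2), (18, 3), (17, 4), (31, 44)]

print(fraction_confirmed(predicted, crystal))  # -> 0.8
```

This measures only how many predicted pairs are confirmed; base pairs present in the crystal structure but absent from the comparative model would need a separate count.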

There is considerable opportunity for reciprocal feedback between computational phylogenetics and the structural and functional analysis of this enormous collection of rRNA sequences.  Thorough phylogenetic analysis of the complete database requires methods for analyzing very large phylogenetic trees.  As these phylogenetic analyses are completed, we anticipate that the models and alignments can be refined to an even greater degree.  The structural analyses, in turn, provide an additional source of data for analysis, beyond the primary structure.  This information is important for phylogenetic analysis in several ways.  For one, this system may provide one of the best test cases for analyzing and understanding the interdependencies among nucleotide positions within a gene.  Furthermore, the higher-order structures may themselves be useful for inferring phylogenetic relationships, especially at the deeper levels of the tree, where primary structure analysis is least reliable.

Eventually, we envision inferring ancestral three-dimensional structures for rRNAs from throughout the Tree of Life.  These structures can then be visualized in three dimensions and used to create movies of the evolution of ribosomal RNA structure from the origin of life to the present along any given phylogenetic pathway.  Working with the scientific visualization group, we can use the inferred structures to create a database that allows exploration of the structural evolution of rRNA, which can in turn be used to study the structure-function relationships of rRNA motifs.