Area 5: Molecular Structure and Function
Phylogenetic methods have become critical for analyzing comparative sequence
databases, and for using these databases to make structural and functional
predictions for macromolecules. Perhaps
the best example of this line of research concerns the structural and functional
predictions that have been made through comparative analysis of ribosomal RNA (rRNA).
Ribosomal RNA genes are arguably the most studied sequences for the
analysis of phylogenetic relationships (see reviews by Woese, 1987; Hillis and
Dixon, 1991). Our research in this
area (Gutell, Ellington, Jansen, Hillis) addresses three principal questions: 1)
How can we best infer secondary and tertiary structures of RNA molecules?
2) How can we use the structural and phylogenetic inferences about RNA
molecules to predict function? 3)
How can we use structural properties of RNA molecules to infer phylogenetic
trees with greater accuracy? We
have already made significant headway on the first question by using known
phylogenetic histories with statistically based techniques, and we are beginning
to obtain results on the other two questions.
The research is inherently computational, and there is much room for
further computational development.
There are currently more than 10,000 complete (or nearly complete) small subunit rRNA
sequences and 1,300 complete (or nearly complete) large subunit rRNA sequences.
This large collection of sequences has been used extensively by Gutell
for comparative structure analysis to determine and refine the structural models
of rRNA. This same collection of
sequences has been aligned on the basis of sequence and structural similarity; the small
subunit rRNA data matrix represents the largest set of aligned genes that spans
the Tree of Life. Moreover, the
number of rRNA sequences continues to increase.
On average, approximately 50 new complete rRNA sequences become available
in GenBank every week. Thus,
the Gutell lab continues to identify and align new rRNA sequences and add them to
its large collection of rRNA sequences and alignments.
The Gutell lab has been analyzing ribosomal RNA from a comparative perspective for
the past 20 years. During this
time, Gutell has developed a Comparative RNA Web site (http://www.rna.icmb.utexas.edu/)
that reports and facilitates detailed analyses of the rRNA genes.
The aligned rRNA genes represent an enormous opportunity for the analysis
of a large phylogenetic dataset that represents samples from throughout life,
and successful phylogenetic analysis of this data set is important for extending
structural and functional predictions of rRNA.
The number of possible secondary structure models, composed of standard
secondary structure helices with three or more consecutive G:C, A:U, and G:U
basepairs, is extremely large. For example, the approximate number of secondary structure models for 16S rRNA, a
molecule with approximately 1500 nucleotides, is 4.3 x 10^393.
Given our limited knowledge of the principles of RNA structure, we are
currently unable to predict with confidence the best secondary structure models
from first principles. For the same reasons, we are also unable to accurately predict tertiary interactions.
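To give a sense of where this search space comes from, the following minimal Python sketch enumerates candidate helices of the kind described above: runs of three or more stacked G:C, A:U, or G:U basepairs. It is an illustration only, not the method used to obtain the figure quoted above. The function name candidate_helices and the min_loop parameter (an assumed minimum number of unpaired bases enclosed by the innermost pair) are hypothetical and do not come from the original text.

```python
# Pairing rules named in the text: G:C, A:U, and G:U (in either orientation).
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def candidate_helices(seq, min_len=3, min_loop=3):
    """Return (i, j) for every run of `min_len` stacked basepairs.

    Position i pairs with j, i+1 with j-1, and so on (0-based indices).
    `min_loop` is an assumed lower bound on the number of unpaired bases
    enclosed by the innermost pair; it is not taken from the source text.
    """
    seq = seq.upper()
    n = len(seq)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            inner_i = i + min_len - 1  # 5' side of the innermost pair
            inner_j = j - min_len + 1  # 3' side of the innermost pair
            if inner_j - inner_i - 1 < min_loop:
                continue  # the two strands would overlap or the loop is too short
            if all((seq[i + k], seq[j - k]) in PAIRS for k in range(min_len)):
                found.append((i, j))
    return found


if __name__ == "__main__":
    # Toy hairpin: three G:C pairs closing a four-base loop.
    print(candidate_helices("GGGAAAACCC"))
```

The number of candidate helices grows roughly with the square of the sequence length, and a secondary structure model is a mutually compatible subset of those helices, so the number of possible models grows combinatorially with the length of the molecule.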
Fortunately, it was realized by Woese and others (Fox and Woese, 1975)
that all RNA molecules within the same family that perform the same function
(e.g., tRNA, 16S rRNA) will form the same general two- and three-dimensional
structure. This apparently simple concept allows for covariation analysis to predict secondary and tertiary
base-pairings for several RNA molecules, including the 16S (and 16S-like) and
23S (and 23S-like) rRNAs from any species for which the primary sequence is
known (Woese et al., 1983). With the recent publication of the crystal structure solutions for the rRNAs of Thermus
thermophilus (Wimberly et al., 2000) and Haloarcula marismortui (Ban et al., 2000), we have evaluated the
accuracy of these predicted structures. Approximately 98% of the predicted 16S and 23S rRNA basepairs are indeed present in the
crystal structures. In addition,
comparative analysis has been used to independently determine several
structural motifs, such as tetraloop hairpin loops, E and E-like loops, AA and
AG base-pairings at the ends of helices, and frequent adenosines in unpaired
positions (Gutell et al., 2000). The
comparative structure analysis that has successfully identified these
relationships is based on the structure-based alignment of the individual rRNA
sequences and the development and implementation of the covariation and other
comparative algorithms.
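As an illustration of the covariation idea, the sketch below scores a pair of alignment columns by their mutual information, one commonly used covariation measure. The text above does not specify which covariation algorithms are used, so this should be read as a generic example under that assumption rather than as the method behind the results reported here. The function name mutual_information and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, col_a, col_b):
    """Mutual information (in bits) between two columns of an alignment.

    `alignment` is a list of equal-length aligned sequences. Rows with a
    gap ("-") in either column are skipped. This is one generic covariation
    score, assumed here for illustration.
    """
    pairs = [
        (row[col_a], row[col_b])
        for row in alignment
        if row[col_a] != "-" and row[col_b] != "-"
    ]
    n = len(pairs)
    if n == 0:
        return 0.0
    freq_a = Counter(a for a, _ in pairs)   # single-column counts
    freq_b = Counter(b for _, b in pairs)
    freq_ab = Counter(pairs)                # joint counts
    score = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        score += p_ab * log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return score


if __name__ == "__main__":
    # Toy alignment: columns 0 and 4 change together so that a basepair is
    # maintained in every sequence, while column 1 is invariant.
    toy = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
    print(mutual_information(toy, 0, 4))  # covarying columns
    print(mutual_information(toy, 0, 1))  # invariant column carries no signal
```

Columns that vary in a coordinated way score high, while an invariant column scores zero against any other column. This is why covariation analysis depends on a large, diverse, and accurately aligned collection of sequences.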
The high accuracy in predicting the base-pairings in the rRNA structures
validates the comparative analysis paradigm and supports the accuracy of the
structure-based alignments. In
addition, the predicted structures for all of the rRNAs, including elements
present in rRNA sequences that have no homologous component in the
solved crystal structures, are likely accurate.
The net result is that we have excellent structural predictions for each
of the rRNA sequences in our data collection.
There is considerable opportunity for reciprocal feedback between computational
phylogenetics and the structural and functional analysis of this enormous
collection of rRNA sequences. Thorough
phylogenetic analysis of the complete database requires methods for analyzing
very large phylogenetic trees. As
these phylogenetic analyses are completed, we anticipate that the models and
alignments can be refined to an even greater degree. The structural analyses, in turn, provide an additional
source of data for analysis, beyond the primary structure. This information is important for phylogenetic analysis in
several ways. This system may
provide one of the best test cases for analyzing and understanding the
interdependencies among nucleotide positions within a gene.
Furthermore, the higher-order structures may themselves be useful for
inferring phylogenetic relationships, especially at the deeper levels of the
tree where the primary structure analysis is least reliable.
Eventually, we imagine the inference of ancestral three-dimensional structures for rRNAs
from throughout the Tree of Life. These
structures can then be visualized in three dimensions, and used to create movies
of the evolution of ribosomal RNA structure from the origin of life to the
present along any given phylogenetic pathway. Working with the scientific visualization group, we can
use the inferred structures to create a database that allows the exploration
of the structural evolution of rRNA, which can then be used to study the
structural-functional relationships of rRNA motifs.