Years of biological experiments have produced descriptions of what
occurs during the genetic replication, transcription, and
translation processes. In translation for instance, molecular
biologists have identified key regions upstream and downstream of
the initiation codon that affect the ribosome's ability to
initiate translation and the rate at which translation initiation
occurs. But the specific effects of base composition and distance
of key bases from the initiation site is not completely understood
and has not been mathematically quantified. Development of a
mathematical model that describes the regulatory regions
on mRNA sequences which control ribosomal attachment and the rate
of translation initiation, would allow the construction of optimal
translation
initiation sites. These optimal sites can be used in transgenic
protein production (using an organism to produce proteins foreign
to that organism's genome), increasing the expression of reporter
proteins, and regulating the expression of proteins
useful for various bioengineering applications.
Compiling a set of viable regulatory sequences would prove
experimentally intractable. If we limited our search of viable
translation regulatory sites to sixty base sequences, we would
examine at least 58^4 sequences (assuming only the first
base of the initiation codon is variable). This is not a viable
option for biologists or biotechnologists. Developing a
mathematical framework that correlates base composition and base
location in regulatory sequences with corresponding genetic
regulatory response will provide a mathematically detailed
understanding of genetic regulation, produce a tool for optimizing
sequences involved in genetic regulatory control, and contribute
to the understanding of genetic networks.
Informational analysis of genetic sequences has provided
significant insight into parallels between the genetic process and
information processing systems used in the field of communication
engineering. Of particular interest are the results from
Battail[Battail, 1997],
Schneider et al., [Schneider, 1997, Schneider et al., 1986] and Eigen
[Eigen, 1993]. Drawing from their work and previous work in
protein annotation and gene identification, we can theorize that
the transmission of genetic information can be viewed as a
biological, cellular communication system that employs some method
of coding to recognize valid information regions and to correct
for "transmission" errors.
Using information theory,
coding theory specifically, the translation of messenger RNA
(mRNA) into amino acid sequences is functionally paralleled to the
decoding of noisy, convolutionally encoded parity streams. The
ribosome is modeled as a table-based convolutional decoder [May et al.,
2000; May et al., 2002].
Although one does not know the exact mechanism employed by the
genetic decoder, by analyzing key elements involved in initiating
protein translation, it is hoped that we will gain insight into
possible decoding schemes used in the initiation of translation in
prokaryotic organisms. The key elements considered are: the 3'
end of the 16S ribosomal RNA, the common features of bacterial
ribosomal binding sites (such as the existence and location of the
Shine-Dalgarno sequence), and RNA/DNA base-pairing principles.
This work presents an evolutionary computing method for the design of
optimal table-based convolutional coding models for prokaryotic
translation initiation sites using Escherichia coli K-12 as
the model organism.
We use evolutionary computing methods such as genetic algorithms (GAs)
to search for table-based
convolutional codes whose gmasks recognize individual mRNA leader
sequences. The search space consists of all possible (n=3D3,
k=3D1, m=3D4) convolutional codes or individuals. The fitness of =
each
individual in the population (a set of potential solutions) is
based on the syndrome values produced when the code's gmasks are
applied to the mRNA parity sequence. An all zero syndrome value
indicates that no errors within the code's error detection
capability occurred. In the GA approach, random selection and target sampling rates
are used to select highly fit individuals for reproduction. New
populations are created using parameterized uniform crossover.
Mutation is used to preserve population diversity and elitism
ensures that the most fit solution is not discarded.
We explore and compare several categories of
error-control codes, including: horizontal, vertical, equal and
unequal error protection (UEP) codes. Results show that UEP code
models recognize the non-random and Shine-Dalgarno domain of mRNA
leaders better than equal error protection models. Codes whose
decoding masks (gmasks) have high similarity to the 3' end of the
16S ribosomal RNA (rRNA) were discovered. Additional results are
presented and the need for effective iterative methods for biological
coding theory applications such as ours are discussed.