Use of Iterative Methods in Biological Coding Theory: Applications

Elebeoba E. May
Sandia National Laboratories, P. O. Box 5800 MS 0310, Albuquerque, NM 87185-0310 USA

Mladen A. Vouk, Donald L. Bitzer, and David I. Rosnick
Computer Science Department, North Carolina State University, Raleigh, NC 27695 USA


Abstract


Years of biological experiments have produced descriptions of what occurs during the genetic replication, transcription, and translation processes. In translation for instance, molecular biologists have identified key regions upstream and downstream of the initiation codon that affect the ribosome's ability to initiate translation and the rate at which translation initiation occurs. But the specific effects of base composition and distance of key bases from the initiation site is not completely understood and has not been mathematically quantified. Development of a mathematical model that describes the regulatory regions on mRNA sequences which control ribosomal attachment and the rate of translation initiation, would allow the construction of optimal translation initiation sites. These optimal sites can be used in transgenic protein production (using an organism to produce proteins foreign to that organism's genome), increasing the expression of reporter proteins, and regulating the expression of proteins useful for various bioengineering applications.

Compiling a set of viable regulatory sequences would prove experimentally intractable. If we limited our search of viable translation regulatory sites to sixty base sequences, we would examine at least 58^4 sequences (assuming only the first base of the initiation codon is variable). This is not a viable option for biologists or biotechnologists. Developing a mathematical framework that correlates base composition and base location in regulatory sequences with corresponding genetic regulatory response will provide a mathematically detailed understanding of genetic regulation, produce a tool for optimizing sequences involved in genetic regulatory control, and contribute to the understanding of genetic networks.

Informational analysis of genetic sequences has provided significant insight into parallels between the genetic process and information processing systems used in the field of communication engineering. Of particular interest are the results from Battail[Battail, 1997], Schneider et al., [Schneider, 1997, Schneider et al., 1986] and Eigen [Eigen, 1993]. Drawing from their work and previous work in protein annotation and gene identification, we can theorize that the transmission of genetic information can be viewed as a biological, cellular communication system that employs some method of coding to recognize valid information regions and to correct for "transmission" errors. Using information theory, coding theory specifically, the translation of messenger RNA (mRNA) into amino acid sequences is functionally paralleled to the decoding of noisy, convolutionally encoded parity streams. The ribosome is modeled as a table-based convolutional decoder [May et al., 2000; May et al., 2002]. Although one does not know the exact mechanism employed by the genetic decoder, by analyzing key elements involved in initiating protein translation, it is hoped that we will gain insight into possible decoding schemes used in the initiation of translation in prokaryotic organisms. The key elements considered are: the 3' end of the 16S ribosomal RNA, the common features of bacterial ribosomal binding sites (such as the existence and location of the Shine-Dalgarno sequence), and RNA/DNA base-pairing principles. This work presents an evolutionary computing method for the design of optimal table-based convolutional coding models for prokaryotic translation initiation sites using Escherichia coli K-12 as the model organism.

We use evolutionary computing methods such as genetic algorithms (GAs) to search for table-based convolutional codes whose gmasks recognize individual mRNA leader sequences. The search space consists of all possible (n=3D3, k=3D1, m=3D4) convolutional codes or individuals. The fitness of = each individual in the population (a set of potential solutions) is based on the syndrome values produced when the code's gmasks are applied to the mRNA parity sequence. An all zero syndrome value indicates that no errors within the code's error detection capability occurred. In the GA approach, random selection and target sampling rates are used to select highly fit individuals for reproduction. New populations are created using parameterized uniform crossover. Mutation is used to preserve population diversity and elitism ensures that the most fit solution is not discarded. We explore and compare several categories of error-control codes, including: horizontal, vertical, equal and unequal error protection (UEP) codes. Results show that UEP code models recognize the non-random and Shine-Dalgarno domain of mRNA leaders better than equal error protection models. Codes whose decoding masks (gmasks) have high similarity to the 3' end of the 16S ribosomal RNA (rRNA) were discovered. Additional results are presented and the need for effective iterative methods for biological coding theory applications such as ours are discussed.