Translating RNA


DNA nucleotides are adenine, cytosine, guanine, and thymine, usually represented as A, C, G, and T. RNA differs from DNA in that it contains the closely related base uracil (U) instead of thymine. The base pairing rules indicate that an adenine pairs only with a thymine in the complementary strand, through two hydrogen bonds. It follows then that the nucleotide bases cytosine and guanine pair with each other, through three hydrogen bonds. A and G are called purines, and C, T, and U are called pyrimidines.
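As a small illustration of this vocabulary, the following Python sketch encodes the pairing rules and the purine/pyrimidine classes. The names (DNA_COMPLEMENT, complement, transcribe, base_class) are hypothetical and chosen only for this example.

```python
# Watson-Crick partners in DNA: A pairs with T, C pairs with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CTU")  # U replaces T in RNA


def complement(dna: str) -> str:
    """Return the base-by-base complement of a DNA strand (not reversed)."""
    return "".join(DNA_COMPLEMENT[base] for base in dna)


def transcribe(coding_strand: str) -> str:
    """Rewrite a DNA coding strand as mRNA by replacing T with U."""
    return coding_strand.replace("T", "U")


def base_class(base: str) -> str:
    """Classify a single base as 'purine' or 'pyrimidine'."""
    if base in PURINES:
        return "purine"
    if base in PYRIMIDINES:
        return "pyrimidine"
    raise ValueError(f"not a nucleotide base: {base!r}")
```

For instance, transcribe("ATGGCT") gives "AUGGCU", which is the kind of strand the rest of this discussion takes as input.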

The lexical tokens that will be considered are the nucleotide bases A, C, G, and U, since the primary concern is the translation process. One could also consider the tokens to be codons instead of nucleotides. This might prove to be a better approach, but the former approach will be used in developing this compiler, with LR(1) parsing and some context sensitivity provided by tools such as Lex and Yacc. It should be noted that at this point we are dealing with mRNA and not hnRNA (which has to undergo post-transcriptional modification before it can be translated). We therefore have no need to formally specify intron splicing mechanisms. See mRNA.l for the Lex specifications.
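The two tokenisation choices can be sketched in Python as follows. This is not the content of mRNA.l; it is a minimal stand-in that assumes whitespace is ignored and that lower-case input is accepted, and the function names scan and group_codons are hypothetical.

```python
from typing import Iterable, Iterator

NUCLEOTIDES = frozenset("ACGU")


def scan(mrna: str) -> Iterator[str]:
    """Yield one token per nucleotide base, skipping whitespace."""
    for offset, char in enumerate(mrna):
        if char.isspace():
            continue
        base = char.upper()
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected character {char!r} at offset {offset}")
        yield base


def group_codons(tokens: Iterable[str]) -> Iterator[str]:
    """Alternative tokenisation: group nucleotide tokens into triplets."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == 3:
            yield "".join(buffer)
            buffer = []
    if buffer:
        raise ValueError("strand does not end on a codon boundary")
```

With nucleotide tokens the parser does the grouping into codons itself; with group_codons that work moves into the scanner and the grammar would be written over 64 codon tokens instead.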


Once scanning is complete, the rest of the compilation process involves taking several things into consideration. Even though we are dealing with mRNA and not hnRNA (intron splicing is a problem in itself which will have to be dealt with by another transducer specifically constructed for that purpose), we have to consider the fact that the amino acid chains assume a 3D configuration and this is how they interact with other proteins. Protein secondary structure is derived from several sources: The sequence of the mRNA strand has an important role in determining primary peptide chain structure. The secondary structure of the mRNA strand also plays a definitive role. The structures of the various translatory mechanisms (ribosomes, tRNA, etc.) help in resolving the protein secondary structure. Finally, as the protein is being translated, the part that has undergone ribosomal processing takes on a 3D configuration and determines the final structure of the protein.

The idea is to come up with a syntax-directed translation that can produce code (polypeptide chains) by (i) looking at the mRNA sequence and determining the amino acid chain, (ii) looking at the parse trees of the mRNA sequence and using that information during the encoding, (iii) taking into account several factors such as tRNA and rRNA structure and the wobble hypothesis, and perhaps performing some sort of ``error checking'' to account for mutations, and (iv) performing post-translational modifications as part of a ``global optimisation'' process.

It is not practical to do all of this in a single step. Rather, a multipass approach will have to be used. The first pass will examine a linear strand of mRNA and produce the corresponding polypeptide. This can be accomplished by specifying a formal grammar for translation.

A formal grammar for mRNA translation

The following set of productions describes how an amino acid chain can be obtained from an mRNA strand:

protein --> mRNA

mRNA --> untranslated_region translated_region untranslated_region

translated_region --> start_codon amino_acid_chain terminator

start_codon --> met

amino_acid_chain --> amino_acid_chain codon | null

untranslated_region --> untranslated_region untranslated_codon | null

codon --> hydrophobic_side_chain | charged_side_chain | polar_side_chain | gly

hydrophobic_side_chain --> ala | val | leu | ile | phe | pro | met

charged_side_chain --> asp | glu | lys | arg

polar_side_chain --> ser | tyr | cys | asn | gln | his | thr | trp

terminator --> ochre | opal | amber

untranslated_codon --> leu | phe | cys | trp | tyr | val | gly | ala | glu | asp | pro |
                       arg | his | gln | ser | thr | ile | asn | lys | ochre | opal | amber

lys --> A A purine        asn --> A A pyrimidine           ile --> A UR pyrimidine | A UR A
thr --> A C base          met --> A UR G                   ser --> A G pyrimidine | UR C base
gln --> C A purine        his --> C A pyrimidine           arg --> C G base | A G purine
pro --> C C base          asp --> G A pyrimidine           glu --> G A purine
ala --> G C base          gly --> G G base                 val --> G UR base
tyr --> UR A pyrimidine   trp --> UR G G                   cys --> UR G pyrimidine
phe --> UR UR pyrimidine  leu --> UR UR purine | C UR base

amber --> UR A G          ochre --> UR A A                 opal --> UR G A

base --> purine | pyrimidine 

pyrimidine --> C | UR       purine --> A | G

The file mRNA.y contains the Yacc specifications for the above grammar rules.
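To make the first pass concrete, here is a minimal Python sketch of a translator driven by the codon productions. It is not the content of mRNA.y, and the names PRODUCTIONS, CODON_TABLE, and translate are hypothetical. The UR token is written as U, and the codon table covers all 64 triplets of the standard genetic code, including AUA for ile and the UC-prefixed codons for ser. As in the grammar, the strand is read in triplets from its first base, the first met opens the translated region, and the first terminator after it closes that region. As a simplification, the trailing untranslated region is not checked.

```python
from typing import List

PURINE = "AG"
PYRIMIDINE = "CU"
BASE = PURINE + PYRIMIDINE

# name -> alternatives of (first base, second base, allowed third bases)
PRODUCTIONS = {
    "lys": [("A", "A", PURINE)], "asn": [("A", "A", PYRIMIDINE)],
    "ile": [("A", "U", PYRIMIDINE), ("A", "U", "A")],
    "thr": [("A", "C", BASE)], "met": [("A", "U", "G")],
    "ser": [("A", "G", PYRIMIDINE), ("U", "C", BASE)],
    "gln": [("C", "A", PURINE)], "his": [("C", "A", PYRIMIDINE)],
    "arg": [("C", "G", BASE), ("A", "G", PURINE)],
    "pro": [("C", "C", BASE)], "asp": [("G", "A", PYRIMIDINE)],
    "glu": [("G", "A", PURINE)], "ala": [("G", "C", BASE)],
    "gly": [("G", "G", BASE)], "val": [("G", "U", BASE)],
    "tyr": [("U", "A", PYRIMIDINE)], "trp": [("U", "G", "G")],
    "cys": [("U", "G", PYRIMIDINE)], "phe": [("U", "U", PYRIMIDINE)],
    "leu": [("U", "U", PURINE), ("C", "U", BASE)],
    "amber": [("U", "A", "G")], "ochre": [("U", "A", "A")],
    "opal": [("U", "G", "A")],
}

# Expand the productions into a flat triplet -> name lookup.
CODON_TABLE = {
    first + second + third: name
    for name, alternatives in PRODUCTIONS.items()
    for first, second, thirds in alternatives
    for third in thirds
}

TERMINATORS = {"amber", "ochre", "opal"}


def translate(mrna: str) -> List[str]:
    """Return the amino acid chain of the translated region, met included."""
    if len(mrna) % 3:
        raise ValueError("strand length is not a multiple of three")
    # A triplet outside the table (e.g. one containing T) raises KeyError.
    names = [CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna), 3)]
    if "met" not in names:
        raise ValueError("no start codon")
    chain = ["met"]
    for name in names[names.index("met") + 1:]:
        if name in TERMINATORS:
            return chain
        chain.append(name)
    raise ValueError("no terminator after the start codon")
```

For example, translate("GCUAUGGCUAAAUAAGGG") skips the leading GCU as untranslated, starts at AUG, and stops at the ochre codon UAA, giving the chain met, ala, lys.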

What Next?

A representation (data structure) for the amino acids involved is essential to manage their interactions. After the first pass of the compiler, an abstract syntax tree (AST) of amino acids in the form of a linear linked list is obtained. A ``symbol table'' will have to be developed to account for the number and type of tRNA molecules, ribosomes, and other organelles involved in translation. A study of how amino acids interact, and data provided by techniques such as x-ray diffraction and NMR, will enable us to add more semantic actions to the parser so we can change the structure of the linked list to reflect protein secondary structure (spatial relationships between amino acids). This step could be considered analogous to ``type matching'' in a programming language compiler. Further, a greater level of understanding of amino acid secondary structure interaction will possibly enable us to decipher tertiary and 3D protein structure, thus completing a compiler for mRNA translation.
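One possible shape for the linked list and a first, very small ``symbol table'' is sketched below in Python. Everything here is hypothetical: the Residue class, the build_chain and walk helpers, and the trna_demand table, which assumes only that each residue in the chain is delivered by one tRNA carrying that amino acid.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Residue:
    """One node of the linear chain produced by the first pass."""
    name: str
    next_residue: Optional["Residue"] = None


def build_chain(names: Iterable[str]) -> Optional[Residue]:
    """Link amino acid names into a singly linked list; return its head."""
    head: Optional[Residue] = None
    tail: Optional[Residue] = None
    for name in names:
        node = Residue(name)
        if tail is None:
            head = node
        else:
            tail.next_residue = node
        tail = node
    return head


def walk(head: Optional[Residue]) -> Iterator[str]:
    """Yield residue names from the head of the chain to its end."""
    node = head
    while node is not None:
        yield node.name
        node = node.next_residue


def trna_demand(head: Optional[Residue]) -> Counter:
    """Count how many tRNAs of each amino acid type the chain consumes."""
    return Counter(walk(head))
```

Later passes that model secondary structure would need richer links than next_residue (for example, references between residues that are close in space), which is where the list would stop being linear.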

Genes, Macromolecules, -&- Computing || Pseudointellectual ramblings || Ram Samudrala ||