Living cells sustain and regulate themselves by producing various proteins, assembling them from smaller chemical units (amino acids) according to instructions that are encoded in the structure of molecules of deoxyribonucleic acid (DNA).
A DNA molecule is composed of a large number of ``bases,'' submolecules of four types: adenine, cytosine, guanine, and thymine. The general shape of the molecule is like a ladder, one end of which has been repeatedly twisted, so that each of the uprights is a helix. Each of the rungs of the ladder consists of a pair of bases, either with adenine at one end and thymine at the other, or with cytosine at one end and guanine at the other. Any base can occur at either end of a rung.
The directions for constructing proteins are encoded in the sequence of bases attached to one of the two uprights of this molecular ladder. (The sequence of bases on the other upright contains the same information, encoded in ``complementary'' form, like a photographic negative; cells use this complementary encoding in the process of duplicating the instructions and transporting them to a cellular workshop for protein construction.)
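The ``complementary'' encoding can be illustrated with a short Scheme sketch. The procedure names complement-base and complement-strand are made up for this illustration and are not part of the exercise. The sketch pairs bases position by position, as on the rungs of the ladder, and ignores everything else about how cells actually read the second upright.

```scheme
;; Hypothetical helper: map a base to the base found at the other end
;; of its rung (a with t, c with g).
(define complement-base
  (lambda (base)
    (cond ((char=? base #\a) #\t)
          ((char=? base #\t) #\a)
          ((char=? base #\c) #\g)
          ((char=? base #\g) #\c)
          (else base))))          ; any other character is left unchanged

;; Hypothetical helper: complement every base in a string of bases.
(define complement-strand
  (lambda (strand)
    (list->string (map complement-base (string->list strand)))))

(complement-strand "atgcca")      ; => "tacggt"
```

Applying complement-strand twice gives back the original string, which is one way to see that the two uprights carry the same information.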
Each group of three adjacent bases along one upright of a DNA molecule is a codon. A codon encodes the instruction to place some particular amino acid at a position in the protein that corresponds to the position of that codon. The mapping from codons to amino acids (the genetic code) is fixed: a given codon specifies the same amino acid wherever it occurs.
The file /home/stone/courses/scheme/data/CHROMOSOME_II.dna contains the sequence of bases found on chromosome 2 of the nematode Caenorhabditis elegans, as determined by the C. elegans Genome Project at The Sanger Centre. The first line of the file is a label, reading ``>CHROMOSOME_II''. The remaining lines contain the bases, denoted by the lower-case letters a, c, g, and t, fifty bases to a line, except that the last line has only six bases on it. At a few positions in the sequence, the base has not yet been determined. If a number of undetermined bases occur in a row, they are represented by hyphens. An isolated undetermined base is represented by the lower-case letter n.
Determine the number of known occurrences of each of the bases a, c, g, and t in this sequence. (Hint: Do not attempt to read the entire data set into one Scheme data structure -- it's too large. Use file recursion instead.)
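As a reminder of what file recursion looks like, here is a minimal sketch applied to a different and simpler task: counting the newline characters in a file. The names count-newlines and count-lines-in-file are invented for this illustration. The point is only the pattern: read one character at a time from an input port, stop at the end-of-file object, and carry the running tally along as an argument, so that the file's contents are never stored.

```scheme
;; Hypothetical example of file recursion: count newline characters
;; arriving from the input port source.
(define count-newlines
  (lambda (source)
    (let loop ((count 0))
      (let ((ch (read-char source)))
        (cond ((eof-object? ch) count)                     ; end of file: done
              ((char=? ch #\newline) (loop (+ count 1)))   ; one more line
              (else (loop count)))))))                     ; ignore other characters

;; Open the named file, run count-newlines on it, and close it again.
(define count-lines-in-file
  (lambda (file-name)
    (call-with-input-file file-name count-newlines)))
```

A call such as (count-lines-in-file "some-file.txt") uses only a fixed amount of memory regardless of the size of the file, because each character is discarded as soon as it has been examined.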
A sequence of bases that directs the construction of a protein typically begins with the three-base ``start codon'' atg and ends with one of the ``stop codons'' taa, tag, and tga.
Determine how many bases precede the first occurrence of the start codon in the CHROMOSOME_II.dna sequence and how many codons there are after this start codon up to, and including, a stop codon.
In looking for the start codon, you'll need to check every group of three adjacent bases, even groups that overlap, since the start codon can occur at any position. Once found, however, the start codon determines the ``alignment'' of the codons that follow it. Thus, in looking for the stop codon, you examine bases only in separate, non-overlapping groups of three. For instance, the sequence taa is not a stop codon if the t is the last base in one codon and the aa are the first two bases of the next. The start codon, in effect, tells you how to break the ensuing bases into codons until the next stop codon is reached.
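The difference between the two kinds of scanning can be seen on a short string held in memory. The following is only a sketch under that assumption: the procedure names are invented, and the chromosome file itself is too large to be handled as a single string. In first-start the index advances by one, so overlapping groups are examined; in codons-through-stop it advances by three, so only aligned codons are examined.

```scheme
;; Hypothetical: index of the first atg in bases, checking every
;; position (overlapping groups); #f if there is none.
(define first-start
  (lambda (bases)
    (let loop ((i 0))
      (cond ((> (+ i 3) (string-length bases)) #f)
            ((string=? (substring bases i (+ i 3)) "atg") i)
            (else (loop (+ i 1)))))))

;; Hypothetical: is this three-base string one of the stop codons?
(define stop-codon?
  (lambda (codon)
    (or (string=? codon "taa")
        (string=? codon "tag")
        (string=? codon "tga"))))

;; Hypothetical: starting at index start, step three bases at a time and
;; count codons up to and including the first stop codon; #f if none.
(define codons-through-stop
  (lambda (bases start)
    (let loop ((i start) (count 1))
      (cond ((> (+ i 3) (string-length bases)) #f)
            ((stop-codon? (substring bases i (+ i 3))) count)
            (else (loop (+ i 3) (+ count 1)))))))

;; The string below is cg atg cta agc tga cc once it is aligned.
(first-start "cgatgctaagctgacc")            ; => 2
(codons-through-stop "cgatgctaagctgacc" 5)  ; => 3
```

In the sample string, the bases at positions 6 through 8 spell taa, but they straddle the codons cta and agc, so they are not counted as a stop codon. The aligned scan begins at position 5, just past the start codon, and stops at tga, the third codon.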
This exercise is due at 9 a.m. on Friday, April 7, 2000. I'll expect to receive the source code for your program and the answers to the two questions posed above, as determined by your program. As usual, you can submit these items either by printing the files and turning in the paper copies or by sending me the files by e-mail.
This document is available on the World Wide Web as
http://www.cs.grinnell.edu/~stone/courses/scheme/exercise-3.xhtml
created March 12, 2000
last revised April 6, 2000