This assignment is optional!
As you may have noted from the discussion of DNA sequencing techniques, one particularly popular technique is shotgun sequencing. In this technique, we start with multiple copies of the DNA to be sequenced, break each copy into lots of small segments (1000 or so nucleotides each), sequence each of those using the Sanger method, and then use a program to somehow join them together.
Writing the shotgun assembly routine is beyond the scope of this class. However, we can at least consider the effects of shotgun sequencing.
We will try two approaches. One, which we might call the pure computer science approach, is to randomly grab subsequences. That is, you should select a random starting point and a random length (within the range of valid lengths) and then extract the substring of that length starting at the selected point. A second, which we might call a hybrid approach, is to cut the sequence at particular patterns (which correspond to what Sam thinks are called restriction enzymes).
In this assignment, you will explore the strengths and weaknesses of these two approaches.
1. Write a procedure,
generate_fragments(sequence, minlen, maxlen, coverage), that generates random fragments of sequence, each between minlen and maxlen characters long. You should generate enough fragments to reach the specified average coverage.
This procedure represents the pure computer science approach. As you know from the reading, we want to generate enough fragments to cover each base multiple times, which helps deal with the difficulty of aligning all of the fragments.
You know that you've generated enough fragments when the total length of the
fragments generated is about coverage times the length of the sequence.
Here's a really simple example of
generate_fragments in action.
>>> generate_fragments ("abcd", 1, 3, 4)
["b", "bcd", "ab", "a", "abc", "ab", "bc", "bcd"]
You will note that the fragments are all between one and three characters long, that they start at different places, and that they total at least 16 characters.
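One way the random approach might be sketched is shown below. It is a starting point under stated assumptions, not a reference solution: it picks the length first and then a start position that keeps the fragment inside the sequence, it clamps maxlen to the length of the sequence, and it stops once the total fragment length reaches coverage times the sequence length. The optional rng parameter is a hypothetical addition that makes runs repeatable.

```python
import random


def generate_fragments(sequence, minlen, maxlen, coverage, rng=None):
    """Return random substrings of sequence whose lengths total at
    least coverage * len(sequence)."""
    # Assumption: rng is an extra, optional parameter (not in the
    # assignment) so that a seeded generator can be passed in.
    if rng is None:
        rng = random.Random()
    # Assumption: a fragment can never be longer than the sequence.
    maxlen = min(maxlen, len(sequence))
    if not 1 <= minlen <= maxlen:
        raise ValueError("need 1 <= minlen <= maxlen")
    target = coverage * len(sequence)
    fragments = []
    total = 0
    while total < target:
        # randint is inclusive at both ends.
        length = rng.randint(minlen, maxlen)
        start = rng.randint(0, len(sequence) - length)
        fragments.append(sequence[start:start + length])
        total += length
    return fragments


if __name__ == "__main__":
    print(generate_fragments("abcd", 1, 3, 4))
```

Because the fragments are random, the output differs from run to run unless a seeded random.Random is supplied.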
2. Write a procedure,
cut_fragments(fragments, pattern),
that builds a new set of fragments by cutting each fragment in
fragments at the portion that matches pattern.
This procedure is supposed to mimic what happens with restriction enzymes.
>>> cut_fragments (["alphabet", "alabama"], "a")
["lph", "bet", "l", "b", "m"]
>>> cut_fragments (["this is a longer string"], "a")
["this is ", " longer string"]
Why do we start with a list of fragments, rather than a single fragment? Because it is unlikely that one restriction enzyme alone will cut the sequence into small enough segments. Hence, we may cut multiple times, using a different pattern each time.
>>> tmp = cut_fragments (["this is a longer string"], "a")
>>> cut_fragments (tmp, "s")
["thi", " i", " ", " longer ", "tring"]
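The cutting behavior in these examples could be sketched as follows. This version assumes that pattern is a literal, non-empty string (not a regular expression), that the matched text itself is discarded, and that empty pieces are dropped, since that is what the sample output suggests.

```python
def cut_fragments(fragments, pattern):
    """Cut every fragment at each occurrence of pattern and return
    the non-empty pieces, in order."""
    pieces = []
    for fragment in fragments:
        # str.split removes the matched text and raises ValueError
        # if pattern is the empty string.
        for piece in fragment.split(pattern):
            # Assumption: empty pieces (a match at either end, or two
            # adjacent matches) are dropped, as in the examples.
            if piece:
                pieces.append(piece)
    return pieces


if __name__ == "__main__":
    tmp = cut_fragments(["this is a longer string"], "a")
    print(cut_fragments(tmp, "s"))
```

A real restriction enzyme cuts within or near its recognition site rather than deleting it, so a more faithful model would keep the matched characters on one side of the cut.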
3. Write a procedure
filter_fragments(fragments, minlen, maxlen)
that takes a list of fragments and removes those that are shorter
than minlen or longer than maxlen.
This procedure is supposed to deal with the important issue that we typically run Sanger sequencing only on segments of a particular length.
>>> filter_fragments (["thi", " i", " ", " longer ", "tring"], 3, 5)
["thi", "tring"]
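A minimal sketch of the filter, assuming that both bounds are inclusive (a fragment of exactly minlen or exactly maxlen characters is kept):

```python
def filter_fragments(fragments, minlen, maxlen):
    """Keep only the fragments whose length lies in [minlen, maxlen]."""
    # Assumption: both bounds are inclusive.
    return [f for f in fragments if minlen <= len(f) <= maxlen]


if __name__ == "__main__":
    print(filter_fragments(["thi", " i", " ", " longer ", "tring"], 3, 5))
```

With those inputs only "thi" (3 characters) and "tring" (5 characters) survive; " longer " has 8 characters, and the other two are too short.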
4. If we were going to use these procedures for making input for a real program, we would want to make sure that they achieved what we think that they achieve: thorough coverage of the original sequence.
Write a procedure
that aligns each fragment to the sequence and indicates how much each
position is covered.
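A rough sketch of such a coverage check appears below. The name coverage_counts and its parameter order are hypothetical, since the assignment text does not give them. It also makes a strong simplifying assumption: each fragment is aligned at its first exact occurrence in the sequence, which ignores the ambiguity of repeated substrings and the sequencing errors a real aligner must handle.

```python
def coverage_counts(sequence, fragments):
    """Return a list giving, for each position of sequence, the
    number of fragments that cover it."""
    counts = [0] * len(sequence)
    for fragment in fragments:
        # Assumption: align at the first exact match only.
        start = sequence.find(fragment)
        if start < 0:
            # The fragment does not occur in the sequence; skip it.
            continue
        for i in range(start, start + len(fragment)):
            counts[i] += 1
    return counts


if __name__ == "__main__":
    print(coverage_counts("abcd", ["b", "bcd", "ab", "a", "abc", "ab", "bc", "bcd"]))
    print(coverage_counts("alphabet", ["lph", "bet"]))
```

The first call gives [4, 7, 4, 2], which sums to the 17 characters in those fragments. The second gives [0, 1, 1, 1, 0, 1, 1, 1]: when the matched pattern is discarded during cutting, the positions where it occurred are never covered.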
5. Present your program and some interesting sample runs of your program that show how well the two fragmentation techniques work.
I usually create these pages
on the fly, which means that I rarely
proofread them and they may contain bad grammar and incorrect details.
It also means that I tend to update them regularly (see the history for
more details). Feel free to contact me with any suggestions for changes.
This document was generated by
Siteweaver on Fri Sep 30 10:42:55 2011.
The source to the document was last modified on Fri Sep 30 10:42:50 2011.
Samuel A. Rebelsky, firstname.lastname@example.org
http://creativecommons.org/licenses/by-nc/2.5/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.