BIO/CSC 295 2011F Bioinformatics

Assignment: Shotgun Fragmentation

This assignment is optional!

As you may have noted from the discussion of DNA sequencing techniques, one particularly popular technique is shotgun sequencing. In this technique, we start with multiple copies of the DNA to be sequenced, break each copy into lots of small segments (1000 or so nucleotides each), sequence each of those using the Sanger method, and then use a program to somehow join them together.

Writing the shotgun assembly routine is beyond the scope of this class. However, we can at least consider the effects of shotgun sequencing.

We will try two approaches. One, which we might call the pure computer science approach, is to randomly grab subsequences. That is, you should select a random starting point and a random length (within the range of valid lengths) and then extract the substring of that length starting at the selected point. A second, which we might call a hybrid approach, is to cut the sequence at particular patterns (which correspond to what Sam thinks are called restriction enzymes).

In this assignment, you will explore the strengths and weaknesses of these two approaches.

1. Write a procedure, generate_fragments(sequence, minlen, maxlen, coverage), that generates random fragments of the sequence, each between minlen and maxlen characters long (inclusive). You should generate enough fragments to reach the average coverage specified.

This procedure represents the pure computer science approach. As you know from the reading, we want to generate enough fragments to cover each base multiple times, which helps deal with the difficulty of aligning all of the fragments.

You know that you've generated enough fragments when the total length of the fragments generated is about coverage*len(sequence).

Here's a really simple example of generate_fragments in action.

>>> generate_fragments ("abcd", 1, 3, 4)
["b", "bcd", "ab", "a", "abc", "ab", "bc", "bcd"]

You will note that the fragments are all between one and three characters long, that they start at different places, and that they total at least 16 characters.
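One possible shape for this procedure is sketched below. It is only a sketch under some assumptions that the assignment does not state: the length bounds are inclusive, maxlen is clamped to the length of the sequence, the length is chosen before the starting point so that the fragment always fits, and the optional rng parameter (a hypothetical addition) exists only to make runs repeatable.

```python
import random

def generate_fragments(sequence, minlen, maxlen, coverage, rng=None):
    """Return random substrings of sequence whose total length is
    at least coverage * len(sequence)."""
    # rng is a hypothetical extra parameter, used to make runs repeatable.
    rng = rng or random.Random()
    # A fragment cannot be longer than the sequence itself.
    maxlen = min(maxlen, len(sequence))
    if not 1 <= minlen <= maxlen:
        raise ValueError("need 1 <= minlen <= maxlen and a nonempty sequence")
    target = coverage * len(sequence)
    fragments = []
    total = 0
    while total < target:
        # Pick the length first, then a start that leaves room for it.
        length = rng.randint(minlen, maxlen)
        start = rng.randint(0, len(sequence) - length)
        fragments.append(sequence[start:start + length])
        total += length
    return fragments

if __name__ == "__main__":
    print(generate_fragments("abcd", 1, 3, 4))
```

Because each start is chosen so that the fragment fits, positions near the two ends of the sequence fall inside fewer possible fragments than positions in the middle. That is worth keeping in mind when you look at coverage later.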

2. Write a procedure, cut_fragments(fragments, pattern), that builds a new list of fragments by cutting each fragment in fragments at every place that matches pattern. As the examples below show, the matched text itself is dropped, and so are any empty pieces.

This procedure is supposed to mimic what happens with restriction enzymes.

For example,

>>> cut_fragments (["alphabet", "alabama"], "a")
["lph", "bet", "l", "b", "m"]
>>> cut_fragments (["this is a longer string"], "a")
["this is ", " longer string"]

Why do we start with a list of fragments, rather than a single fragment? Because it is unlikely that one restriction enzyme alone will cut the sequence into small enough segments. Hence, we may apply cut_fragments multiple times, with a different pattern each time.

>>> tmp = cut_fragments(["this is a longer string"], "a")
>>> cut_fragments (tmp, "s")
["thi", " i", " ", " longer ", "tring"]
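The cutting behavior in these examples can be sketched with Python's built-in str.split. This is a minimal sketch, assuming that pattern is a plain, nonempty string rather than a regular expression, and that empty pieces (which appear when a fragment starts or ends with the pattern) should be thrown away.

```python
def cut_fragments(fragments, pattern):
    """Cut every fragment at each occurrence of pattern, dropping the
    matched text and any empty pieces."""
    pieces = []
    for fragment in fragments:
        # str.split removes the separator and may return empty strings;
        # it raises ValueError if pattern is the empty string.
        for piece in fragment.split(pattern):
            if piece:
                pieces.append(piece)
    return pieces

if __name__ == "__main__":
    tmp = cut_fragments(["this is a longer string"], "a")
    print(cut_fragments(tmp, "s"))
```

Dropping the matched text follows the examples in this assignment. A real restriction enzyme cuts the strand but leaves the bases of its recognition site in the resulting pieces, so you may want to think about how that difference affects coverage.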

3. Write a procedure, filter_fragments(fragments, minlen, maxlen), that takes a list of fragments and removes those that are shorter than minlen or longer than maxlen.

This procedure is supposed to deal with the important issue that we typically run Sanger sequencing only on segments of a particular length.

>>> filter_fragments(["thi", " i", " ", " longer ", "tring"], 3, 5)
["thi", "tring"]
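The filtering step is short enough to sketch as a list comprehension. The sketch assumes that both bounds are inclusive and that the order of the surviving fragments should be preserved.

```python
def filter_fragments(fragments, minlen, maxlen):
    """Keep only the fragments whose length lies between minlen and
    maxlen, inclusive, in their original order."""
    return [f for f in fragments if minlen <= len(f) <= maxlen]

if __name__ == "__main__":
    # "ac" is too short and "acgtacgt" is too long for the range 3..5.
    print(filter_fragments(["ac", "acgt", "acgtacgt"], 3, 5))
```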

4. If we were going to use these procedures for making input for a real program, we would want to make sure that they achieved what we think that they achieve: thorough coverage of the original sequence.

Write a procedure, determine_coverage(sequence, fragments), that aligns each fragment to the sequence and reports how many fragments cover each position.
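Here is one way the coverage count might look. It rests on a strong simplifying assumption: "aligning" a fragment means finding exact matches of it in the sequence, and because a fragment does not remember where it came from, every exact occurrence is counted. For a sequence with repeated regions, that choice overstates the coverage. The result is assumed to be a list with one count per position.

```python
def determine_coverage(sequence, fragments):
    """Return a list in which entry i counts the fragment occurrences
    that cover position i of sequence (exact matches only)."""
    coverage = [0] * len(sequence)
    for fragment in fragments:
        if not fragment:
            continue  # an empty fragment covers nothing
        start = sequence.find(fragment)
        while start != -1:
            for i in range(start, start + len(fragment)):
                coverage[i] += 1
            # Advance by one so that overlapping occurrences are found too.
            start = sequence.find(fragment, start + 1)
    return coverage

if __name__ == "__main__":
    print(determine_coverage("abcd", ["ab", "bcd", "b"]))
```

A position whose count is 0 was never covered. That can happen, for instance, when the fragments come from a procedure that discards the text it cuts at.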

5. Present your program and some interesting sample runs of your program that show how well the two fragmentation techniques work.

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Fri Sep 30 10:42:55 2011.
The source to the document was last modified on Fri Sep 30 10:42:50 2011.

Samuel A. Rebelsky,

Copyright © 2009-2011 Vida Praitis and Samuel A. Rebelsky. This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License. To view a copy of this license, visit or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.