In previous labs you have been introduced to the motivation for and some simple methods of sorting lists. Applications exist in which large volumes of random data are not useful unless we can order them, and the algorithm you learned before can be relatively slow in dealing with large amounts of randomly ordered data.
Before Class: Read Section 9.3 in your text to familiarize yourself with the idea behind Quicksort and the algorithm it uses.
Introduction: Traditional algorithms such as insertion sort can take O(n²) operations or more, which is very time consuming for large lists. Used wisely, recursion can cut the number of operations significantly, down to O(n*logn). This lab will introduce two methods of doing so, Quicksort and Merge sort. The algorithms have some similarities, but they have an important qualitative difference.
Goals: To gain an understanding of the divide-and-conquer philosophy of sorting and its implementation. Also, to learn about best and worst case scenarios for different sorting algorithms. We will also investigate appropriate problems in which to implement one algorithm over another. We will use ideas presented in the first lab on sorting during this lab.
Divide-and-Conquer Sorting:
Quicksort
You should be familiar with Quicksort from your reading. Quicksort puts
elements of the list in particular bins depending on whether they are
larger or smaller than a pivot number, conventionally chosen as the one at
the front of the list. It does this over and over to each list until it
has a sorted set of elements, then it appends these elements into sorted
lists until the entire list is recovered in its new sorted form.
Merge sort
Merge sort works slightly differently. Instead of throwing numbers into
one of two bins depending on a comparison, Merge sort divides its list in
half over and over until it is left with single elements, each a sorted list of one. Then it uses
a Merge procedure, like the one shown in Program 8-10, page 445 of the
text, to merge the elements into sorted pairs, the pairs into sorted lists
of 4, and so on until the entire list is recovered, again in its new sorted
form.
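To make the merging step concrete, here is a minimal sketch of how two sorted lists of numbers might be combined. It is not the text's Merge procedure from Program 8-10, which may be written differently; the name merge-sorted is hypothetical, and the sketch assumes both inputs are already in ascending order.

```scheme
;; Hypothetical sketch, not the text's Program 8-10.
;; Assumes a and b are lists of numbers, each already sorted ascending.
(define merge-sorted
  (lambda (a b)
    (cond ((null? a) b)                  ; nothing left in a: the rest of b is the answer
          ((null? b) a)                  ; nothing left in b: the rest of a is the answer
          ((<= (car a) (car b))          ; smaller front element goes first
           (cons (car a) (merge-sorted (cdr a) b)))
          (else
           (cons (car b) (merge-sorted a (cdr b)))))))

(merge-sorted '(1 4 9) '(2 3 10))        ; => (1 2 3 4 9 10)
```

Each call looks only at the front of each list, so merging takes a number of comparisons proportional to the combined length of the two lists.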
The algorithms are similar in motivation; we see that both use a form of partitioning, but they differ in the way the re-ordering is actually implemented. The Quicksort algorithm as given in the text follows:
(define quicksort
  (lambda (x compare)
    ;; x is the list of numbers to be sorted
    ;; compare is a comparison procedure, returning
    ;; 'less-than, 'equal-to, or 'greater-than
    (if (null? x)
        x
        (let* ((pivot (car x))
               (smaller '())
               (equal '())
               (larger '())
               (classify
                (lambda (item)
                  (case (compare item pivot)
                    ((less-than) (set! smaller (cons item smaller)))
                    ((equal-to) (set! equal (cons item equal)))
                    ((greater-than) (set! larger (cons item larger)))))))
          (for-each classify x)
          ; remove the following 2-line comment to trace this code
          ; (display (format "smaller: ~a equal: ~a larger ~a ~%"
          ;                  smaller equal larger))
          (append (quicksort smaller compare)
                  equal
                  (quicksort larger compare))))))
1. Copy the above code into a file, and explain how it works by inserting comments where appropriate.
The code in this lab requires several helper procedures, which you need
to load into Chez Scheme with the command:
> (load "/home/barkley/scheme/sorting-helpers.ss")
or
> (load "/home/walker/153/labs/sorting-helpers.ss")
The helper procedures will be introduced as we go.
As noted in the comments above, this implementation of Quicksort takes
a comparison procedure as a parameter. Since the examples that follow
focus on the sorting of numbers, we will use the text's compare-numbers
procedure (page 426), which returns the result
of comparing two numbers. Also available is a compare-strings
helper; others could be coded fairly easily, depending on what kind of sort
you wish to do. This feature of this particular implementation of
Quicksort gives it a lot of flexibility.
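As an illustration of what such a comparison procedure might look like, here is a minimal sketch. It is a stand-in written for this example, not the text's compare-numbers from page 426 or the version in sorting-helpers.ss; the names my-compare-numbers and my-compare-descending are hypothetical. The only thing taken from the Quicksort code above is the convention of returning 'less-than, 'equal-to, or 'greater-than.

```scheme
;; Hypothetical stand-in for the text's compare-numbers;
;; the real helper may be written differently.
(define my-compare-numbers
  (lambda (a b)
    (cond ((< a b) 'less-than)
          ((= a b) 'equal-to)
          (else 'greater-than))))

;; Swapping the arguments reverses the ordering, so a sort
;; that uses this procedure arranges numbers from largest to smallest.
(define my-compare-descending
  (lambda (a b)
    (my-compare-numbers b a)))

(my-compare-numbers 2 7)       ; => less-than
(my-compare-descending 2 7)    ; => greater-than
```

Passing my-compare-descending to the quicksort procedure above in place of compare-numbers should produce a list in descending order, with no change to the sorting code itself.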
2. Now run the Quicksort algorithm on a list of ten or so numbers to make sure it works as it should. For example, you might try
(quicksort '(3 1 4 1 5 9 2 6 5 3) compare-numbers)
3. Using the specifications given above and assuming you have a helper
procedure identical to the merge-lists procedure on page 445
of the text, write a Merge sort program.
4. You loaded merge-lists in the sorting-helpers file, so now test your
merge sort program with the same list you used to test Quicksort.
Traditional versus Partition Sorting:
To understand the real difference between sorting algorithms, we should do some timing tests. You also loaded the insertion sort from the first sorting lab with the command above, so we will use it for comparison.
5. The following command will create an unordered list of 5,000 random
integers, each between 0 and 999. We will use this list to begin our
testing.
> (define test5k (make-list 5000))
6. Using whatever timepiece you have available, keep track of how long it takes for each of the following commands to complete. Watch carefully for the next prompt to appear; that is your signal that the program has completed. Be sure to write down your results.
> (define 5k-i (insertion-sort test5k))
> (define 5k-m (Merge-sort test5k compare-numbers))
> (define 5k-q (Quicksort test5k compare-numbers))
7. Now let's see what happens when we use mostly-ordered lists. Use the following commands to set up new lists and test them with our sorting procedures. Time the procedures and write down your results.
> (define 5k-i (append 5k-i '(1340 30 1253 795 2640 230 1001)))
> (define 5k-i (insertion-sort 5k-i))
> (define 5k-m (append 5k-m '(1340 30 1253 795 2640 230 1001)))
> (define 5k-m (Merge-sort 5k-m compare-numbers))
> (define 5k-q (append 5k-q '(1340 30 1253 795 2640 230 1001)))
> (define 5k-q (Quicksort 5k-q compare-numbers))
Why do you think the results were so different?
8. Now make a list with twice as many elements, using the command:
> (define test10k (make-list 10000))
Consider that the insertion sort runs in O(n²) operations and the recursive sorts run in O(n*logn) operations. Using your data for the measurement on a randomly ordered list of 5,000 elements, predict how long each algorithm should take to run on a 10,000 element list. Do a time trial for each of the three sorting procedures on this new list to check your predictions. Use these commands:
> (define 10k-i (insertion-sort test10k))
> (define 10k-m (Merge-sort test10k compare-numbers))
> (define 10k-q (Quicksort test10k compare-numbers))
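The growth rates quoted in step 8 can be turned into rough predictions by taking ratios, since the unknown constant factor cancels. The sketch below is only an illustration: the procedure names quadratic-factor and nlogn-factor are made up for this example and are not part of sorting-helpers.ss, and the estimate ignores lower-order terms and system effects such as garbage collection.

```scheme
;; Hypothetical helpers: by what factor should running time grow
;; when the list size goes from n elements to m elements?

;; For an O(n^2) algorithm the ratio is m^2 / n^2.
(define quadratic-factor
  (lambda (n m)
    (/ (* m m) (* n n))))

;; For an O(n*logn) algorithm the ratio is (m log m) / (n log n).
;; The base of the logarithm does not matter because it cancels.
(define nlogn-factor
  (lambda (n m)
    (/ (* m (log m)) (* n (log n)))))

(quadratic-factor 5000 10000)   ; => 4
(nlogn-factor 5000 10000)       ; roughly 2.16
```

Multiplying your measured 5,000-element time by the matching factor gives the prediction to compare against your time trial.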
9. Take more data points for each sorting procedure. For the insertion sort, time it on lists of 100, 1,000, and 3,000 elements, and for the recursive sorts test on lists of 20,000, 50,000, and 100,000 elements. For each algorithm, make a plot of running time versus size of the list. What trends do you see? For each algorithm, by what sort of mathematical function does running time seem to increase with size of the list?
Sorting Vectors:
As you learned in your reading, the overhead of list operations makes a sorting algorithm like Quicksort or Merge sort less efficient. If we can operate on vectors rather than lists, we may see some improvement in running time because the (invisible) constant factor in O(n*logn) becomes smaller.
10. Use the following sequence of commands to investigate the vector Quicksort as given in your text on page 495, and time the algorithm just as before. The commands assume that test100k names the 100,000-element list you built in step 9.
> (define vtest100k (list->vector test100k))
> (define v100k-q (qsort vtest100k 0 99999))
Is there a large difference between the vector Quicksort and the list Quicksort?
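To make the idea of sorting a vector in place concrete, here is a minimal sketch of one possible vector Quicksort. It is not the qsort from page 495 of the text, which may choose its pivot and partition differently; the names vector-swap!, partition-range!, and vector-quicksort! are hypothetical, and the sketch handles only numbers compared with <.

```scheme
;; Hypothetical sketch, not the text's qsort.

;; Exchange the elements at positions i and j of vector v.
(define vector-swap!
  (lambda (v i j)
    (let ((tmp (vector-ref v i)))
      (vector-set! v i (vector-ref v j))
      (vector-set! v j tmp))))

;; Rearrange v[low..high] around the pivot v[high] and return the
;; pivot's final position. Elements smaller than the pivot end up to
;; its left; all others end up to its right.
(define partition-range!
  (lambda (v low high)
    (let ((pivot (vector-ref v high)))
      (let loop ((i low) (j low))
        (cond ((= j high)
               (vector-swap! v i high)
               i)
              ((< (vector-ref v j) pivot)
               (vector-swap! v i j)
               (loop (+ i 1) (+ j 1)))
              (else (loop i (+ j 1))))))))

;; Sort v[low..high] in place and return v.
(define vector-quicksort!
  (lambda (v low high)
    (if (< low high)
        (let ((p (partition-range! v low high)))
          (vector-quicksort! v low (- p 1))
          (vector-quicksort! v (+ p 1) high)))
    v))

(vector-quicksort! (vector 3 1 4 1 5 9 2 6 5 3) 0 9)
;; => #(1 1 2 3 3 4 5 5 6 9)
```

Like the call to qsort in step 10, it takes the vector and the first and last index of the range to sort. No new lists are built along the way: elements are only swapped within the vector.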
11. (Extra Credit) Rewrite your Merge sort algorithm and the
merge-lists helper for a vector implementation. Do the same timing
comparison as above for the two implementations of Merge sort. Code for
merge-lists is available in the text on page 445 or you can copy it to your
account from
~barkley/scheme/merge-lists.ss
Some Questions:
As you consider these, think about the most efficient way to solve each problem, both in terms of programmer time (e.g., an insertion sort is simple and quick to code) and in terms of running time.
A. Consider a set of data gathered in a large random survey where questionnaires are mailed back from survey-takers to the company. If the company put records into a database in the order that the questionnaires came in, and then wanted an alphabetical listing of those who returned surveys, which of the sorting algorithms that you have seen should be the most efficient and why?
B. Consider an established phone directory that is updated yearly, when new listings are added to an old set of data. Which sorting algorithm should be the most efficient and why?
C. Each year Grinnell College publishes a booklet with a listing of new students and their pictures. Naturally, the listing is in alphabetical order by last name. Inside the back cover of this booklet is a different listing of the same students, a listing that may be even more useful to people trying to find out who other people are. The second listing is alphabetical by first name rather than last, so students can match the first name and image of someone they just met with that student's last name as well. No graduating class at Grinnell is very large, none containing more than a few hundred students. If you had this listing of last names and wanted to sort it by first names, what sorting algorithm would you use and why?
D. Suppose a very large university, maybe 10,000+ students per class, wanted to make a similar publication, and needed to sort the last-name alphabetical listing by first name. What sorting procedure would be best in this case, and why?
Work to be turned in:
This document is available on the World Wide Web as
http://www.math.grin.edu/~walker/courses/153/.html http://www.math.grin.edu/~barkley/lab-sorting.html
created April 28, 1998 by Scott G. Barkley
revised April 29, 1998 by Henry M. Walker