Goals: This laboratory exercise introduces some principles of algorithm effectiveness, including the amount of time and memory required for the algorithm. Big-O notation is introduced to provide an informal measure of the time or space required by an algorithm. These ideas are applied to the linear and binary search algorithms, discussed in the lab on searching.
In considering the solution to a problem, it is natural to ask how effective that solution might be. Also, when comparing solutions, one might wonder whether one solution is better than another. Altogether, one might use many criteria to evaluate such solutions, including the amount of time a solution requires and the amount of memory it uses.
The analysis of instructions may take into account the nature of the data -- for example, one might consider what happens in a worst case. Also, such analysis commonly is based on the size of the data being processed -- the number of items or how large or small the data are. This is sometimes called a microanalysis of program execution. Once again, however, the specific instructions may vary from machine to machine, and detailed conclusions from one machine may not apply to another.
A high-level analysis may identify types of activities performed, without considering exact timings of instructions. This is sometimes called a macroanalysis of program execution. This can give a helpful overall assessment of an algorithm, based on the size of the data. However, such an analysis cannot show fine variations among algorithms or machines.
For many purposes, it turns out that a high-level analysis provides adequate information to compare algorithms. For the most part, we follow that approach here.
int[] a = new int[arraySize];
...

// linear search algorithm
j = 0;
while (j < a.length && item != a[j])
    j++;
result = (j != a.length);
In executing this code, the machine first initializes j, then works through the loop (say, t times), and finally computes the result. In working through the loop, the condition (j < a.length && item != a[j]) is checked on each pass and once more when the loop ends (t+1 times in all), and the variable j is incremented t times. Putting all of this together, the amount of work is: one initialization of j, t+1 checks of the loop condition, t increments of j, and one final computation of result.
Of course, the amount of time for each action varies from one machine to another. However, suppose that A is the time for initialization, C is the time for checking the loop condition once, I is the time for incrementing j once, and F is the time required for the final computation. Then, the total time for the computation will be:
Overall time = A + (t+1)C + tI + F = t(C+I) + (A+C+F)
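To make this accounting concrete, the following sketch instruments the linear search so that it counts the condition checks and increments directly; the class name LinearSearchCount, the sample data, and the search value are illustrative assumptions, not part of the lab code:

// Sketch: count the work done by the linear search (illustrative only).
public class LinearSearchCount {
    public static void main(String[] args) {
        int[] a = {4, 8, 15, 16, 23, 42};    // sample data (an assumption)
        int item = 99;                        // not in the array, so this is a worst case

        int checks = 0;        // evaluations of the loop condition
        int increments = 0;    // increments of j
        int j = 0;
        while (true) {
            checks++;                                   // one condition check
            if (!(j < a.length && item != a[j]))
                break;                                  // the loop condition failed
            j++;
            increments++;                               // one increment of j
        }
        boolean result = (j != a.length);

        // For an unsuccessful search, expect checks = N+1 and increments = N.
        System.out.println("found=" + result
            + "  checks=" + checks + "  increments=" + increments);
    }
}

Running this sketch with an item that is not present reports N+1 condition checks and N increments, matching the case t = N discussed below.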
Next, suppose the array contains N elements. How many times might we expect to go through the loop? That is, what is a reasonable estimate for t?
If the desired item is not in the array, the answer is easy: we must examine every element of the array before concluding the item is not present, so t = N. If the item is in the array, we might be lucky and find it at the very beginning of the search, or we might be unlucky and find it at the very end. On average, we might expect to go about halfway through the array. This analysis gives rise to three alternatives: a best case (the item is found immediately, so t = 1), an average case (t is about N/2), and a worst case (t = N).
In practice, it is rarely realistic to hope for the best case, and computer scientists tend not to spend much time analyzing this possibility. The average case often is of interest, but is sometimes hard to estimate. Thus, computer scientists often focus on the worst case. The worst case gives a pessimistic, but possible, view, and usually it is relatively easy to identify. In this case, the average-case and worst-case analyses have similar forms, although the constants are different:

Average time = (N/2)(C+I) + (A+C+F)
Worst-case time = N(C+I) + (A+C+F)
In a microanalysis, we now could substitute specific values for the various constants to describe the precise amount of time required for the linear search on a specific machine. While this might be helpful for a specific environment, we would have to redo the analysis for each new machine (and compiler). Instead, we take a more conceptual view. The key point of these expressions is that they represent lines -- a linear relationship between the overall time and the size N of the array.
Also, for relatively large values of N (i.e., for large arrays), the constant term A+C+F will have relatively little effect. We can summarize this qualitative analysis by saying that the overall time is approximately constant * N. Since the constant depends on details of a machine and compiler, we focus on the dominant term (ignoring constants), and we say the linear search has order N, written O(N).
The following table gives experimental measurements for the average time required for a linear search for several search trials.
Array Size | Average Time If Value Found | Average Time If Value Not Found
---|---|---
1000 | 620 | 1248
2000 | 1260 | 2490
4000 | 2540 | 4960
Estimate the time for an average linear search of arrays of size 1500, 3000, 8000, and 16000. Briefly justify your answers.
In the previous lab, you developed code to search for an item in an array using a binary search. What follows is one possible version of this code:
// binary search algorithm
lo = 0;
hi = a.length;
mid = (hi + lo)/2;
result = false;
while (!result && lo < hi) {
    if (a[mid] == item)
        result = true;
    else if (a[mid] < item)
        lo = mid + 1;
    else
        hi = mid;
    mid = (hi + lo)/2;
}
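For reference, here is a minimal, self-contained sketch that wraps the same logic in a method; the class name, method name, and sample data are illustrative, and the array is assumed to be sorted in increasing order:

// Sketch: the binary search above packaged as a method (illustrative only).
public class BinarySearchDemo {
    // Returns true if item occurs in the sorted array a.
    public static boolean binarySearch(int[] a, int item) {
        int lo = 0;
        int hi = a.length;
        int mid = (hi + lo) / 2;
        boolean result = false;
        while (!result && lo < hi) {
            if (a[mid] == item)
                result = true;        // found the item
            else if (a[mid] < item)
                lo = mid + 1;         // keep only the upper half
            else
                hi = mid;             // keep only the lower half
            mid = (hi + lo) / 2;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {2, 3, 5, 7, 11, 13, 17};          // sample sorted data
        System.out.println(binarySearch(a, 11));     // prints true
        System.out.println(binarySearch(a, 4));      // prints false
    }
}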
As for the linear search, we would like to estimate the work involved to locate an item in array a, which we will assume has size N. This code allows somewhat more variety than the linear search, as the work within the loop involves several options (either of two conditions could be true or false, and various assignments could result). Thus, we will need some averages for the work done at various stages. Suppose I is the time for initialization, C is the time for checking the loop condition once, and L is the average time required to execute the if statements in the body of the loop once. Suppose also that the loop is executed t times. Then, the total time for the computation will be:
I + (t+1)C + tL = t(C+L) + (I+C)
While this provides a good start for the analysis, we need some additional study to determine how t relates to the array size N. Here, we might be lucky and find the desired item on the first test, but that seems unlikely, and we ignore that possibility. Also, an average-case analysis is a bit tricky here, so we focus on the worst-case. In the binary search, we start by considering the entire array -- of size N. After one step, we have checked the middle of this array, determined which half the item might be in, and restricted our search to that half. After the second step, we have checked the middle of this half, and restricted the search to half of the half -- or a quarter of the array. More generally, at each stage, the size of the array segment under consideration is halved again. This progression of sizes is shown in the following table:
Step Number | Size of Array Still Under Consideration
---|---
0 | N
1 | N/2 = N/2^1
2 | N/4 = N/2^2
3 | N/8 = N/2^3
... | ...
t | N/2^t
The process continues until there is nothing left to search. That is, the size of the array segment under consideration should be less than 1, or N/2^t < 1. This will happen when N is about 2^t. Solving for t gives t = log2 N. Plugging this into the above equation gives:
Overall time = (log2 N)(C+L) + (I+C)
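As a rough check on this estimate, the following sketch counts how many times the binary search loop actually executes for an unsuccessful (worst-case) search and compares that count with log2 N; the class name, the even-valued test data, and the choice of array sizes are assumptions made for illustration:

// Sketch: compare the binary search loop count with log2(N) (illustrative only).
public class BinarySearchSteps {
    // Counts loop iterations for an unsuccessful search in a sorted array of size n.
    static int countSteps(int n, int item) {
        int[] a = new int[n];
        for (int k = 0; k < n; k++)
            a[k] = 2 * k;                    // sorted even values, so odd items are never found

        int lo = 0, hi = a.length, mid = (hi + lo) / 2;
        boolean result = false;
        int steps = 0;
        while (!result && lo < hi) {
            steps++;                         // one pass through the loop body
            if (a[mid] == item)
                result = true;
            else if (a[mid] < item)
                lo = mid + 1;
            else
                hi = mid;
            mid = (hi + lo) / 2;
        }
        return steps;
    }

    public static void main(String[] args) {
        for (int n = 1000; n <= 16000; n *= 2) {
            double log2n = Math.log(n) / Math.log(2);
            System.out.println("N=" + n + "  loop iterations=" + countSteps(n, 1)
                + "  log2(N)=" + String.format("%.1f", log2n));
        }
    }
}

For each array size, the printed loop count stays close to log2 N, and doubling N adds only about one more iteration.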
As before, a macroanalysis ignores the proportionality constants that appear in a microanalysis: differences from machine to machine may change a proportionality constant, but not the nature of the main terms. As we suggested informally before, the order of an algorithm describes the time the algorithm requires, ignoring such proportionality constants. In this case, we say a binary search has order log2 N, written O(log2 N). The overall shape of the timing curve follows the logarithm function, which rises quickly for small N and then flattens out as N grows.
While this analysis may seem rough, it still can provide some useful insights. For example, the function log2N increases by only 1 if N doubles. Applying this to the above estimate of overall time for the binary search, if the size of an array doubles, then we would expect the time for a binary search to increase only by a small, constant amount (C+L in the above formula).
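For instance, log2 1000 is about 10 and log2 2000 is about 11, so doubling an array of 1000 elements should add roughly one more pass through the loop rather than doubling the search time.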
Summarize the above analysis in your own words in a paragraph.
Continuing part 1 above, the following table gives experimental measurements for the average time required for a binary search for several search trials.
Array Size | Average Time If Value Found | Average Time If Value Not Found
---|---|---
1000 | 33 | 33
2000 | 37 | 37
4000 | 41 | 41
Why do you think the timings here are about the same, regardless of whether the item is found or not?
Estimate the time for an average binary search of arrays of size 1500, 3000, 8000, and 16000. Briefly justify your answers.
Be sure you do the previous parts of this lab before proceeding!!
Program searchTest.java provides a framework for timing the linear and binary search algorithms, as described above. This program illustrates the use of a timing method System.currentTimeMillis(), which returns a time in milliseconds. As the algorithms run very quickly, the program repeats each search 1000 times, so timing measurements in milliseconds will yield appropriate numbers.
The program asks the user to set the minimum and maximum array sizes to be tested, as well as the number of trials to be run at each array size. Program execution then picks elements at random, applies the search algorithms, and reports the timings. After arrays of one size are tested, the array size is doubled, and the process repeats.
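To see how such timing can be organized, here is a minimal sketch of a timing harness; it is written in the spirit of searchTest.java but is not the actual program, and the method names, the ascending array contents, the randomly chosen search value, and the 1000-repetition count are assumptions for illustration:

import java.util.Random;

// Sketch of a timing harness (not the actual searchTest.java).
public class TimingSketch {
    // Linear search, as analyzed earlier in this lab.
    static boolean linearSearch(int[] a, int item) {
        int j = 0;
        while (j < a.length && item != a[j])
            j++;
        return j != a.length;
    }

    public static void main(String[] args) {
        Random rand = new Random();
        for (int size = 1000; size <= 16000; size *= 2) {
            int[] a = new int[size];
            for (int k = 0; k < size; k++)
                a[k] = k;                        // ascending values (an assumption)

            int item = rand.nextInt(size);       // a value known to be in the array

            // Repeat the search 1000 times so the elapsed milliseconds are measurable.
            long start = System.currentTimeMillis();
            for (int rep = 0; rep < 1000; rep++)
                linearSearch(a, item);
            long elapsed = System.currentTimeMillis() - start;

            System.out.println("size " + size + ": " + elapsed + " ms for 1000 linear searches");
        }
    }
}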
Copy searchTest.java to your account, compile it, and run it several times.
Review the code, and write a paragraph explaining how the code generates its output table. For example, be sure to identify what elements are placed in an array, how an item is selected for starting the search, how timing is done, and what algorithms are tried when.
Run the program for array sizes 1000 through 16000. Then run the program again for array sizes 1500 through 3000. Occasionally, one value in the table may be significantly larger than others. Such anomalies may be explained by various technical details of the operating system and machine environment. Ignoring any such unusual values, how do the results obtained from these runs compare with your estimates earlier in this lab? Briefly discuss any similarities or differences.
This document is available on the World Wide Web as
http://www.math.grin.edu/~walker/courses/153.sp00/lab-complexity.html