Laboratory Exercises For Computer Science 151

An Introduction to Files

An Introduction to Files

Goals: This laboratory exercise introduces the motivation for using file storage and outlines a general approach for processing files.

Some Motivations for File Storage

Up to this point, all programs developed for this course have stored their data in the computer's main memory. This has the advantage that the computer can process data in main memory more quickly and efficiently that it can if data are stored elsewhere.

On the other hand, storing data in main memory also creates some difficulties. For example, all data must be provided to the program each time the program is run, and the amount of data is limited by the size of main memory. In addition, except for information specified in define statements, any data passed through parameters or computed during procedure execution are destroyed after each procedure is completed executing, and data are not stored from one use of Scheme to the next.

This and the next several labs introduce files as a mechanism to overcome some of these desadvantages. Specifically, disk files provide a way to store information in bulk over a long period of time. In applications, we may give up some efficiency, as data on files normally must be read into main memory before the data can be used. Similarly, any results we want stored permanently must be written out explicitly to a disk file. While this storing and retrieving of data requires some specific work, the use of files does overcome the disadvantages mentioned above.

Since the storage of data within files is so important, at least three perspectives have evolved for file processing:

  1. streams: files are considered to contain a sequence or stream of data, such as numbers or characters,

  2. text files: data within files are considered to be organized line-by-line, and

  3. random-access files: data are placed in files, to be accessed in any order.

This lab introduces file streams, which are viewed as sequences of data; the next lab discusses text files. Random-access files constitute a more advanced topic and are beyond the scope of this course.

To introduce some basic ideas of streams, this lab begins by developing solutions for three problems. In each case, we assume that data have been placed in a file using an editor (such as xemacs). That is, we assume that we have typed data into an editor and saved the data in a file which we have named. As with other files we edit, we could go back to our editor to revise or expand the data whenever we wanted.

For the problems that follow, the lab first describes a solution by using pseudocode -- a step-by-step outline of what we need to do to solve the problem. Then, the lab develops the Scheme implementations of those solutions.


Problem 1: Find the sum of the numbers in a given file.

Solution Outline: The general form for this problem follows a common approach for much file processing. First, we open the file; then, we read and process data; finally, we close the file and print the results. The pseudocode solution that follows adds a few more details:

  1. Open the file containing the numbers.
  2. Initialize a running total to 0.
  3. Until the end of the file is reached:
    1. Read in the next number from the file.
    2. Add it to the running total.
  4. Close the file.
  5. Report the final value of the running total.

Solution in Scheme: Here are the built-in Scheme procedures, together with some programming hints, that can help us do various parts of this job:

Here is the Scheme implementation of the pseudocode:
(define sum-of-file
  (lambda (source-file-name)
  ;Pre-condition:  source-file-name is the logical name of a file of numbers
  ;Post-condition:  returns sum of numbers in the given file
    (let ((source (open-input-file source-file-name)))
                                        ; Open the file.
      (let loop ((total 0)              ; Initialize the running total.
                 (next (read source)))  ; Try to read a number.
        (if (eof-object? next)          ; If you get the end-of-file object,
            (begin
              (close-input-port source) ; close the file
              total)                    ; and report the final total.
            (loop (+ next total)        ; Otherwise, add the number to
                                        ;    the running total,
                  (read source)))))))   ; try to read another number,
                                        ;    and repeat the loop.
A typical interaction using this procedure would look like this:
> (sum-of-file "/home/walker/151s/labs/file1.dat")
200
The file /home/walker/151s/labs/file1.dat contains the four numbers
50
50
75
25


  1. Compare this program with the pseudocode solution given earlier. Identify how each part of the pseudocode is reflected in the Scheme code.

  2. Trace this program for the data file given. Be sure you can explain how it works.

  3. Explain why the expression (read source) appears twice in the above code. What is the purpose of each appearance of this expression?

In reviewing this processing, note that the data file was viewed as containing a sequence or stream of numbers. Processing data in this file then involved reading the numbers, value-by-value, until we reached the end of the file.


Character-by-Character Processing:

Problem 2: Copy one file to another. That is, create a second file which is character-by-character identical to the first.

Solution Outline: This time we view the file as a sequence of characters. The pseudocode solution to this problem is:

  1. Open the input file.
  2. Open the output file.
  3. Read a character
  4. Until the end of the file is reached:
    1. Write the previous character
    2. Read the next character
  5. Close the input file.
  6. Close the output file.

Solution in Scheme: Scheme contains procedures read-char and write-char for processing one character, just as procedures read and write or display handle numbers or symbols or strings or Booleans or lists. That is, read-char and write-char transfer a single character from or to the file. As with the read procedure, read-char returns an end-of-file object when it comes to the end of a file.

Just as we used the open-input-file and close-input-port when reading from a file, we use procedures open-output-file and close-output-port when writing to a file.

The Scheme implementation of the above outline follows.

(define copy-file
  (lambda (source-file-name target-file-name)
  ;Pre-condition:  source-file-name is the logical name of a file
  ;Post-condition:  copies contents of source file to target file
    (let ((source (open-input-file source-file-name))
          (target (open-output-file target-file-name)))
      (let loop ((ch (read-char source)))
        (if (eof-object? ch)
            (begin
              (close-input-port source)
              (close-output-port target))
            (begin
              (write-char ch target)
              (loop (read-char source))))))))


  1. Check that this procedure works as claimed by using it to copy the file /home/walker/151s/labs/file2.dat to a file named lab.data in your account.

  2. Modify this procedure so that every lower-case letter that is read in is converted to upper case before being written to the output file.

  3. Write a Scheme procedure tally-char that takes two arguments, the name of an input file and a character, and returns a tally of the number of occurrences of that character in the specified file.

    (tally-char "/home/walker/151s/labs/file1.dat" #\5) ===> 4
    (tally-char "/home/walker/151s/labs/file2.dat" #\0) ===> 16
    (tally-char "/home/walker/151s/labs/file2.dat" #\newline) ===> 3
    
    Hint: Within a main loop, add a parameter to contain the desired count, and update the count appropriately in any recursive call(s).

  4. Assume that a sentence is any sequence of characters ending with a period, question mark, or exclamation point. Modify tally-char, to get a procedure count-sentences, which determines the number of sentences in a file.

    File /home/walker/151s/labs/lab-file-description contains the first paragraph of introductory material in this lab, starting "Up to this point, ...". Check that this paragraph contains 2 sentences:

    (count-sentences "/home/walker/151s/labs/lab-file-description") 
         ===> 2
    


  1. Write and test a Scheme procedure that takes two arguments -- the name of an input file containing zero or more integers, and the name of an output file to be created by the procedure -- and copies each integer from the input file to the output file if it is in the range from 0 to 99. Values outside of this range should be read in but not copied out again. The idea is that this procedure will act as a filter, ensuring that only the values that are in the correct range will make it into the output file.

    Line breaks in the input file should be ignored. In the output file, arrange for each integer to be printed on a line by itself.


Counting Words in a File:

The following problem may be useful in estimating the length of a paper, as one rule of thumb states that a typical page of double spaced manuscript contains about 12 words per line and about 27 usable lines (excluding margins) for a total of about 324 words.

Problem 3: Approximate the number of words in a file.

Solution Outline: To approach this problem, we need a working definition of a "word". One possibility is that a word is any sequence of character that begins with a letter and ends with whitespace (e.g., a space, a tab, or a newline).

To count words, we read a file one character at a time. When we first encounter a letter, we have another word. We then skip over additional letters (or non-white space) as part of that word. One possible outline for such a solution follows:

  1. Open the file.
  2. Continue until all characters are read:
    1. Read characters to locate the start of a word.
    2. If a word is found,
      1. increase the word count by 1,
      2. read characters to locate the end of the word.
  3. Close the file.
  4. Report the final value of the running total.

Solution in Scheme: Since the finding the start or the end of a word requires checking various details, we write steps 2A and 2Bii as separate, mutually-recursive, local procedures. Thus, find-start-word reads character-by-character until finding the start of the word (step 2A), while find-end-word reads character-by-character until the end of the current word (step 2Bii). All of step 2 is accomplished by having these two procedures call each other. The resulting code follows:

(define count-words
   (lambda (source-file-name)
   ;Pre-condition:  source-file-name is the logical name of a file
   ;Post-condition:  returns number of words in the given file
      (letrec ((source (open-input-file source-file-name))
               (print-result
                  (lambda (count)
                      (display "File ")
                      (display source-file-name)
                      (display " contains ")
                      (display count)
                      (display " words.")
                      (newline)
                  ))
               (find-start-word 
                  (lambda (next-char count)
                     (cond ((eof-object? next-char) 
                                (begin
                                    (close-input-port source)
                                    (print-result count)
                                ))
                           ((char-alphabetic? next-char) 
                                (find-end-word (read-char source) (+ 1 count)))
                           (else 
                                (find-start-word (read-char source) count))
                     )
                  ))
               (find-end-word 
                  (lambda (next-char count)
                     (cond ((eof-object? next-char)
                                (begin
                                    (close-input-port source)
                                    (print-result count)
                                ))
                           ((char-whitespace? next-char) 
                                (find-start-word (read-char source) count))
                           (else 
                                (find-end-word (read-char source) count))
                     )
                  ))
               )
          (find-start-word (read-char source) 0)
      )
   )
)
  1. Check that this program works by running it on the file /home/walker/151s/labs/lab-file-description which contains the first paragraph of introductory material in this lab, starting "Up to this point, ...". (The paragraph contains 44 words.)

  2. Describe in several sentences how this program works.

    1. What is the purpose of each local procedure?
    2. Why is a letrec used, rather than a named let expression.
    3. find-start-word contains several parameters. What is the purpose of each?
    4. Procedure names normally should be chosen to describe what the procedure does. Do you think the name find-start-word is sufficiently descriptive? If so, explain why. If not, suggest a better name.

  3. count-words is incomplete, in that it contains few comments beyond pre- and post-conditions. Add appropriate commentary to clarify the purpose of each of main part of the code.

  4. Modify count-words so that a word is considered to be only a sequence of letters. That is, for this part, a word is a sequence of letters -- without punctuation or digits.

  5. Challenge Problem: Use the ideas of count-words and count-sentences to write procedure average-words, which determines the average number of words in a sentence. Note that for efficiency, average-words should only read through the file once.


This document is available on the World Wide Web as

http://www.math.grin.edu/~walker/courses/153.sp00/lab-file-intro.html

material for Problems 1 and 2 created in two labs on March 11, 1997 by John D. Stone
material merged and reorganized April 5, 1999 by Clif Flynt and Henry M. Walker
last revised January 10, 2000 by Henry M. Walker