CSC 153: Computer Science Fundamentals Grinnell College Spring, 2005
 
Laboratory Exercise Reading
 

An Introduction to Files

Abstract

This reading introduces the motivation for using file storage and outlines a general approach for processing files.

Some Motivations for File Storage

Up to this point, all programs developed for this course have stored their data in the computer's main memory. This has the advantage that the computer can process data in main memory more quickly and efficiently that it can if data are stored elsewhere.

On the other hand, storing data in main memory also creates some difficulties. For example, all data must be provided to the program each time the program is run, and the amount of data is limited by the size of main memory. In addition, except for information specified in define statements, any data passed through parameters or computed during procedure execution are destroyed after each procedure is completed executing, and data are not stored from one use of Scheme to the next.

This and the next lab introduce files as a mechanism to overcome some of these disadvantages. Specifically, disk files provide a way to store information in bulk over a long period of time. In applications, we may give up some efficiency, as data on files normally must be read into main memory before the data can be used. Similarly, any results we want stored permanently must be written out explicitly to a disk file. While this storing and retrieving of data requires some specific work, the use of files overcomes the disadvantages mentioned above.

Since the storage of data within files is so important, at least three perspectives have evolved for file processing:

  1. streams: files are considered to contain a sequence or stream of data, such as numbers or characters,

  2. text files: data within files are considered to be organized line-by-line, and

  3. random-access files: data are placed in files, to be accessed in any order.

This lab introduces file streams, which are viewed as sequences of data; the next lab discusses text files. Random-access files constitute a more advanced topic and are beyond the scope of this course.

To introduce some basic ideas of streams, this lab begins by developing solutions for three problems. In each case, we assume that data have been placed in a file using an editor (such as emacs). That is, we assume that we have typed data into an editor and saved the data in a file which we have named. As with other files we edit, we could go back to our editor to revise or expand the data whenever we wanted.

For the problems that follow, the lab first describes a solution by using pseudocode -- a step-by-step outline of what we need to do to solve the problem. Then, the lab develops the Scheme implementations of those solutions.

Problem 1:

Find the sum of the numbers in a given file.

Solution Outline:

The general form for this problem follows a common approach for much file processing. First, we open the file; then, we read and process data; finally, we close the file and print the results. The pseudocode solution that follows adds a few more details:

  1. Open the file containing the numbers.
  2. Initialize a running total to 0.
  3. Until the end of the file is reached:
    1. Read in the next number from the file.
    2. Add it to the running total.
  4. Close the file.
  5. Report the final value of the running total.

Solution in Scheme:

Here are the built-in Scheme procedures, together with some programming hints, that can help us do various parts of this job:

Here is the Scheme implementation of the pseudocode:
(define sum-of-file
  (lambda (source-file-name)
  ;Pre-condition:  source-file-name is the logical name of a file of numbers
  ;Post-condition:  returns sum of numbers in the given file
    (let ((source (open-input-file source-file-name)))
                                        ; Open the file.
      (let loop ((total 0)              ; Initialize the running total.
                 (next (read source)))  ; Try to read a number.
        (if (eof-object? next)          ; If you get the end-of-file object,
            (begin
              (close-input-port source) ; close the file
              total)                    ; and report the final total.
            (loop (+ next total)        ; Otherwise, add the number to
                                        ;    the running total,
                  (read source)))))))   ; try to read another number,
                                        ;    and repeat the loop.
A typical interaction using this procedure would look like this:
> (sum-of-file "/home/walker/151s/labs/file1.dat")
200
The file /home/walker/151s/labs/file1.dat contains the four numbers
50
50
75
25

In reviewing this processing, note that the data file was viewed as containing a sequence or stream of numbers. Processing data in this file then involved reading the numbers, value-by-value, until we reached the end of the file.

Character-by-Character Processing:

Problem 2:

Copy one file to another. That is, create a second file which is character-by-character identical to the first.

Solution Outline:

This time we view the file as a sequence of characters. The pseudocode solution to this problem is:

  1. Open the input file.
  2. Open the output file.
  3. Read a character
  4. Until the end of the file is reached:
    1. Write the previous character
    2. Read the next character
  5. Close the input file.
  6. Close the output file.

Solution in Scheme:

Scheme contains procedures read-char and write-char for processing one character, just as procedures read and write or display handle numbers or symbols or strings or Booleans or lists. That is, read-char and write-char transfer a single character from or to the file. As with the read procedure, read-char returns an end-of-file object when it comes to the end of a file.

Just as we used the open-input-file and close-input-port when reading from a file, we use procedures open-output-file and close-output-port when writing to a file.

The Scheme implementation of the above outline follows.

(define copy-file
  (lambda (source-file-name target-file-name)
  ;Pre-condition:  source-file-name is the logical name of a file
  ;Post-condition:  copies contents of source file to target file
    (let ((source (open-input-file source-file-name))
          (target (open-output-file target-file-name)))
      (let loop ((ch (read-char source)))
        (if (eof-object? ch)
            (begin
              (close-input-port source)
              (close-output-port target))
            (begin
              (write-char ch target)
              (loop (read-char source))))))))

Counting Words in a File:

The following problem may be useful in estimating the length of a paper, as one rule of thumb states that a typical page of double spaced manuscript contains about 12 words per line and about 27 usable lines (excluding margins) for a total of about 324 words.

Problem 3:

Approximate the number of words in a file.

Solution Outline:

To approach this problem, we need a working definition of a "word". One possibility is that a word is any sequence of character that begins with a letter and ends with whitespace (e.g., a space, a tab, or a newline).

To count words, we read a file one character at a time. When we first encounter a letter, we have another word. We then skip over additional letters (or non-white space) as part of that word. One possible outline for such a solution follows:

  1. Open the file.
  2. Continue until all characters are read:
    1. Read characters to locate the start of a word.
    2. If a word is found,
      1. increase the word count by 1,
      2. read characters to locate the end of the word.
  3. Close the file.
  4. Report the final value of the running total.

Solution in Scheme:

Since the finding the start or the end of a word requires checking various details, we write steps 2A and 2Bii as separate, mutually-recursive, local procedures. Thus, find-start-word reads character-by-character until finding the start of the word (step 2A), while find-end-word reads character-by-character until the end of the current word (step 2Bii). All of step 2 is accomplished by having these two procedures call each other. The resulting code follows:

(define count-words
   (lambda (source-file-name)
   ;Pre-condition:  source-file-name is the logical name of a file
   ;Post-condition:  returns number of words in the given file
      (letrec ((source (open-input-file source-file-name))
               (print-result
                  (lambda (count)
                      (display "File ")
                      (display source-file-name)
                      (display " contains ")
                      (display count)
                      (display " words.")
                      (newline)
                  ))
               (find-start-word 
                  (lambda (next-char count)
                     (cond ((eof-object? next-char) 
                                (begin
                                    (close-input-port source)
                                    (print-result count)
                                ))
                           ((char-alphabetic? next-char) 
                                (find-end-word (read-char source) (+ 1 count)))
                           (else 
                                (find-start-word (read-char source) count))
                     )
                  ))
               (find-end-word 
                  (lambda (next-char count)
                     (cond ((eof-object? next-char)
                                (begin
                                    (close-input-port source)
                                    (print-result count)
                                ))
                           ((char-whitespace? next-char) 
                                (find-start-word (read-char source) count))
                           (else 
                                (find-end-word (read-char source) count))
                     )
                  ))
               )
          (find-start-word (read-char source) 0)
      )
   )
)

This document is available on the World Wide Web as

http://www.walker.cs.grinnell.edu/courses/153.sp05/labs/lab-file-intro.shtml

material for Problems 1 and 2 created in two labs on March 11, 1997 by John D. Stone
material merged and reorganized April 5, 1999 by Clif Flynt and Henry M. Walker
last revised February 7, 2005 by Henry M. Walker
Valid HTML 4.01! Valid CSS!
For more information, please contact Henry M. Walker at walker@cs.grinnell.edu.