Laboratory Exercises For Computer Science 151

Processing File Data Line-By-Line

Processing File Data Line-By-Line

Goals: This laboratory exercise introduces strategies for processing file data, when the data are organized line-by-line.


Text Files

In contrast to files considered as data streams, text files consider data as organized into lines. For example, the file /home/walker/151s/labs/state-income contains information about the median annual income for a 4-person family for the various states for the years 1997 back to 1979. The first part of the file looks like this:


                         Median Income for 4-Person Families, by State, According to the U.S. Census Bureau 
                        Reported November 3, 1999 on Web site http://www.census.gov/hhes/income/4person.html

Year                  1997  1996  1995  1994  1993  1992  1991  1990  1989  1988  1987  1986  1985  1984  1983  1982  1981  1980  1979

United States        53350 51518 49687 47012 45161 44615 43056 41451 40763 39051 36812 34716 32777 31097 29184 27619 26274 24332 22395
Alabama              48240 44879 42617 41730 37975 39659 37638 35937 34930 33022 31221 29799 28407 26595 25117 24181 22443 22026 18613
Alaska               57474 62078 56045 53555 51181 49632 49721 51538 48411 47247 47106 41292 42897 44017 38238 31823 35834 32745 31037
Arizona              47133 45032 44526 41599 39679 39900 39364 38799 38347 36892 35711 33477 32129 29431 27551 29835 25163 23832 23000
Arkansas             38646 36828 38520 36510 32594 36682 34566 31913 31853 28665 27415 27157 26255 23075 21524 20710 20583 19448 18493
California           55217 53807 51519 48755 44643 46774 46643 45184 42813 41425 40218 37655 36223 33711 31967 29885 27763 26070 25109
Colorado             58988 53632 50941 48801 47112 45021 43136 41803 40265 39095 37778 36026 35214 34154 32294 30663 28756 25943 25228
Connecticut          72706 67380 62157 62107 59288 55061 54479 53931 53313 50720 47195 44330 40677 39070 37703 35361 31108 28376 24410

To clarify this format, the first five lines contain header information (2 lines of title, a blank line, column headings for various years, and another blank line). Thereafter, the information about each state is on a separate line. Within a line, the state name is left justified in the first 21 characters, and income figures are in 6-character-wide columns (the income appears as 5 characters, and a blank spaces separates one year's income figure from the next).

Problem 1:

Three typical questions concerning this data follow:

  1. Extract from this file the name of each state and its median income for 1995. Put the results in a new file, called state-income-for-1995

  2. From this file, extract each state's median income for a given year.

  3. Review the data in this file to determine those states for which the median income decreased from some year to the next.

For example, given the above data, the program for Part A should generate a file that begins as follows:

State          1995 Median Income

United States        49687 
Alabama              42617 
Alaska               56045 
Arizona              44526 
Arkansas             38520 
California           51519 

High-Level Solution Outline: When data are organized line-by-line, a natural solution is given by the following pseudocode:

  1. Open the input file state-income.
  2. Open the output file state-income-for-1995.
  3. Print a table heading to the output file.
  4. Read and discard any header lines in the input file.
  5. Until the end of the file is reached:
    1. Read and process a line of input
  6. Close the input file.
  7. Close the output file.

While steps 1-4 and 6-7 are quite similar to what we have encountered with file streams, step 5 may be done in several ways.

Solution Outline -- Part A (and Part B): In this approach, we read and process a line in four separate steps, as shown in this somewhat more detailed, pseudocode solution:

  1. Open the input file.
  2. Open the output file.
  3. Print a table heading to the output file.
  4. Read and discard the first 5 lines of the input file.
  5. Until the end of the file is reached:
    1. Read a line of input, putting the characters into a string
    2. Print the first 21 characters as the state name
    3. Print characters 32-38 on the line as the 1995 median income.
    4. Print a newline character to the output file.
  6. Close the input file.
  7. Close the output file.

Solution in Scheme:

To read a line, we read character-by-character until reaching the #\newline character or the end of the file:


(define read-line-string
   (lambda (source)
   ;Pre-condition:  source designates a file open for reading
   ;Post-conditions:  returns the next line of the file
   ;                  or an eof-object if the end of the file is encountered
      (let loop ((ch-list '()) (ch (read-char source)))
         (cond ((eof-object? ch) ch)
               ((char=? ch #\newline) (list->string (reverse ch-list)))
               (else (loop (cons ch ch-list) (read-char source)))
         )
      )
   )
)

Steps 5B and 5C are accomplished using the substring procedure. The resulting procedure find-1995-income follows:


(define find-1995-income
   (lambda ()
   ;Pre-conditions:  None
   ;Post-conditions:  State and 1995-income data from the file
   ;                    /home/walker/151s/labs/state-income  to file
   ;                    state-income-for-1995
      (let ((source (open-input-file "/home/walker/151s/labs/state-income"))
            (target (open-output-file "state-income-for-1995")))
         ;print header and skip a line
         (display "State           1995 Median Income" target) 
         (newline target)
         (newline target)

         ;skip over first 5 input lines, as these are titles
         (read-line-string source)  
         (read-line-string source)  
         (read-line-string source)  
         (read-line-string source)  
         (read-line-string source)

         ;begin line-by-line processing
         (let loop ((state-data (read-line-string source)))
             (if (eof-object? state-data)
                 (begin ; finish processing
                     (close-input-port source)   ; close the input file
                     (close-output-port target)) ; and the output file.
                 (begin ; process next line
                     (display (substring state-data 0 21) target)
                     (display (substring state-data 32 38) target)
                     (newline target)
                     (loop (read-line-string source)))
             )
          )
       )
   )
)

When you run this procedure with the simple call (find-1995-income), nothing shows up on screen, because the last operation performed is the call to close-output-port, which returns an unspecified value (and Chez Scheme doesn't bother to print unspecified values). All the action takes place off stage, in the files.

Note that the creation of this file is invisible to the interactive Scheme user.


  1. After running the procedure as shown above, look at the data printed in state-income-for-1995 to be sure it contains what you expect. This might be done in either two ways: you could open the file with an editor such as XEmacs or you could view the file in an hpterm window with the cat or more commands.

  2. Explain how each part of the pseudocode is reflected in the Scheme code.

  3. Trace this program for the data file given. Be sure you can explain how it works.

  4. The solution to Part A may be modified to give a solution to Part B, as follows. Given a specified year, the subtraction 1997-year will indicate how many columns after the state must be skipped before finding the appropriate data. Since the state name occupies 21 spaces and each year occupies 6 characters, knowing the year's column allows you to compute where the income for any specified year is found.

    Modify find-1995-income to get a procedure find-year-income, which has a year as parameter and which extracts the relevant data from each line using substring for a computed starting and ending point.


Solution Outline -- Part C:

A second approach for line-by-line processing involves reading an entire line and then extracting the parts relevant to the problem. The main steps are given in the following pseudocode, which expands step 3 of the high level solution outline shown earlier.

  1. Open the input file.
  2. Until the end of the file is reached:
    1. Read a line of input as a string of characters
    2. Start with the first number on the line (starting from column 21)
    3. Until no more numbers appear on the line or a decrease is noted
      1. Extract the next number from the line
      2. If the new number is larger than the previous one,
        print the name and stop
        otherwise, continue using this new number for comparison
  3. Close the input file.

Solution in Scheme: This approach requires the introduction of no new Scheme procedures. Thus, we focus on step 2 for reading a line and extracting successive numbers. In steps 2B and 2C, the main work is to retrieve a number or word from the beginning of the string which has been read. One way to accomplish this is to first scan the start of the string, throwing away any white space (e.g., spaces or tabs). Then, we can continue scanning until we find the end of the word. In writing out the details, we may scan positions within the string read, or we may change the string to a list and extract the relevant parts of the list.

When we work within a string, it is convenient to have one procedure, find-word-start which scans character-by-character to find the beginning of a number or word. A second procedure, find-word-end then continues the search to find the end of that word. The following code illustrates this approach:

(define get-word
   (lambda (str)
   ;Pre-condition:  str is a character string
   ;Post-condition:  returns the first word in str, skipping over white space
      (letrec ((find-word-start  ; find starting index of first word in str
                  (lambda ()
                    (let loop ((index 0) (len (string-length str)))
                       (cond ((= index len) index)
                             ((char-whitespace? (string-ref str index))
                                   (loop (+ index 1) len))
                             (else index)
                        )
                    )
                  ))
               (find-word-end   ; find ending index of first word in str
                  (lambda (start-index)
                    (let loop ((index start-index) (len (string-length str)))
                        (cond ((= index len) index)
                              ((char-whitespace? (string-ref str index)) index)
                              (else (loop (+ index 1) len))
                        )
                    )
                  ))
                )
        (let* ((first-word-index (find-word-start))
               (end-word-index (find-word-end first-word-index)))
           (substring str first-word-index end-word-index)
        )
      )
   )
)

An alternative coding for get-word converts the string to a list. First, an outer named let expression removes initial white space. Once the first character of the number or word is found, we collect characers until the end of the number or word is reached.

(define get-word
   (lambda (str)
   ;Pre-condition:  str is a character string
   ;Post-condition:  returns the first word in str, skipping over white space
      (let find-first-nonwhite ((lst (string->list str)))
         (cond ((null? lst) "")
               ((char-whitespace? (car lst))
                    (find-first-nonwhite (cdr lst)))
               (else (let getword ((old-lst lst) (new-lst '()))
                         (if (or (null? old-lst)
                                 (char-whitespace? (car old-lst)))
                             (list->string (reverse new-lst))
                             (getword (cdr old-lst) 
                                      (cons (car old-lst) new-lst))
                         )
                     ))
         )
      )
   )
)


  1. Check each of these versions of get-word works correctly by testing them on the strings "12345 67890 a bc def" and " 3.141592 pi 2.71828 ".


Analogously to get-word, we may define a procedure chop-word which takes a string as parameter and which returns the string with the first word removed. Thus, chop-word would yield the following results:

(chop-word "12345 67890 a bc def") ==> " 67890 a bc def"
(chop-word "   3.141592    pi 2.71828  ") ==> "    pi 2.71828  "


  1. Write two versions of chop-word, using the two approaches to get-word as a model.

    Note: To get chop-word from the string version of get-word requires a change in only one line. Also, the list version of chop-word can be considerably simpler than the corresponding version of get-word -- at least for the getword named let expression.


Performing step 2 combines read-line-string, get-word and chop-word, as follows:

(define process-line
  (lambda (line)
  ;Pre-condition:  string line contains state name and median family incomes
  ;Post-condition:  state is printed if median income decreased
    (let* ((state-name (substring line 0 21))
           (number-str (substring line 21 (string-length line))))
       (let loop ((base-number (string->number (get-word number-str)))
                  (rest (chop-word number-str)))
           (let ((next-num-str (get-word rest)))
              (cond ((zero? (string-length next-num-str)) "done")
                    ((< base-number (string->number next-num-str))
                         (display state-name)
                         (newline)
                         "done")
                    (else (loop (string->number next-num-str)
                                (chop-word rest)))
              )
           )
       )
    )
  )
)
With all of this work done by process-line and the helping procedures already discussed, the main procedure is quite straightforward:
(define income-decreased
  (lambda (source-file-name)
  ;Pre-condition:  given file contains median state income data by year
  ;Post-condition:  states printed if median income decreased for some year
    (let ((source (open-input-file source-file-name)))  ; Open the input file.
       ;print header and skip a line
       (display "States with Decreased Median From Some Year to the Next") 
       (newline)
       (newline)

       ;skip over first 5 input lines
       (read-line-string source)  
       (read-line-string source)  
       (read-line-string source)  
       (read-line-string source)  
       (read-line-string source)

       ;begin line-by-line processing
       (let loop1 ((line (read-line-string source)))     ; Read next line
         (if (eof-object? line)               ; If you get the eof-object,
             (close-input-port source)        ; then close the input file
             (begin
                (process-line line)           ; otherwise, scan the line
                (loop1 (read-line-string source)))  ; and repeat
         )
       )
    )
  )
)
  1. Check that the resulting program works correctly by using the command:

    (income-decreased "/home/walker/151s/labs/state-income")
    
  2. Modify process-line to print each state and each year in which the median family decreased.

As a variation of this program, we might choose to read and process a line within process-line itself, so that this procedure seems more self contained. The revised code would be:

(define process-line
  (lambda (source)
  ;Pre-condition:  source is an open file, where a  line contains 
  ;                     state name and median family incomes
  ;Post-condition:  state is printed if median income decreased
    (let* ((line (read-line-string source))
           (state-name (substring line 0 21))
           (number-str (substring line 21 (string-length line))))
       (let loop ((base-number (string->number (get-word number-str)))
                  (rest (chop-word number-str)))
           (let ((next-num-str (get-word rest)))
              (cond ((zero? (string-length next-num-str)) "done")
                    ((< base-number (string->number next-num-str))
                         (display state-name)
                         (newline)
                         "done")
                    (else (loop (string->number next-num-str)
                                (chop-word rest)))
              )
           )
       )
    )
  )
)

In this approach, we must have a way for the main income-decreased procedure to determine when another line is present before calling process-line. This is accomplished with a new Scheme procedure peek-char. The peek-char procedure takes one argument, an input port, and returns the first unread character that can be accessed through that port. It does not actually read that character or extract it from the port -- it just peeks at it to see what it will be when and if it is (subsequently) read in.

If the value returned by (peek-char source) is an end-of-file object, then we know that we're at the end of the file, and we can finish processing.

The revised income-decreased procedure follows:

(define income-decreased
  (lambda (source-file-name)
  ;Pre-condition:  given file contains median state income data by year
  ;Post-condition:  states printed if median income decreased for some year
    (let ((source (open-input-file source-file-name)))  ; Open the input file.
       ;print header and skip a line
       (display "States with Decreased Median From Some Year to the Next") 
       (newline)
       (newline)

       ;skip over first 5 input lines
       (read-line-string source)  
       (read-line-string source)  
       (read-line-string source)  
       (read-line-string source)  
       (read-line-string source)

       ;begin line-by-line processing
       (let loop1 ((ch (peek-char source)))   ; Read next line
         (if (eof-object? ch)                 ; If you get the eof-object,
             (close-input-port source)        ; then close the input file
             (begin
                (process-line source)         ; otherwise, scan the line
                (loop1 (peek-char source)))   ; and repeat
         )
       )
    )
  )
)
  1. Why is peek-char needed in this version of income-decreased? What happens if peek-char is replaced by read-char? Explain.


In developing the solution to Problem 1C, we identified several useful procedures, including read-line-string, get-word chop-word, and peek-char. With minor changes in process-line and income-decreased, a similar approach may be used to solve a wide range of file-based problems. The following provides another example:

Problem 2:

The file /home/walker/151s/labs/ia-cities.dat contains information about the sixty largest cities and towns in Iowa: their names and populations, as determined by the 1990 and 1980 censuses. A portion of the file looks like this:


Fairfield         9768    9428
Fort Dodge       25894   29423
Fort Madison     11614   13520
Grinnell          8902    8868
Independence      5972    6392
Indianola        11340   10843
Iowa City        59735   50508

To clarify this format, each line of the file contains information about a specific town or city. Within a line, the city name appears left-justified within the first 16 characters of the line, the population from the 1990 census appears right justified in columns 17 through 22, and the population from the 1980 census appears in columns 25-30. Columns 23 and 24 are always blank.

Write a program which reads each city name and which determines the percentage increase or decrease of the population from 1980 to 1990. Print the results in a table as follows:


                Percent
City             Change

.
.
.

Fairfield       3.61
Fort Dodge      -11.99
Fort Madison    -14.1
Grinnell        0.38

Solution Outline: Since this problem asks for results to be printed to the screen rather than placed in a file, a solution need not consider an output file. Otherwise, a general solution can the same overall form as used previously for text files. The pseudocode follows:

  1. Open the input file.
  2. Print titles for the table.
  3. Until the end of the file is reached:
    1. Read a line of input as a string of characters
    2. From the line, extract:
      1. city name (first 16 characters)
      2. 1990 census figure (next number)
      3. 1980 census figure (final number)
    3. Compute amount of population change.
    4. Compute percentage population change.
    5. Write the city name and perchange change.
  4. Close the input file.

Solution in Scheme: The implementation in Scheme requires changes only in process-line and income-decreased. Procedure process-line is revised to follow step 3 of the above outline for a specific city. In income-decrased, we remove the steps to skip header lines, change the printing of a title, and change the procedure's name to the more appropriate compute-city-changes.

The revised procedures follow:

(define process-line
   (lambda (source)
      (let* ((str (read-line-string source))
             (city-name (substring str 0 16))
             (after-city (substring str 17 (string-length str)))
             (pop-1990 (string->number (get-word after-city)))
             (after-90 (chop-word after-city))
             (pop-1980 (string->number (get-word after-90)))
             (pop-change (- pop-1990 pop-1980))
             (percent-change (* 100.0 (/ pop-change pop-1980)))
             (formatted-per-chg (/ (round (* 100.0 percent-change)) 100.0))
            )
        (display city-name)
        (display formatted-per-chg)
        (newline)
      )
   )
)

(define compute-city-changes
   (lambda (source-file-name)
    (let ((source (open-input-file source-file-name)))
      (newline)
      (display "               Percent")   ; Print table titles
      (newline)
      (display "City            Change")
      (newline)

      (let loop1 ((ch (peek-char source))) ; Peek at the next character.
        (if (eof-object? ch)               ; If you get the eof-object,
            (close-input-port source)      ; close the input file
            (begin                         ; Otherwise, process line
              (process-line source)
              (loop1 (peek-char source)))
        )
      )
    )
  )
)
  1. Check that the resulting program works correctly by using the command:

    (compute-city-changes "/home/walker/151s/labs/ia-cities.dat")
    
  2. Within process-line, the computation for city-name uses the substring procedure, rather than get-word. Could get-word be used instead? Briefly justify your answer.

  3. Write a brief description of how process-line handles the computations for a city. Be sure your discussion includes a comment about the computation for variable formatted-per-chg.

  4. Modify compute-city-changes and/or process-line, so that, for each city, it prints the city, the 1990 and 1980 census figures, and the percentage population change.

    Hint: This is extremely easy!

  5. Modify compute-city-changes and/or process-line, so that the program prints the names of only those cities whose populations increased between 1980 and 1990.

Work to be turned in:


This document is available on the World Wide Web as

http://www.math.grin.edu/~walker/courses/151.sp00/lab-files-as-lines.html

created April 4, 1999 by Clif Flynt and Henry M. Walker
last revised February 29, 2000 by Henry M. Walker