Last Updated: 2024-04-21 Sun 20:46

CMSC216 HW12: Binary I/O, Memory Mapping and pmap

CODE DISTRIBUTION: hw12-code.zip

CHANGELOG: Empty

1 Rationale

Files are often stored in "binary format" for efficiency of storage and access. Rather than more familiar formatted text formats, these formats require use of binary file I/O to manipulate them, frequently low level Unix read() / write() calls. They also often require jumping to different positions in the file which can be done via the lseek() system call. These are explored in this HW.

A viable alternative to file I/O is to make use of memory mapped files through mmap(). This utilizes a system call to expose files as a pointer into operating system managed space which holds parts of the file in main memory. While equivalent in power to standard I/O, mmap() avoids the need for intermediate buffers and allows pointer arithmetic to be used to locate and alter the file.

On modern computing systems, virtual memory creates the illusion that every program has a linear address space from 0 to some large address. Mostly this happens behind the scenes and is managed by the operating system but knowledge of presence of virtual addresses provides insight into many aspects of practical programming. One can inspect some of the OS information on the virtual address space of a program using utilities such as pmap.

Associated Reading / Preparation

Bryant and O'Hallaron Ch 10 covers basic I/O functions like read() / write() as well as lseek(). These functions work equally as well for text and binary data.

Bryant and O'Hallaron: Ch 9 on Virtual Memory is informative for it's coverage virtual memory in general. The mmap() function is discussed in section 9.8.4. The overview of virtual memory is useful to understand the output of pmap.

Grading Policy

Credit for this HW is earned by taking the associated HW Quiz which is linked under Gradescope. The quiz will ask similar questions as those that are present in the QUESTIONS.txt file and those that complete all answers in QUESTIONS.txt should have no trouble with the quiz.

Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.

See the full policies in the course syllabus.

2 Codepack

The codepack for the HW contains the following files:

File Description
QUESTIONS.txt Questions to answer
binfiles-mmap/ Directory for Problems 1-2
Makefile Makefile to build Problem 1-2 programs
department.h Header file for programs
make_dept_directory.c Problem 1-2 program to create data file
cse_depts.dat.bk Backup of data file created in Problem 1-2
print_department_read.c Problem 1 program to analyze
print_department_mmap.c Problem 2 program to analyze
memory-parts/ Directory for Problem 3
Makefile Makefile to build programs for the HW
memory_parts.c Problem 3 program to analyze
gettysburg.txt Problem 3 data file

3 Questions

Analyze the files in the provided codepack and answer the questions given in QUESTIONS.txt.

                           _________________

                            HW 12 QUESTIONS
                           _________________


Write your answers to the questions below directly in this text file to
prepare for the associated quiz. Credit for the HW is earned by
completing the associated online quiz on Gradescope.


PROBLEM 1: Binary File Format w/ Read
=====================================

(A)
~~~

  Compile all programs in the directory `binfiles/' with the provided
  `Makefile'.  Run the command
  ,----
  | ./make_dept_directory cse_depts.dat
  `----

  to create the `cse_depts.dat' binary file. Examine the source code for
  this program along with the header `department.h'.
  - What system calls are used in `make_dept_directory.c' to create this
    file?
  - How is the `sizeof()' operator used to simplify some of the
    computations in `make_dept_directory.c'?
  - What data is in `cse_depts.dat' and how is it ordered?


(B)
~~~

  Run the `print_department_read' program which takes a binary data file
  and a department code to print.  Show a few examples of running this
  program with the valid command line arguments. Include in your demo
  runs that
  - Use the `cse_depts.dat' with known and unknown department codes
  - Use a file other than `cse_depts.dat'


(C)
~~~

  Study the source code for `print_department_read' and describe how it
  initially prints the table of offsets shown below.
  ,----
  | Dept Name: CS Offset: 104
  | Dept Name: EE Offset: 2152
  | Dept Name: IT Offset: 3688
  `----
  What specific sequence of calls leads to this information?


(D)
~~~

  What system call is used to skip immediately to the location in the
  file where desired contacts are located? What arguments does this
  system call take? Consult the manual entry for this function to find
  out how else it can be used.


PROBLEM 2: mmap() and binary files
==================================

  An alternative to using standard I/O functions is "memory mapped"
  files through the system call `mmap()'. The program
  `print_department_mmap.c' provides the functionality as the previous
  `print_department_read.c' but uses a different mechanism.


(A)
~~~

  Early in `print_department_mmap.c' an `open()' call is used as in the
  previous program but it is followed shortly by a call to `mmap()' in
  the lines
  ,----
  |   char *file_bytes =
  |     mmap(NULL, size, PROT_READ, MAP_SHARED,
  |          fd, 0);
  `----
  Look up reference documentation on `mmap()' and describe some of the
  arguments to it including the `NULL' and `size' arguments. Also
  describe its return value.


(B)
~~~

  The initial setup of the program uses `mmap()' to assign a pointer to
  variable `char *file_bytes'.  This pointer will refer directly to the
  bytes of the binary file.

  Examine the lines
  ,----
  |   ////////////////////////////////////////////////////////////////////////////////
  |   // CHECK the file_header_t struct for integrity, size of department array
  |   file_header_t *header = (file_header_t *) file_bytes; // binary header struct is first thing in the file
  `----

  Explain what is happening here: what value will the variable `header'
  get and how is it used in subsequent lines.


(C)
~~~

  After finishing with the file header, the next section of the program
  begins with the following.
  ,----
  |   ////////////////////////////////////////////////////////////////////////////////
  |   // SEARCH the array of department offsets for the department named
  |   // on the command line
  | 
  |   dept_offset_t *offsets =           // after file header, array of dept_offset_t structures
  |     (dept_offset_t *) (file_bytes + sizeof(file_header_t));
  | 
  `----

  Explain what value the `offsets_arr' variable is assigned and how it
  is used in the remainder of the SEARCH section.


(D)
~~~

  The final phase of the program begins below
  ,----
  |   ////////////////////////////////////////////////////////////////////////////////
  |   // PRINT out all personnel in the specified department
  |   ...
  |   contact_t *dept_contacts = (contact_t *) (file_bytes + offset);
  `----
  Describe what value `dept_contacts' is assigned and how the final
  phase uses it.


PROBLEM 3: Virtual Memory and pmap
==================================

  Code for this problem is in the `memory-parts' subdirectory.


(A) memory_parts memory areas
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Examine the source code for the provided `memory-parts/memory_parts.c'
  program. Identify what region of program memory you expect the
  following variables to be allocated into:
  - global_arr[]
  - stack_arr[]
  - heap_arr


(B) Running memory_parts and pmap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Compile the `memory_parts' using the provided Makefile.
  ,----
  | > make memory_parts
  `----
  Run the program and note that it prints several pieces of information
  - The addresses of several of the variables allocated
  - Its Process ID (PID) which is a unique number used to identify the
    running program. This is an integer.
  For example, the output might be
  ,----
  | > ./memory-parts
  | 0x5605a7c271e9 : main()
  | 0x5605a7c2a0c0 : global_arr
  | 0x7ffe5ff7d600 : stack_arr
  | 0x5605a92442a0 : heap_arr
  | 0x7f1fa7303000 : mmap'd file
  | 0x600000000000 : mmap'd block1
  | 0x600000001000 : mmap'd block2
  | my pid is 8406
  | press any key to continue
  `----
  so the programs PID is 8406

  The program will also stop at this point until a key is pressed. DO
  NOT PRESS A KEY YET.

  Open another terminal and type the following command in that new
  terminal.
  ,----
  | > pmap THE-PID-NUMBER-THAT-WAS-PRINTED-EARLIER
  `----

  Paste the output of pmap below.


(C) Program Addresses vs Mapped Addresses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  pmap prints out the virtual address space table for the program. The
  leftmost column is a virtual address mapped by the OS for the program
  to some physical location.  The next column is the size of the area of
  memory associated with that starting address. The 3rd column contains
  permissions of the program has for the memory area: r for read, w for
  read, x for execute. The final column is contains any identifying
  information about the memory area that pmap can discern.

  Compare the addresses of variables and functions from the paused
  program to the output. Try to determine the virtual address space in
  which each variable resides and what region of program memory that
  virtual address must belong to (stack, heap, globals, text).  In some
  cases, the identifying information provided by pmap may make this
  obvious.


(D) Min Size of Mapped Areas
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The minimum size of any virtual area of memory appears to be 4K. Why
  is this the case?


(E) Additional Observations
~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Notice that in addition to the "normal" variables that are mapped,
  there is also an entry for the mmap()'d file 'gettysburg.txt' in the
  virtual address table.  The mmap() function is explored in the next
  problem but note its calling sequence which involves use of a couple
  system calls:
  1. `open()' which is a low level file opening call which returns a
     numeric file descriptor.
  2. `fstat()' which obtains information such as size for an open file
     based on its numeric file descriptor. The `stat() / fstat()' system
     calls are used to ask the Unix Operating System information about
     files such as their size, modification times, and access
     permissions.  This system call is studied more in Operating System
     courses.

  Finally there are additional calls to `mmap()' which allocate memory
  to the program at a specific virtual address. Similar code to this is
  often used to allocate and expand the heap area of memory for programs
  in implementations of `malloc()'.

Author: Chris Kauffman (kauffman@umn.edu)
Date: 2024-04-21 Sun 20:46