InfoVis 2003 Contest

Background information about the File System and Usage Log data

Description

This tree represents the file system of the University of Maryland Computer Science Department website. Only the public files that are accessible via the web are included. Leaf nodes are files; subtrees are directories. Each node has a set of attributes:
name
The file or directory name.
hitCount
The number of times this file was accessed via the web during a one-week period.
size
The size in bytes of the file.
uid
The User Id of that file, representing the user owning that file.
gid
The Group Id of that file, representing the Unix group the file belongs to.
mtime
The last modification time of that file, in standard Unix representation (number of seconds since January 1st, 1970).
ctime
The creation time of that file.
title
The contents of the TITLE tag contained in that file, if appropriate and present.
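
As an illustration of the attribute list above, here is a minimal Python sketch of one way a node could be held in memory. The class name `Node`, the `children` field, and the helper `unix_to_datetime` are assumptions made for this example; they are not part of the contest data format. The sample values are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Node:
    """One file or directory, carrying the attributes listed above."""
    name: str
    hitCount: int = 0
    size: int = 0
    uid: int = 0
    gid: int = 0
    mtime: int = 0   # seconds since January 1st, 1970 (Unix time)
    ctime: int = 0
    title: Optional[str] = None
    # Assumed field: child nodes of a directory; left empty for a file.
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        """True when the node has no children (a file, in this sketch)."""
        return not self.children


def unix_to_datetime(seconds: int) -> datetime:
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Made-up example node, for illustration only.
page = Node(name="index.html", hitCount=12, size=2048, mtime=1041379200)
print(unix_to_datetime(page.mtime).isoformat())  # 2003-01-01T00:00:00+00:00
```

Converting mtime and ctime to dates in this way makes questions such as "which pages were created recently?" easier to express.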

Specific tasks

Questions about the file system itself: Where are the big directories? Can you see different patterns in the files, e.g. can you tell personal pages, class pages, and research project pages apart? Were a lot of pages created recently, and in which part of the file system? Are the newer directories bigger than the older projects? When was the page giving directions to the department last updated?

Questions about usage: Which are the popular webpages? Are some labs more popular than others? Which areas are getting more popular? Less popular? Are new pages more popular than old pages? Which old pages are popular? What proportion of the pages are never used? Seldom used?
NOTE: If you really have to use a subset of the tree because you cannot handle so many nodes, work on the "HCIL" subtree (i.e. everything under /projects/hcil).
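
Some of the questions above reduce to simple tree traversals. The following Python sketch shows how directory sizes, the proportion of never-accessed files, and the selection of the HCIL subtree might be computed. It assumes the tree has already been loaded into nested dictionaries with a `children` list; that in-memory layout, the function names, and the tiny sample tree are all assumptions made for illustration.

```python
def total_size(node):
    """Size of a file, or the summed size of everything under a directory."""
    children = node.get("children", [])
    if not children:
        return node.get("size", 0)
    return sum(total_size(child) for child in children)


def iter_files(node):
    """Yield every leaf (file) node under `node`."""
    children = node.get("children", [])
    if not children:
        yield node
    else:
        for child in children:
            yield from iter_files(child)


def find_subtree(root, path):
    """Follow a path such as '/projects/hcil' down from the root; None if absent."""
    node = root
    for part in path.strip("/").split("/"):
        node = next((c for c in node.get("children", []) if c["name"] == part), None)
        if node is None:
            return None
    return node


def unused_fraction(node):
    """Proportion of files under `node` whose hitCount is zero."""
    files = list(iter_files(node))
    if not files:
        return 0.0
    return sum(1 for f in files if f.get("hitCount", 0) == 0) / len(files)


# Tiny made-up tree, for illustration only.
root = {"name": "", "children": [
    {"name": "projects", "children": [
        {"name": "hcil", "children": [
            {"name": "index.html", "size": 4000, "hitCount": 25},
            {"name": "old.html", "size": 1000, "hitCount": 0},
        ]},
    ]},
    {"name": "index.html", "size": 500, "hitCount": 90},
]}

hcil = find_subtree(root, "/projects/hcil")
print(total_size(hcil))       # 5000
print(unused_fraction(hcil))  # 0.5
```

In this sketch a directory with no children is treated like a file of its own recorded size, which may or may not suit the real data.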

How the datasets were generated

The URLs of the files accessible at the University of Maryland were found with the wget program, using the following command: wget -r -nv -ofile.log --remove-after --domains=www.cs.umd.edu http://www.cs.umd.edu/

Accessible URLs were extracted from the generated log file. Each URL was then translated to a file name according to the rules of the web server.
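
The extraction and translation steps might look roughly like the Python sketch below. The actual server rules are not given here, so everything specific in it is an assumption: the `URL:` marker in log lines, the document root `/var/www`, the per-user `public_html` directory, the `index.html` default document, and the sample log lines are all hypothetical.

```python
import re
from urllib.parse import urlsplit, unquote

# Assumption: each fetched page appears in the log as "URL:<address> ...".
URL_PATTERN = re.compile(r"URL:(\S+)")

# Hypothetical server rules, NOT taken from the contest description.
DOCUMENT_ROOT = "/var/www"   # assumed root for ordinary pages
USER_DIR = "public_html"     # assumed per-user web directory


def extract_urls(log_lines):
    """Collect the distinct URLs mentioned in the log, in first-seen order."""
    seen = {}
    for line in log_lines:
        match = URL_PATTERN.search(line)
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)


def url_to_path(url):
    """Translate a URL to a file name under the assumed rules above."""
    path = unquote(urlsplit(url).path) or "/"
    if path.endswith("/"):
        path += "index.html"  # assumed default document
    if path.startswith("/~"):
        user, _, rest = path[2:].partition("/")
        return f"/home/{user}/{USER_DIR}/{rest}"
    return DOCUMENT_ROOT + path


# Made-up log lines, for illustration only.
sample_log = [
    '12:00:01 URL:http://www.cs.umd.edu/ [1024] -> "www.cs.umd.edu/index.html" [1]',
    '12:00:02 URL:http://www.cs.umd.edu/~alice/cv.html [2048] -> "cv.html" [1]',
    'some unrelated line',
]

for url in extract_urls(sample_log):
    print(url, "->", url_to_path(url))
```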

Each of these files was then scanned to produce the attributes stored in the XML files, and the access data from the web logs was added.
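
A scan of this kind could be sketched in Python as follows. This is not the script that produced the datasets; the function name `scan_file`, the regular expression used to find the TITLE tag, the file-extension check, and the `hit_count` parameter are assumptions made for illustration.

```python
import os
import re

# Assumed pattern for the TITLE tag; real HTML may need a proper parser.
TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def scan_file(path, hit_count=0):
    """Gather the attributes described earlier for one file."""
    info = os.stat(path)
    title = None
    if path.lower().endswith((".html", ".htm")):
        # latin-1 decodes any byte sequence, so reading never fails on encoding.
        with open(path, encoding="latin-1") as handle:
            match = TITLE_PATTERN.search(handle.read())
        if match:
            title = " ".join(match.group(1).split())  # collapse whitespace
    return {
        "name": os.path.basename(path),
        "hitCount": hit_count,        # would come from the web access log
        "size": info.st_size,
        "uid": info.st_uid,
        "gid": info.st_gid,
        "mtime": int(info.st_mtime),
        "ctime": int(info.st_ctime),  # on Unix, os.stat reports the last metadata change here
        "title": title,
    }
```

Calling `scan_file("index.html", hit_count=12)` on an existing file would return a dictionary with one entry per attribute.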

Because this process was run only once a week, some files had been removed or moved on the file system before the scan was performed, and no reliable information is available for them.
