InfoVis 2004 Contest
WilmaScope Graph Visualisation

Contest webpage:

Authors and Affiliations:

Adel Ahmed, University of Sydney, aahmed@it.usyd.edu.au
Tim Dwyer, University of Sydney, dwyer@it.usyd.edu.au
Colin Murray, University of Sydney, cmurray@it.usyd.edu.au
Le Song, University of Sydney, lesong@it.usyd.edu.au
Ying Xin Wu, University of Sydney, Christine.Wu@nicta.com.au

Tool(s):

The WilmaScope graph visualisation system was used to find a 3D layout for the graphs generated. Colin Murray implemented a fast, interactive force-directed layout engine for Wilma based on the FADE algorithm [6]. Our extensions to the algorithm provide improved layouts for scale-free networks.

3DS Max 6 was used to generate the 3D images and animations. Adel Ahmed created the scripts for parsing the Wilma XML files and generating the 3DS Max models and animations.

TASK 1: Static Overview of 10 years of Infovis

Process:
Initial visualisations of the InfoVis 2004 contest citation network revealed that it was a scale-free network.
Traditionally, social networks have been modeled as random networks in which node degrees follow a bell-shaped Poisson distribution. However, random network models fail to explain the existence of hubs and cliques, because they do not capture two important factors in the formation of social networks: growth and preferential attachment [1, 2]. As new nodes appear, they tend to connect to nodes that are already well connected, which results in a power-law distribution of node degree. Such a network is called a scale-free network. Scale-free networks are abundant in nature and society, describing systems as diverse as the WWW, the chemical networks of cells and collaboration networks. From the raw contest data we constructed a citation network, a cocitation network and a coauthor network. Their degree distributions show that they conform to the definition of a scale-free network. Hubs have special importance in scale-free networks [3, 4], for example under malicious "attack" [5] (which may take on different meanings in different contexts). In our layouts it is therefore reasonable to emphasize these hubs.
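One way to check whether a degree distribution is heavy-tailed is to tabulate how many nodes have each degree and estimate the slope of that histogram on log-log axes. The sketch below is illustrative only: the function names are hypothetical, the edge list is assumed to be a list of node-id pairs, and a least-squares fit on log-log data is a crude estimate rather than a rigorous power-law test. It is not taken from WilmaScope.

```python
import math
from collections import Counter


def degree_histogram(edges):
    """Map each degree k to the number of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(degree).

    Needs at least two distinct degrees. A slope near -gamma is
    consistent with a power law P(k) ~ k^(-gamma).
    """
    points = [(math.log(k), math.log(c)) for k, c in histogram.items()]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

A Poisson-like distribution would show a sharply curved histogram on log-log axes, whereas a scale-free network gives a roughly straight line.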
To visualise this graph we used a "Two and a Half Dimensional" approach, constraining nodes to lie in layers (or parallel planes). Each node in the citation network represents one paper. The assignment of nodes to layers was done as follows:
- nodes with incoming degree (number of citations by other papers) >=20 were assigned to the top most layer
- nodes with incoming degree 10-19 were assigned to the next layer down.
- nodes with incoming degree < 10 (all other papers) were assigned to the bottom layer.
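The layer assignment above can be sketched as a small function. The name `assign_layers` and the input shapes (a list of paper ids and a list of (citing, cited) pairs) are assumptions made for illustration; layer 0 denotes the top-most layer.

```python
from collections import Counter


def assign_layers(papers, citations):
    """Return {paper_id: layer}, where layer 0 is the top-most layer.

    papers:    iterable of paper ids
    citations: iterable of (citing_id, cited_id) pairs
    """
    in_degree = Counter(cited for _, cited in citations)
    layers = {}
    for paper in papers:
        cited_by = in_degree[paper]  # Counter returns 0 for uncited papers
        if cited_by >= 20:
            layers[paper] = 0
        elif cited_by >= 10:
            layers[paper] = 1
        else:
            layers[paper] = 2
    return layers
```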
To find a suitable layout for this reasonably large graph (with XXX nodes) in interactive time, we implemented a force-directed algorithm that uses a spatial decomposition method [6] to reduce the cost of calculating the repulsive forces between every pair of nodes (normally O(n^2) time per iteration).
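To illustrate why spatial decomposition helps, the sketch below bins nodes into a uniform grid and only computes repulsion between nodes in the same or neighbouring cells. This is not the FADE algorithm and not the WilmaScope implementation; it is a deliberately simpler 2D stand-in for the same general idea, and the function name, the `cell_size` cutoff and the force law (magnitude `strength / distance`) are all assumptions.

```python
import math
from collections import defaultdict


def grid_repulsion(positions, cell_size, strength=1.0):
    """Approximate pairwise repulsion using a uniform grid (2D).

    Only nodes in the same or neighbouring cells repel each other, so
    distant pairs are never visited. Returns one [fx, fy] per node.
    """
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        grid[cell].append(i)

    forces = [[0.0, 0.0] for _ in positions]
    for (cx, cy), members in grid.items():
        neighbours = [
            j
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for j in grid.get((cx + dx, cy + dy), ())
        ]
        for i in members:
            xi, yi = positions[i]
            for j in neighbours:
                if i == j:
                    continue
                ddx, ddy = xi - positions[j][0], yi - positions[j][1]
                dist_sq = ddx * ddx + ddy * ddy
                if dist_sq == 0.0:
                    continue  # coincident nodes: direction undefined, skip
                # Push i directly away from j; magnitude is strength / distance.
                forces[i][0] += strength * ddx / dist_sq
                forces[i][1] += strength * ddy / dist_sq
    return forces
```

When nodes are spread fairly evenly, each node is compared with only a handful of neighbours, so an iteration costs far less than the all-pairs O(n^2) computation. The price is that long-range repulsion is ignored entirely, which hierarchical methods avoid by approximating distant groups of nodes instead of dropping them.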

[1] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science 286, 509-512 (1999).
[2] A.-L. Barabási, E. Ravasz and T. Vicsek. Deterministic scale-free networks. Physica A 299, 559-564 (2001).
[3] Albert-László Barabási and Zoltán N. Oltvai. Network Biology: Understanding the Cell's Functional Organization. Nature Reviews Genetics 5, 101-113 (2004).
[4] Albert-László Barabási and E. Bonabeau. Scale-free Networks. Scientific American 288, 60-69 (2003).
[5] R. Albert, H. Jeong, and A.-L. Barabási. Error and attack tolerance in complex networks. Nature 406, 378 (2000).
[6] A. Quigley. Large Scale Relational Information Visualization, Clustering, and Abstraction. PhD thesis, University of Newcastle, Australia, 2001.

Image 1.1 :
Insight:
The image shows the entire citation network. Hubs, or high-degree nodes, are constrained to lie on the higher layers, and the force-directed layout causes these highly connected nodes to be reasonably centred, resulting in the "volcano"-like effect. The zoomed section identifies 7 of the most highly cited papers.
Caption for exhibit:
The image shows the entire citation network. Hubs, or high-degree nodes, are constrained to lie on the higher layers, and the force-directed layout causes these highly connected nodes to be reasonably centred, resulting in the "volcano"-like effect. The zoomed section identifies 7 of the most highly cited papers.

Other Images (optional):
Image 1.2 :
Insight:
Each "worm" in this image shows the evolution of a different research area. Each level corresponds to the papers published in a particular year (1993 at the bottom, 2004 at the top). Edges indicate citations from papers in one research area in a particular year to papers in another research area. The thickness of each worm segment indicates the number of papers published in the corresponding year in that research area. Worms lean and bend towards each other when they are strongly coupled (there are many citations between them); otherwise they bend apart.
Image 1.3:
Insight:

In our movie we show an extensive tour of our various views of the co-author network, focusing on these cliques and clusters and highlighting prominent authors.

TASK 2: Characterize the research areas and their evolution

Process:
The first step was classifying the different research areas. We implemented a spherical self-organizing map (SOM) to cluster the papers. The vector data used to train the SOM were calculated using three different methods:
- The first method classified papers by word histogram. Each document was transformed into a vector based on the frequency of words in its title, abstract and keywords. We used a stemmer to reduce words to their morphological roots and a stop-word list to eliminate uninformative words such as "the" and "some". We then eliminated the most popular words ("information", "visualization", "data", etc.) and words appearing fewer than 5 times in total. This left 827 words, so each vector was 827-dimensional. Some papers were also removed because none of their words remained after this preprocessing. To increase accuracy, we weighted each word by term frequency / inverse document frequency (TF-IDF). The clustering was then done by a 42-neuron spherical self-organizing map.
- Based on the word histogram matrix, we also tried clustering by latent semantic analysis. We used Matlab to find the most important singular values and singular vectors of the word histogram matrix. Projecting the original vectors onto the 200 most important singular vectors reduced the vector dimension from 827 to 200, which sped up SOM training significantly. However, the clustering results were not as good as those of the first method.
- The last method trained the SOM on a cocitation matrix. Again each paper was represented by a vector, whose elements were the number of times the paper was cocited with each other paper. However, the clustering produced by the SOM was not satisfactory.
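The word-histogram preprocessing of the first method can be sketched roughly as follows. This is a simplified illustration, not our actual pipeline: stemming is omitted (tokens are assumed to be stemmed already), `STOP_WORDS` is a tiny illustrative subset, and the function name, parameters and the specific weighting `tf * log(N / df)` are assumptions, since several TF-IDF variants exist.

```python
import math
from collections import Counter

# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"the", "some", "a", "of", "and"}


def tfidf_vectors(docs, min_total=5, too_common=()):
    """Turn documents into TF-IDF weighted word-histogram vectors.

    docs:       list of strings (title + abstract + keywords, pre-stemmed)
    min_total:  drop words appearing fewer than this many times overall
    too_common: extra words to drop, e.g. ("information", "data")
    Returns (vocabulary, vectors), one vector per document.
    """
    skip = STOP_WORDS | set(too_common)
    tokenised = [[w for w in d.lower().split() if w not in skip] for d in docs]

    totals = Counter(w for tokens in tokenised for w in tokens)
    vocab = sorted(w for w, count in totals.items() if count >= min_total)

    # Document frequency: number of documents containing each word.
    doc_freq = Counter(w for tokens in tokenised for w in set(tokens))
    n_docs = len(docs)

    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        vectors.append(
            [term_freq[w] * math.log(n_docs / doc_freq[w]) for w in vocab]
        )
    return vocab, vectors
```

Documents whose vector is all zeros after this step carry no usable words and would be dropped before training, as described above.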
Using the first of the above methods we identified the following research areas:
- Database Query and Data Mining
- System Design
- Web Data
- Interaction
- Graph Drawing
- Focus + Context Techniques
- Software Visualization
- Hierarchy
- Multidimensional Data Analysis
- Trees
- Text and Image Information Retrieval
We then created a graph (network) of citations between research areas in different years. That is, we created a graph G=(V,E) where each node (u in V) represents all the papers published in a particular year in one of the above research areas and each directed edge ((u,v) in E) indicates a citation by a paper in u of a paper in v.
Before visualising this graph we created additional edges between pairs of nodes representing sequential years in the same research area. Thus all nodes in each research area formed a chain ("worm"), ordered by year of publication. An embedding in 3D space was found for the graph using a force directed method, with edges in each worm having much stiffer springs than the springs representing citation edges. The positions of the nodes are constrained, such that all nodes for a given year must lie in the same plane. The planes corresponding to each year are arranged parallel to one another, and perpendicular to the z-axis, such that the top most plane (when viewed from the side) corresponds to the most recent year (2004) and the bottom most corresponds to the earliest (1993 and earlier).
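The construction of this graph can be sketched as below. The function name and input shapes are assumptions for illustration: `papers` maps a paper id to an (area, year) pair and `citations` is a list of (citing, cited) id pairs. Years before 1993 are pooled into the 1993 node, matching the bottom-most plane described above, and a node is created for every year in every worm so that each chain is unbroken.

```python
from collections import Counter


def build_worm_graph(papers, citations, first_year=1993, last_year=2004):
    """Aggregate papers into (area, year) nodes and build both edge sets.

    Returns (node_sizes, citation_edges, worm_edges):
      node_sizes:     {(area, year): number of papers}
      citation_edges: {(citing_node, cited_node): number of citations}
      worm_edges:     list of ((area, year), (area, year + 1)) chain links
    """
    def node_of(paper_id):
        area, year = papers[paper_id]
        return (area, max(year, first_year))  # pool 1993 and earlier

    node_sizes = Counter(node_of(p) for p in papers)

    citation_edges = Counter()
    for citing, cited in citations:
        if citing in papers and cited in papers:
            citation_edges[(node_of(citing), node_of(cited))] += 1

    areas = sorted({area for area, _ in papers.values()})
    worm_edges = [
        ((area, year), (area, year + 1))
        for area in areas
        for year in range(first_year, last_year)
    ]
    return node_sizes, citation_edges, worm_edges
```

In a force-directed layout the `worm_edges` would be given much stiffer springs than the `citation_edges`, and the edge counts can drive the arrow radius and colour intensity.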
Finally, the scene was rendered with spherical nodes of radius proportional to the number of papers published in the corresponding research area and year. Edges linking adjacent nodes in each worm are represented by cylinders with end radii matching that of their end nodes, creating the smooth "worm" effect. The research area worms are given distinctive colours, uniform within each worm, except where no paper was published in a particular year. In this last case the worm segment is given a darker shade. Citation edges between worms were represented with a 3D arrow of radius and colour intensity proportional to the number of citations represented.
In our interactive tool (WilmaScope) we use a cross-sectional viewer to study citations from individual years. In the movie (produced with 3DS Max), we show an animation, stepping through the citation edges emanating from a particular year. This revealed an interesting problem in the data set, namely citations from earlier papers to newer papers. We discuss this in more detail below.

Image 2.1:
Insight 2.1:
This first layer in our worm visualisation shows citations from papers in each of the 11 research areas, to papers in other research areas published in 1993 and before. The relative size of the nodes indicates that "Database Query and Data Mining" and "System Design" are the most popular research areas at this time (most papers in each). A small distance between nodes, determined by the force directed algorithm attempting to minimise the edge lengths of citation edges, tends to indicate research areas that are more frequently referenced by other research areas. For example, the fat edge from "Software Visualization" to "Graph Drawing" indicates a strong link between these two fields. Papers in "Multidimensional Data Analysis" seem to be cited by many different fields. In contrast "Text and Image Information Retrieval" seems to be a relative outlier, only occasionally cited by papers in the "Multidimensional Data Analysis" and "Database Query and Data Mining" research areas.
Image 2.2:
Insight 2.2:
We now add another layer of nodes showing papers published by research area in 1994. We also add edges for citations made in 1994 and remove the edges associated with earlier citations. We can see in general that there are fewer papers published in this category. This is to be expected since the previous layer included papers not only from 1993 but also all earlier papers back to the 1980s. Interaction is still the most popular research area, and the small dimmed node for "Web Data" (at the left rear) indicates that there were no papers in the data-set for this research area in 1994.
Image 2.3:
Insight 2.3:
In 1995 "Text and Image Information Retrieval" has increased significantly in popularity and "Database Query and Data Mining" somewhat less so. Citations from "Software Visualization" papers seem to be leaning (literally, the worm is leaning over) towards "System Design" and "Focus + Context Techniques".
A strange phenomenon has also become evident. A number of upward pointing edges have appeared. These can only be citations of future papers! Closer investigation revealed that many of these were due to older papers being reprinted in compilations, with the year of publication set to that of the more recently published volume. For example:
S. G. Eick and G. J. Wills. "Navigating large networks with hierarchies". Proceedings Visualization'93, IEEE, October 1993.
was reprinted in "Readings in Information Visualization: Using Vision to Think" in 1999, and this is the definitive entry for article id="acm300752", even though this id is cited by papers from as early as 1994.
This struck us as an example of the power of visualisation. The problem in the data-set was made blatantly obvious by the visualisation, but had we been simply performing textual queries on a database it may not have become evident unless we specifically designed a query to look for it.
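For comparison, a query designed specifically to look for this problem might resemble the sketch below. The function name and the input shapes (a dictionary of publication years and a list of (citing, cited) id pairs) are assumptions for illustration.

```python
def future_citations(years, citations):
    """Return citations whose cited paper is dated later than the citing paper.

    years:     {paper_id: publication year as recorded in the data-set}
    citations: iterable of (citing_id, cited_id) pairs
    """
    return [
        (citing, cited)
        for citing, cited in citations
        if citing in years and cited in years and years[cited] > years[citing]
    ]
```

Such a check is trivial to write once the problem is suspected; the point is that the visualisation exposed it without anyone having to think of asking.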
Image 2.4:
Insight 2.4:
Our movie shows a smooth animation of every successive year but, for brevity's sake, we skip ahead to 2001. Notably, our own favourite area, "Graph Drawing", and the related field "Trees" have shown an increase in interest. Indeed, healthy cross-citation between these two fields is evident. The "Hierarchy" worm is leaning back towards "Trees" due to a number of citations. By contrast, "Interaction" has waned, with no publications in 2001; its worm (orange) is noticeably leaning outward, with few citations of more recent papers.
Image 2.5:
Insight 2.5:
Finally, we skip ahead to the most recent entries in the data-set, from 2004. Understandably, not much is available for 2004, with only one paper in the "Trees" area. However, the data-set also seems quite incomplete for 2003.

TASK 3: The people in InfoVis

Task 3.1: Where does a particular author/researcher fit within the research areas defined in task 2?

Process:
For this task we defined a new "coauthor" graph with nodes representing authors and edges created between author nodes whenever they appeared together in the list of authors for a particular paper. One edge can represent a number of coauthor relationships between two authors, where a weight is assigned to the edge according to the number of papers represented. A second set of nodes in this graph is added for each of the research areas, with an edge created between each author and each of the research areas in which they have published papers. The weight of this second type of edge indicates the number of papers they have published.
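A minimal sketch of this graph construction follows. The function name and the input shape (a list of (author list, research area) pairs, one per paper) are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def coauthor_graph(papers):
    """Build weighted coauthor edges and author-to-area edges.

    papers: iterable of (authors, area), where authors is a list of names.
    Returns (coauthor_edges, author_area_edges):
      coauthor_edges:    {(author_a, author_b): papers written together}
      author_area_edges: {(author, area): papers by author in that area}
    """
    coauthor_edges = Counter()
    author_area_edges = Counter()
    for authors, area in papers:
        unique = sorted(set(authors))
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(unique, 2):
            coauthor_edges[(a, b)] += 1
        for author in unique:
            author_area_edges[(author, area)] += 1
    return coauthor_edges, author_area_edges
```

The coauthor degree of an author is then the number of distinct edges in `coauthor_edges` that include that author.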
To visualise this graph we, again, used a "Two and a Half Dimensional" approach, constraining nodes to lie in layers (or parallel planes). The assignment of nodes to layers was done as follows:
- nodes with coauthor degree (number of coauthor relationships) >=20 were assigned to the top most layer
- nodes with coauthor degree 10-19 were assigned to the next layer down.
- nodes with coauthor degree < 10 (all other authors) were assigned to the layer below that
- the 11 nodes representing research areas (clusters) were assigned to the lowest layer
Edges connecting author nodes to research area nodes were rendered as spikes on the surface of the author node, pointing towards the research area node. The colour of each spike matches the colour of the research area node and the length indicates the number of papers the author has published in that particular research area. For example see Image 3.1.1. Thus, an author's contribution to a specific research area is shown locally, by the various spikes, and globally by the author's position relative to the research area node. Initially we showed this relationship with an edge rendered as a line between the author and research area nodes, but we felt that the spike showed the same information with less clutter.
We arranged this layered graph in two styles. The first uses planar layers, producing a "mountain range" style effect. Important, high-degree nodes literally stand out above less well connected nodes.
The alternative layout arranges the layers as concentric spherical shells, with the important high-degree nodes on the outermost layer. This spherical layout possibly offers another method for navigating the data set, with a natural "focus and context" metaphor. Mostly, however, this was done for aesthetic appeal.
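One simple way to impose either kind of layer constraint is to project each node back onto its layer after every iteration of the force-directed layout. The sketch below shows such a projection step for both styles. It is an assumption about how the constraint could be enforced, not a description of the WilmaScope code; the function names and the `spacing` parameter are hypothetical, and layer 0 is the top-most (planar) or outermost (spherical) layer.

```python
import math


def constrain_planar(pos, layer, spacing=1.0):
    """Keep x and y, and snap z to the plane of the given layer."""
    x, y, _ = pos
    return (x, y, -layer * spacing)


def constrain_spherical(pos, layer, n_layers, spacing=1.0):
    """Rescale pos so it lies on the shell of the given layer.

    Layer 0 gets the largest radius, so high-degree nodes end up
    on the outermost shell.
    """
    radius = (n_layers - layer) * spacing
    x, y, z = pos
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        return (radius, 0.0, 0.0)  # arbitrary direction for a node at the origin
    scale = radius / norm
    return (x * scale, y * scale, z * scale)
```

Because only the component normal to the layer is overwritten, the forces remain free to arrange nodes within each plane or shell.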
Image 3.1.1:
Insight:
Image 3.1.1 shows the counts of papers published by George G. Robertson by research area. Clearly, "interaction" and "focus + context techniques" are his speciality areas, but his writing has a healthy spread across a number of other areas also. Shown this way, the graph is basically equivalent to a histogram. The advantage of this representation is that the spikes can also be displayed on nodes, in the context of a much larger graph with many author nodes, for example see Image 3.1.2.
Image 3.1.2:
Insight:

This is an extreme close-up of G. Robertson, so it's difficult to see much of the rest of the graph. One notable point is that G. Robertson is in the top-most layer, indicating he has published papers with at least 20 other authors.

Task 3.2: What, if any, are the relationships between two or more or all researchers?

Process:
We perform this task by further examining the same coauthor graph used in 3.1.
Image 3.2.1:
Insight:
Moving the camera away from the G. Robertson node we can see a neighbourhood of authors on the same 20+ degree layer who are strongly related to G. Robertson, namely S. Card, J. Mackinlay and P. Pirolli.
Image 3.2.2:
Insight:

In our movie we show an extensive tour of our various views of the author network, focusing on these cliques and clusters and highlighting prominent authors.

OTHER TASKS (optional)

COMMENTS (optional)

