InfoVis 2004 Contest
IN-SPIRE InfoVis 2004 Contest Entry

Contest webpage:

Authors and Affiliations:

Pak Chung Wong, Pacific Northwest National Laboratory, pak.wong@pnl.gov
Beth Hetzler, Pacific Northwest National Laboratory, beth.hetzler@pnl.gov
Christian Posse, Pacific Northwest National Laboratory, christian.posse@pnl.gov
Mark Whiting, Pacific Northwest National Laboratory, mark.whiting@pnl.gov
Susan Havre, Pacific Northwest National Laboratory, sue.havre@pnl.gov
Nick Cramer, Pacific Northwest National Laboratory, nick.cramer@pnl.gov
Anuj Shah, Pacific Northwest National Laboratory, anuj.shah@pnl.gov
Mudita Singhal, Pacific Northwest National Laboratory, mudita.singhal@pnl.gov
Alan Turner, Pacific Northwest National Laboratory, alan.turner@pnl.gov
Jim Thomas, Pacific Northwest National Laboratory, jim.thomas@pnl.gov

Tool(s):

IN-SPIRE

Pacific Northwest National Laboratory

website

TASK 1: Static Overview of 10 Years of InfoVis

Process:
IN-SPIRE can read free-formed ASCII text as well as most of the standard data formats, including XML and HTML. The XML tags in the contest data allow IN-SPIRE to query different combinations of fields to generate visualizations, query topics, and explore evidence. We used an in-house data cleanup tool to standardize the dates recorded in different formats and to mark as symposium papers the documents with source tags that matched any of the eight InfoVis proceedings titles.
IN-SPIRE uses the unstructured text content of documents (e.g., the "abstract" field of the contest dataset) to identify an interesting set of words known as topics or themes that can be used to distinguish clumps of similar documents within the collection. This process is based on the particular word patterns in the collection at hand and does not transfer to other collections. The co-occurrence or lack of co-occurrence of these interesting words and other statistically associated words in documents is used to build a richer thematic meaning for these representative topic words. Commonly appearing words that do not directly contribute to the content—typically prepositions, pronouns, gerunds, etc.—are ignored.
The system uses these topic word and associative patterns to build n-dimensional signature vectors characterizing each document. The vectors are clustered and projected to 2-space to create two visualizations—ThemeView (see Image 1.1) and Galaxy (see Image 1.2). In these two visualization designs, a particular location on the map corresponds to a specific topic found in the repository, and the distance on this map corresponds to semantic similarities: semantically similar papers cluster around their major topics, while papers with different content tend to lie apart in the map.
The above process description on IN-SPIRE is generally applied to all the following tasks.
In Task 1.1, we query only the symposium papers because we want to see the major research areas at the symposium from 1995 to 2002. In Task 1.2, we query both the symposium and reference papers in the analysis to show the contribution of the symposium papers to the entire information visualization community.
IN-SPIRE can display the most salient topics of the most important clusters (in Galaxy) and peaks (in ThemeView), any data fields included in the analysis, and word usage statistics for each document. The topic labels in Images 1.1 and 1.2 are identified by analysts after reviewing the contents of the clusters. Part of this analysis process is shown in Tasks 2 and 3.
The overall shape of Image 1.2 is very similar to Image 1.1. If we flip Image 1.2 alongside the diagonal between lower left to upper right corners, we get almost the same distribution as Image 1.1.
Image 1.1 :
Insight 1.1:
The two highest and brightest peaks are Multidimensional Techniques (top, ~19 papers) and Interactive Techniques (right, 18 papers). Multidimensional Techniques focus more on statistical methods, whereas Interactive Techniques are mostly related to visual designs and interactions. Because of their strengths and dissimilarities, they occupy two of the four major corners in the visualization.
Trees, Graphs, and Hierarchies (~15 papers) are other topics that form a major peak on the left side. The theoretical theme of these papers also draws the Algorithms peak very close to their neighbor.
We also see that the Temporal Studies and Geographic peaks are side by side—an indication of a substantial number of articles (including cartogram and temporal distortion) that share the two topics.
We are not surprised to find that the Focus and Context peak is very close to the Multidimensional Techniques peak. Many symposium papers that address the focus and context problems also discuss multidimensional techniques.
Finally, the field of Database and Query Visualization that covers well-focused areas of data representation, modeling, and management forms a medium-sized peak in the lower right corner of the visualization.
Caption for exhibit:
IN-SPIRE’s ThemeView, a visual-analytic tool developed at the Pacific Northwest National Laboratory, presents the major technical contribution areas of the Information Visualization Symposium in aggregate over its 10-year history, as peaks in an information landscape.

Image 1.2 :
Insight 1.2:
Every major cluster has both green (symposium papers) and white (reference papers) dots. (The white dots look gray in the screen capture.) So the evolution of the symposium papers through the years is consistent with the community overall.
With the addition of more reference papers, we can clearly separate the topics between Graph Drawing (bottom) and Trees and Hierarchies (middle). For example, the original Treemap paper by Johnson and Shneiderman is in the Tree and Hierarchies cluster, which is far from the Graph Drawing cluster. And there is a large population of symposium papers (green dots) that cover both Graph Drawing and Trees. Tamara Munzner’s H3 paper is one.
Multidimensional Techniques (right) and Interactive Techniques (upper right) continue to be two of the major (bright blue) clusters. However, the distance between them is less than the distance in Image 1.1, indicating a common sharing of the two topics in many reference papers.
The very well-focused area of Database and Query Visualization continues to be isolated (top). A signature paper of this area is the Polaris paper by Stolte and Hanrahan.
While the Database and Query Visualization peak stays prominent, the neighboring Temporal Studies and Geographic peaks in Image 1.1 are no longer major topics in Image 1.2, indicating that they are more popular in the symposium than the community overall.
Caption for exhibit:
IN-SPIRE’s Galaxy, a visual-analytic tool developed at the Pacific Northwest National Laboratory, displays the knowledge contributions of the Information Visualization Symposium as a clustered information starfield. Topical clusters are shown and captioned in blue; symposium papers are green dots; reference papers are white stars; and selected symposium papers appear as yellow "supernovae."

TASK 2: Characterize the Research Areas and Their Evolution

Process:
For this analysis, using the full set of InfoVis papers and references, the text characterization is based solely on the abstract and title fields, with no upfront training. We did not use the keyword fields, because many documents do not contain them, and we wanted the text content to determine the similarity, not the human keywording.
Image 2.1:
Insight 2.1:
IN-SPIRE uses statistical word patterns to identify discriminating themes, cluster the documents based on thematic similarity, and automatically generate labeled visualizations as a start for user exploitation. The ThemeView landscape metaphor uses height reinforced with color to show dominant theme combinations. Spatial proximity shows thematic similarity. Here, the ThemeView shows that the dominant themes in this dataset include users, interfaces, techniques, and methods (labels on the highest peaks), with a range of other themes represented, including queries, time, trees, and graphs (labels on lower peaks). The Galaxy view is based on the same spatial layout, where each dot represents a single document. The blue clouds are a 2D version of the peaks in the landscape; the intensity corresponds to theme prominence but is more visually subdued, because the individual documents are the main focus in the Galaxy view. We created a group consisting of the InfoVis papers and colored them yellow, while the gray dots are the references. We see that the papers fall into several clumps corresponding to various dominant theme peaks. We also note some outlier clusters representing reference clumps that are relatively different from the main body of the collection. To explore unlabeled clumps, we can use the Probe tool, which shows not only the main themes but also their relative strength in that area. An arrow shows us exactly which spot in the thematic space is represented by the pop-up window (see video for interaction details).
Additional thoughts: Some capabilities that would help make this initial insight process easier include a label optimization approach to deal with overlapping labels and the ability to take the probe contents and make them "stick," essentially adding a label at any user-indicated spot.
Image 2.2:
Insight 2.2:
(Please see video to watch the interaction of time slicer with the Galaxy and ThemeView visualizations.) The time slicer shows us the distribution of documents over time (we open the time slicer by clicking on its icon in the tool bar). Here the yellow shows again the InfoVis papers, while the black is the references. We can also use the time slicer as a filter to constrain the portion of documents to show. The documents stay in the same locations, and the peaks and labels are recalculated to show the dominant themes corresponding to the documents in that time segment. This tool allows us to see several shifts in theme distribution over time. For example, the combination of years ’74-85 shows a scattering of themes, with the strongest in the algorithms area. In contrast, the years ’93-94 show a strong dominance of user design themes. Walking through time, starting in ’95, we can watch the shift in theme dominance for the InfoVis conference years. Many still show prominence in the user/interface/design region in the upper right, until we reach the years 2001-2002, which have stronger themes related to trees.
Additional thoughts: the current implementation does not provide a capability to directly compare two or more time windows in the visualizations or to watch a smooth transition between them.
Image 2.3:
Insight 2.3:
The outlier clumps in the main Galaxy view, although containing a number of documents, do not show up as high peaks in the landscape. They do not contain any symposium papers, as shown by the lack of yellow. If we use the Select tool to drag over and select them, then open the Document Viewer, we can read the underlying text. As the example above shows, we see that they contain no abstracts; hence the text analysis is based solely on the title. The outlier tool allows users to move empty or less interesting documents aside, so they can see more detail on other documents. We click the down arrow to move these documents down, and then click the Recalculate button to cause the remaining documents to recluster and reproject (see video for interaction). Users may use this approach to progressively converge on a set of interest. In the image above, we show the results after some documents are moved down. The labels show that additional themes, such as context, emerge as theme labels.
Additional thoughts: The initial interaction for moving documents back and forth from the main Galaxy to the Outliers area became cumbersome, especially when the current selection set matched what a user wanted to keep in the Galaxy. We solved this issue by making a Shortcuts pull-down with options to make the Galaxy view reflect the current selection set, the colored dots corresponding to groups, or the full document set. It also allows users to save custom views.
Image 2.4:
Insight 2.4:
Color-coding the InfoVis papers yellow in the Galaxy allows us to see how they relate to the larger body of references. We turn off the blue Theme clouds (item in the Settings menu) and reduce the labels to enhance visibility (a dialog box under Settings allows us to change both the number of labels and the number of words per label). If we use the time slicer to step through the years of the InfoVis conference, we see that the yellow documents are mostly in the upper right area (the user design area), but as time progresses, more seem to appear in the left and lower left. This suggests that the themes present in InfoVis papers have become more diverse over time. Please see video to watch the change over time.
Image 2.5:
Insight 2.5:
IN-SPIRE also allows the user to directly modify the choice of words that have a strong influence over the document clustering. For example, recall the strong clump in the upper right showing that the themes user and users are dominant among those documents. We can select those particular words and remove their influence by designating them as outlier terms. If we then click the Recalculate button, the documents are reclustered and projected. The image above shows the result. The green documents were ones that came from that upper right clump; we see that the green documents now distribute over several clumps. This indicates that the secondary themes among those documents were diverse.
Additional thoughts: There is currently no capability to show where a particular document moves during such reclustering. This would help with fine analysis. Also there is no animation of the transition, which might help the user follow the path of particular documents of interest.
Image 2.6:
Insight 2.6:
IN-SPIRE also allows multiple kinds of queries to let users explore interactions among word presence, thematic similarity, and time trends. To run a Boolean query, users open the query tool, select the scope, and type in the query text. They can make a group from the result. They can also combine multiple groups. Here we have made groups of documents that contain particular words related to popular themes indicated by the labels, separated by whether the documents are InfoVis papers or references. If we click on a group, the member documents are shown in the group color in the Galaxy. In the video, we highlight the different groups, noting that some query groups are more distributed around the Galaxy, such as evaluation, while others are more concentrated, such as the query groups on trees and on algorithms. This shows the correspondence between documents that contain particular words (query group members) and documents with thematic similarity (clusters). It suggests that documents containing the word "trees" tended to be thematically similar to each other, while documents containing the word "evaluation" tended to cover a variety of additional themes. We can also look at these groups over time. In one mode, the time slicer shows all the selected groups stacked to represent their portion of the whole time segment. In the alternate mode, it separates out each group to allow more detailed analysis of the individual trends. For example, we see here that the tree papers group grows over time while the query groups decrease. (Please see video for interaction details.)
Additional thoughts: As the utility of groups has increased, we have needed to continually look at the flexibility and ease of interaction. There is currently no easy way to cross-compare sets of groups, such as to compare all theme groups to all publication sources.

TASK 3: The People in InfoVis

Task 3.1: Where does a particular author/researcher fit within the research areas defined in task 2?

Process:
The initial process of Task 3 is the same as the one described in Task 1. The signature vectors, in this case, are built from the "abstract" and "title" fields of the XML file. The "author" field is used as paper identifier.
Image 3.1.1:
Insight 3.1.1:
Image 3.1.1 displays Robertson’s contribution to the InfoVis symposium on the visualization generated by all symposium papers. In a single view, one can see that (1) Robertson is the third author on the paper and his coauthors are Munzner and Guimbretiere, (2) the topic of his paper is hierarchical structures, and (3) this topic has been a major one in the conferences (with 15 papers in this cluster). The time slicer would show that in fact this topic received a lot of attention in the most recent conferences.
Image 3.1.2:
Insight 3.1.2:
Using IN-SPIRE zooming, labeling, and word usage statistics tools, we can focus on the cluster of 15 papers belonging to the tree-structure cluster. The figure reveals subtle differences. For example, Robertson’s paper is less insistent than other papers on trees and prefers to use the notion of hierarchies. Note also that this paper is closely related to another paper written by the lead author alone. The figure also reveals that Robertson’s paper is closely related to a series of papers written by Van De Wetering, Van Wijk, and Van Ham.

Task 3.2: What, if any, are the relationships between two or more or all researchers?

Image 3.2.1:
Insight 3.2.1:
From the direct contribution perspective, Robertson and Card are as apart as can be. Card has focused on design and mining the web, while Robertson, as mentioned earlier, focused on hierarchical structures. All contributions are centered within a popular cluster, arguably suggesting that these papers are of higher-level content within their respective main topics.
Image 3.2.2:
Insight 3.2.2:
This is the same analysis but this time looking at all contributions made by PNNL authors. The figure shows that these contributions cover a good part of the information landscape defined by all conference contributions, with better coverage of the lower right part of the map. No contribution on hierarchical structures and 3D were made. We can see subtle differences among PNNL authors. For example, Havre and Hetzler have focused on web mining and time analysis with original contributions in the sense that their papers appear isolated on the visualization. Thomas and Whitney contributions lie on a horizontal line. Wong’s papers are quite scattered, with a tendency to lie on the upper part of the map covered by PNNL contributions. These internal structures motivate a closer look at the horizontal and vertical axes of the visualization. Both axes seem to correspond to a change in focus from the internal core of visualization tools (top part of the map) to more external aspects, such as applications (for vertical axis) and interactivity (for the horizontal axis).

Task 3.3: Scope of a researcher influence

Process:
With IN-SPIRE, we can also rapidly infer the influence of a researcher in the different visualization research directions. This is done by selecting only the papers for which a given author is cited in the references. The following image shows this for Robertson.
Image 3.3:
Insight 3.3.1:
It is clear that Robertson’s influence goes far beyond his only contribution to the InfoVis symposium. Papers that cite his work, represented as highlighted green dots in the image, are represented in all major clusters and span all visualization research themes. He is very prominent in the clusters dealing with user interaction and hierarchical structures. However, he is far less influential in the lower left corner of the visualization, which covers topics such as web mining, time analysis, and DB querying.

ACKNOWLEDGMENTS

This work was sponsored by the National Visual Analytics Center (NVAC) located at the Pacific Northwest National Laboratory in Richland, WA. Special thanks go to Kris Cook, Lee Ann Dudney, Sharon Eaton, Harlan Foote, Jeffery Lettau, Sara Perez, Ken Perrine, and Wanda Mar who provided assistance of many forms throughout this contest effort. The Pacific Northwest National Laboratory is managed for the U.S. Department of Energy by Battelle Memorial Institute under Contract DE-AC06-76RL1830.

Web Accessibility

InfoVis 2004 Contest IN-SPIRE InfoVis 2004 Contest Entry