Issues in Understanding, Indexing, Querying, and Visualizing
Spatio-textual Spreadsheets
Numerous organizations, including government agencies, are sitting on
mountains of spreadsheet data. Such data are becoming increasingly
common on the web, but their contents remain out of reach of search
engines because direct links to the contents of individual cells are
rare. Thus spreadsheet data represent legacy databases, especially
since many of their underlying schemas are no longer accessible. The
goal of this research is to discover the schema according to which the
spreadsheet is constructed. The focus is on the spatio-textual
spreadsheet, which is a spreadsheet in which the values of the spatial
attributes are specified textually (for example, as place names). Such
spreadsheets support spatial
searches whose output is visual and whose utility is enhanced by being
able to handle spatial synonyms. This is done, in part, by devising
methods that automatically discover the spatial attributes of the
spreadsheet and distinguish between multiple instances of them, which
arise due to the presence of a containment hierarchy. In particular,
use is made of spatial coherence: spatial data in the same column are
usually of the same spatial type, while spatial data in the same
spreadsheet row usually exhibit a containment relationship. Moreover,
adjacent or nearby rows exhibit spreadsheet coherence in that they are
usually similar. The
broad impact of this research is to make spreadsheet data a first-class
citizen on the web, with the same chances of being discovered and
accessed as data found in other documents.
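The two coherence cues described above can be illustrated with a minimal sketch. It is not the method used in this research: the toy `GAZETTEER`, the function names, and the majority-vote rule are all assumptions made for illustration, and the ambiguity of place names (which a real system must handle) is ignored.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy gazetteer: place name -> (spatial type, containing place or None).
# These entries are illustrative assumptions, not data from the project.
GAZETTEER = {
    "USA": ("country", None),
    "Maryland": ("state", "USA"),
    "Texas": ("state", "USA"),
    "College Park": ("city", "Maryland"),
    "Dallas": ("city", "Texas"),
}


def column_types(rows):
    """Column coherence: label each column with its most common spatial type."""
    types = []
    for column in zip(*rows):
        votes = Counter(GAZETTEER[cell][0] for cell in column if cell in GAZETTEER)
        types.append(votes.most_common(1)[0][0] if votes else None)
    return types


def contains(outer, inner):
    """Return True if `inner` lies within `outer`, following container links upward."""
    parent = GAZETTEER.get(inner, (None, None))[1]
    while parent is not None:
        if parent == outer:
            return True
        parent = GAZETTEER.get(parent, (None, None))[1]
    return False


def row_is_coherent(row):
    """Row coherence: every pair of spatial cells is related by containment."""
    places = [cell for cell in row if cell in GAZETTEER]
    return all(contains(a, b) or contains(b, a) for a, b in combinations(places, 2))


rows = [
    ["College Park", "Maryland", "USA"],
    ["Dallas", "Texas", "USA"],
]
print(column_types(rows))
print([row_is_coherent(row) for row in rows])
```

In this sketch a row such as `["Dallas", "Maryland", "USA"]` fails the row check because neither Dallas nor Maryland contains the other, which is the kind of signal that could help reject a wrong interpretation of a column.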
NSF Grant
IIS-10-18475
PI: Hanan Samet
Relevant Publications:
- M. D. Adelfio and H. Samet.
Schema extraction for tabular data on the web.
PVLDB, 6(6):421-432, April 2013.
Also Proceedings of the 39th International Conference on Very
Large Data Bases (VLDB).
Categories: [spreadsheets]
- M. D. Adelfio and H. Samet.
Structured toponym resolution using combined hierarchical place
categories.
In R. Purves and C. Jones, editors, Proceedings of the 7th ACM
SIGSPATIAL Workshop on Geographic Information Retrieval (GIR'13), pages
49-56, Orlando, FL, November 2013.
2013 GIR'13 Best Paper Award
Categories: [spatio-textual search engine]
- M. D. Adelfio and H. Samet.
GeoWhiz: Using common categories for toponym resolution.
In C. A. Knoblock, P. Kröger, J. C. Krumm, M. Schneider, and
P. Widmayer, editors, Proceedings of the 21st ACM SIGSPATIAL
International Conference on Advances in Geographic Information Systems,
pages 542-545, Orlando, FL, November 2013.
Categories: [spatio-textual search engine]
- M. D. Adelfio and H. Samet.
Itinerary retrieval: Travelers, like traveling salesmen, prefer
efficient routes.
In R. Purves and C. Jones, editors, Proceedings of the 8th ACM
SIGSPATIAL Workshop on Geographic Information Retrieval (GIR'14), Dallas,
TX, November 2014.
Categories: [spatio-textual search engine]
- M. D. Adelfio and H. Samet.
Automated itinerary visualization.
In Y. Huang, M. Gertz, J. C. Krumm, J. Sankaranarayanan, and
M. Schneider, editors, Proceedings of the 22nd ACM SIGSPATIAL
International Conference on Advances in Geographic Information Systems,
Dallas, TX, November 2014.
Categories: [spatio-textual search engine]
Theses
- M. D. Adelfio
Automated Structural and Spatial Comprehension of Data Tables.
PhD Thesis. University of Maryland, Department of Computer Science, 2015.
- M. D. Lieberman
Multifaceted Geotagging for Streaming News.
PhD Thesis. University of Maryland, Department of Computer Science, 2012.
Programs/Software
- GeoWhiz. A demonstration of our method for interpreting lists of place
names (such as those found in a spreadsheet or table column). The method
attempts to identify a common thread among all entries in the list, which
can then be used to interpret ambiguous place names more accurately.
- Spatial Browser for Wikipedia Categories. An alternative way of
browsing location sets. Rather than showing each location as a marker on a
single map, we show a close-up of each location. This allows the visual
attributes of the locations to be compared via satellite imagery.
- PhotoStand. An image-based browser that uses a map query interface to
retrieve news photos. Each photo is associated with a news article, which
is in turn associated with the principal locations that it mentions, based
on data from the NewsStand system.
- NewsStand. An example application of a general framework that enables
people to search for
information with a map-query interface. The NewsStand system monitors the
output of more than 10,000 RSS news feeds and incorporates new articles
within minutes of publication. Each article undergoes a geotagging
procedure, where location references are identified and interpreted,
allowing us to associate each article with the geographic locations that it
mentions.
- GeoXLS. A geographic
search system that enables users to submit a set of locations as a query
object Q and to find documents containing locations similar to those in
Q. The demonstration system supports searching location sets derived from
Wikipedia categories, news clusters (the NewsStand dataset), and a large
collection of geographic spreadsheets and data tables.
- Ontuition. A system for filtering and mapping ontologies. The
demonstration application
uses an ontology of medical and drug trials, sourced from the
NIH-administered website clinicaltrials.gov.
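The "common thread" idea in the GeoWhiz description above can be sketched as follows. This is a simplified illustration, not the GeoWhiz implementation: the `CANDIDATES` table, its category labels, and the `resolve` function are hypothetical, and a plain coverage count stands in for whatever scoring the actual method uses.

```python
from collections import Counter

# Hypothetical candidate interpretations: name -> list of (interpretation, category).
# Category labels are illustrative assumptions.
CANDIDATES = {
    "Paris": [("Paris, France", "national capital"),
              ("Paris, Texas, USA", "Texas city")],
    "London": [("London, UK", "national capital"),
               ("London, Ontario, Canada", "Ontario city")],
    "Washington": [("Washington, D.C., USA", "national capital"),
                   ("Washington (state), USA", "US state")],
}


def resolve(names):
    """Pick the category covering the most names, then resolve each name within it."""
    coverage = Counter()
    for name in names:
        # Count each category once per name, even if several candidates share it.
        for category in {cat for _, cat in CANDIDATES.get(name, [])}:
            coverage[category] += 1
    if not coverage:
        return None, {}
    common = coverage.most_common(1)[0][0]
    resolved = {}
    for name in names:
        for place, category in CANDIDATES.get(name, []):
            if category == common:
                resolved[name] = place
                break
    return common, resolved


category, interpretation = resolve(["Paris", "London", "Washington"])
print(category)
print(interpretation)
```

Here the only category shared by all three names is the hypothetical "national capital" label, so each ambiguous name is resolved to its candidate in that category. Names with no candidate in the chosen category are simply left unresolved in this sketch.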