Parallel Text Search


Description

This application is a trivial parallelization of agrep program from University of Arizona. agrep is capable of partial match and approximate searches [1]. Our parallel version, called pgrep , can operate in one of two modes. In the first mode, the list of files to be searched is partitioned, round-robin, among the processors. In the second mode, each input file is block-partitioned among the processors, so that each processor searches a distinct portion of the file. This application performs I/O using synchronous read() operations.

Input Dataset

To drive this application, we used the "/users" document hierarchy on the University of Maryland web server. The hierarchy contained 475~MB in 18,655 files.

Workload

We used a complex pattern including wildcards, disjunction and substitution: "b#dy,[a-z]#gif,l#[n-z]t#,F#" .
In this pattern, "," is OR-construct; "#" is wildcard; and "[a-z]" is all the characters between "a" and "z". We allowed 3 substitutions on this pattern, i.e. any string which can be obtained by substituting 3 characters is also accepted by the pgrep.

Traces

You can download the trace files in the following formats:

References

[1] Sun Wu and Udi Manber. agrep - a fast approximate pattern-matching tool . In USENIX Conference Proceedings, pages 153-162, San Fransisco, CA, Winter 1992.

----------------------------------------------------------------------
Last updated on Tue May 27 12:37:44 EDT 1997 by Mustafa Uysal (uysal@cs.umd.edu ).