Assignment #2: Perl Regular Expressions
Due: Sept. 24, 1999 6:00pm
One of the first steps in reading English text is to group letters into legal English
words. Similarly, compilers need to group character strings from a program into
meaningful programming language words, called tokens. This process is
known as lexical analysis.
Some common C++ tokens include:
- identifiers (variable names),
- keywords (if, for, switch, etc.),
- constants (integer, float, character),
- strings (quoted sequences of characters),
- operators (+, -, =, *=, etc.),
- and other special tokens ({, }, (, ), etc.).
To simplify lexical analysis, most programming languages require that different types
of tokens follow specific formatting conventions. These conventions (patterns) are
formally specified by regular expressions. (We discuss regular expressions in
class.)
If an input string conforms to the regular expression for type X (an
identifier, a constant, etc.), then the compiler can infer that the string is of type X.
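As a minimal illustration of inferring a type from a pattern match, the sketch below tests whether a string is a keyword. It covers only the three keywords named above, and the helper name is_keyword is hypothetical. The \A and \z anchors require the entire string to conform, so a string that merely contains a keyword is rejected.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the whole string is one of the three
# keywords listed earlier (if, for, switch), and 0 otherwise.
sub is_keyword {
    my ($text) = @_;
    return $text =~ /\A(?:if|for|switch)\z/ ? 1 : 0;
}

# Small demonstration: "iffy" starts with a keyword but does not conform.
for my $word (qw(if for switch iffy)) {
    printf "%-6s %s\n", $word, is_keyword($word) ? 'keyword' : 'not a keyword';
}
```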
Your assignment is to write a Perl program that classifies each input string as one of
the following five token types, loosely based on those in C++.
- ID. An identifier consists of a sequence of digits and letters, with the
first character required to be a letter. Permitted letters include all upper and
lower case English letters plus the underscore (_). Permitted
digits are the ten decimal digits (0..9).
- INT. An integer constant is a non-empty sequence of digits, optionally
preceded by a sign (+ or -).
- FLOAT. Each floating point constant consists of an integer part, a
decimal point, a fraction part, an exponent marker (e or E),
and an exponent part. The integer, fraction, and exponent parts (when present) are
non-empty sequences of digits. The integer and exponent parts may each be preceded
by an optional sign (+ or -).
Furthermore, floating point constants must satisfy the following rules and exceptions:
a. Either the integer part or the fraction part may be missing,
but at least one must be present.
b. Either the decimal point or the exponent (marker and part) may
be missing, but at least one must be present.
- STR. A string is a sequence of characters beginning and ending with a
double-quote mark (") and with any other normally typable character
in between. The delimiting quotes are not considered part of the string, but should
be kept with the string token for this exercise. A double-quote mark is allowed
inside of the string when preceded by a backslash (\); for example:
"I said, \"Hello!\"" is a string containing two double-quote marks.
Similarly, including a backslash character in a string requires two consecutive
backslashes (\\) in the string.
- OP. Operators come from the set {+, -, *, /, =, &, |, ++, --}.
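One possible reading of the five definitions above can be sketched as a table of anchored patterns tried in order. This is an illustration under stated assumptions, not a reference solution: the names @TOKEN_PATTERNS and classify are hypothetical; a sign on a FLOAT is accepted only when an integer part is present; and inside a STR a backslash is assumed to escape any single following character, although the text names only \" and \\.

```perl
use strict;
use warnings;

# Hypothetical pattern table: one anchored pattern per token type.
my @TOKEN_PATTERNS = (
    [ ID    => qr{\A[A-Za-z_][A-Za-z0-9_]*\z} ],
    [ INT   => qr{\A[+-]?[0-9]+\z} ],
    [ FLOAT => qr{\A(?:
          [+-]?[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?  # integer part and point
        | \.[0-9]+(?:[eE][+-]?[0-9]+)?             # point and fraction only
        | [+-]?[0-9]+[eE][+-]?[0-9]+               # no point, exponent required
      )\z}x ],
    # Assumption: a backslash escapes any one following character.
    [ STR   => qr{\A"(?:[^"\\]|\\.)*"\z} ],
    [ OP    => qr{\A(?:\+\+|--|[-+*/=&|])\z} ],
);

# Return the name of the first type whose pattern matches, or UNK.
sub classify {
    my ($text) = @_;
    for my $entry (@TOKEN_PATTERNS) {
        my ($type, $pattern) = @$entry;
        return $type if $text =~ $pattern;
    }
    return 'UNK';
}

# Small demonstration on a few hand-picked tokens.
for my $token ('aaa', '-111', '11.1E-6', '--', q{"Don't quote \" me."}, '1a') {
    print classify($token), " $token\n";
}
```

Under these assumptions the five patterns cannot match the same string, so the order of the table does not affect the result. Under a different reading of the rules, for example one that allows a sign before a FLOAT with no integer part, the FLOAT pattern would need to change.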
The input data for your Perl program will consist of one token per input line (without
any whitespace). If the input matches one of the five classes, print out the input
line number, the type of the token, and its value. Otherwise, print the input line
number, UNK to indicate an unrecognized token, and the contents of the
line. Your program should read from stdin and write to stdout (redirected to an
output file for submission).
For example, the input:
aaa
-111
11.1E-6
--
"Don't quote \" me."
1a
should result in the output:
1 ID aaa
2 INT -111
3 FLOAT 11.1E-6
4 OP --
5 STR "Don't quote \" me."
6 UNK 1a
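The read-and-report loop itself can be sketched separately from the patterns. The example below is a minimal sketch, not a required design: report_lines and toy_classify are hypothetical names, the stand-in classifier recognizes only INT, and the fields are assumed to be separated by single spaces as in the sample output. Perl's built-in $. variable supplies the input line number.

```perl
use strict;
use warnings;

# Hypothetical stand-in classifier: knows only INT; everything else is UNK.
sub toy_classify {
    my ($text) = @_;
    return $text =~ /\A[+-]?[0-9]+\z/ ? 'INT' : 'UNK';
}

# Read one token per line from the filehandle $in and return the report
# lines in the form "line-number TYPE value".
sub report_lines {
    my ($in, $classifier) = @_;
    my @report;
    while (my $line = <$in>) {
        chomp $line;    # drop the trailing newline before matching
        # $. holds the number of the line most recently read from $in
        push @report, join(' ', $., $classifier->($line), $line);
    }
    return @report;
}

# Demonstration on an in-memory filehandle; a real run would pass \*STDIN.
my $sample = "-111\n1a\n";
open my $demo, '<', \$sample or die "cannot open in-memory input: $!";
print "$_\n" for report_lines($demo, \&toy_classify);
close $demo;
```

Passing the classifier as a code reference keeps the loop independent of the patterns, so the stand-in can be swapped for a full five-type classifier without touching the input and output handling.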
Handing in the assignment
Instructions for submitting your work:
- Name your Perl source file assignment2.pl, and make it executable;
- Your Perl program should read its test data from stdin, redirected on the command
line from the input data file (e.g., assignment2.pl < test_data.txt);
- tar the file(s) for submission (e.g., tar cvf submit2.tar assignment2.pl);
- submit the tar file: ~as330003/alpha.bin/submit 2 submit2.tar.
Your work may not be graded if these procedures are not followed exactly.
A large penalty will be assessed if the required output format is not followed
exactly.