Assignment #2: Perl


Regular Expressions    Due: Feb. 24, 2000, 6:00pm

 

One of the first steps in reading English text is to group letters into legal English words. Similarly, compilers need to group the character strings of a program into meaningful programming-language words, called tokens. This process is known as lexical analysis.

Some common C/C++ tokens include:

To simplify lexical analysis, most programming languages require that each type of token follow a specific formatting convention. These conventions (patterns) are formally specified by regular expressions. (We discuss regular expressions in class.)

If an input string conforms to the regular expression for type X (an identifier, a constant, etc.), then the compiler can infer that the string is of type X.
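
As a minimal sketch of this idea in Perl, the snippet below checks whether an entire string conforms to one pattern. The helper name is_digits is hypothetical and not part of the assignment. The anchors \A and \z force the pattern to cover the whole string, so a string that merely contains a matching piece is rejected.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the WHOLE string is a run of digits.
# \A and \z anchor the pattern to the start and end of the string, so a
# partial match (the "12" inside "12ab") does not count.
sub is_digits {
    my ($text) = @_;
    return $text =~ /\A[0-9]+\z/ ? 1 : 0;
}

print is_digits("2000") ? "match\n" : "no match\n";
print is_digits("12ab") ? "match\n" : "no match\n";
```

Without the anchors, "12ab" would match because the pattern would be satisfied by the leading "12" alone.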

Your assignment is to write a Perl program that classifies input strings into one of the following six token types, loosely based on those in C or C++.

  1. ID.  An identifier consists of a sequence of digits and letters, with the first character required to be a letter.  Permitted letters include all upper and lower case English letters plus the underscore (_).  Permitted digits are the ten decimal digits (0..9).
  2. INT.  An integer constant is a non-empty sequence of digits, optionally preceded by a sign (+ or -), and optionally followed by a suffix consisting of an unsigned type indicator (u or U) and/or a long type indicator (l or L). If both an unsigned and a long type indicator are present, they can be in either order.
  3. FLOAT.  A floating point constant consists of, in order, an integer part, a decimal point, a fraction part, an exponent marker (e or E), and an exponent part.  The integer, fraction, and exponent parts (when present) are non-empty sequences of digits. The integer and exponent parts may each be preceded by an optional sign (+ or -).
    Furthermore, floating point constants must satisfy the following rules and exceptions:
        a.  Either the integer part or the fraction part may be missing, but at least one must be present.
        b.  Either the decimal point or the exponent (marker and part) may be missing, but at least one must be present.
  4. STR. A string is a sequence of characters beginning and ending with a double-quote mark (") with any number of other normally typable characters in between.  The delimiting quotes are not considered part of the string, but should be kept with the string token for this exercise.  A double-quote mark is allowed inside the string when preceded by a backslash (\); for example: "I said, \"Hello!\"" is a string containing two double-quote marks.
  5. CHAR. A character constant is a single typable character within single quote-marks (').
  6. OP.  Operators come from the set {+, -, *, /, =, <, >, <=, >=, &, |, ++, --}.
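
The following is one possible sketch of how three of these definitions (ID, INT, and FLOAT) might be written as Perl patterns. It is not a reference solution. The names %PATTERN, $EXP, and matches_type are hypothetical, and the FLOAT pattern assumes that a leading sign is allowed only when the integer part is present, which is one reading of the rules above.

```perl
use strict;
use warnings;

my $EXP = qr/[eE][+-]?[0-9]+/;   # exponent marker plus exponent part

# Hypothetical pattern table covering three of the six token types.
my %PATTERN = (
    # A letter (underscore counts as a letter), then letters or digits.
    ID    => qr{\A[A-Za-z_][A-Za-z0-9_]*\z},
    # Optional sign, digits, then u, l, ul, or lu in either case.
    INT   => qr{\A[+-]?[0-9]+(?:[uU][lL]?|[lL][uU]?)?\z},
    # Assumption: a sign may appear only in front of an integer part.
    FLOAT => qr{\A(?:
          [+-]?[0-9]+ \. [0-9]* (?:$EXP)?   # integer part and point; fraction optional
        | \. [0-9]+ (?:$EXP)?               # no integer part, so the fraction is required
        | [+-]?[0-9]+ $EXP                  # no point, so the exponent is required
    )\z}x,
);

# Hypothetical helper: true when $text as a whole matches the pattern for $type.
sub matches_type {
    my ($type, $text) = @_;
    return $text =~ $PATTERN{$type} ? 1 : 0;
}

# Show which of the three patterns accept a few sample strings.
for my $sample ('x_1', '42UL', '.5e-3', '1a') {
    my @kinds = grep { matches_type($_, $sample) } sort keys %PATTERN;
    print "$sample: ", (@kinds ? "@kinds" : 'no match'), "\n";
}
```

Splitting FLOAT into three alternatives keeps rules (a) and (b) visible in the pattern: each alternative names the part that becomes mandatory when another part is left out.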

The input data for your Perl program will consist of one token per input line (without any whitespace).  If the input matches one of the six classes, print out the input line number, the type of the token, and its value.  Otherwise, print the input line number, UNK to indicate an unrecognized token, and the contents of the line.  Your program should read from stdin and write to stdout (redirected to an output file for submission).
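
The read-and-report loop described above could be sketched as follows. The names classify and report_lines are hypothetical, and classify here is only a stand-in that recognizes ID so the loop can run; a full program would test all six token types. The demo reads from an in-memory string so it runs without input; the actual program would pass \*STDIN instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in classifier: recognizes only ID so the loop can be run.
sub classify {
    my ($token) = @_;
    return $token =~ /\A[A-Za-z_][A-Za-z0-9_]*\z/ ? 'ID' : 'UNK';
}

# Reads one token per line from $in and writes "line-number type value" to $out.
sub report_lines {
    my ($in, $out) = @_;
    my $line_number = 0;
    while (my $line = <$in>) {
        chomp $line;                 # drop the trailing newline before matching
        $line_number++;
        print {$out} "$line_number " . classify($line) . " $line\n";
    }
    return;
}

# Demo on an in-memory input; use report_lines(\*STDIN, \*STDOUT) for real input.
my $sample = "aaa\n1a\n";
open my $demo_in, '<', \$sample or die "cannot open in-memory input: $!";
report_lines($demo_in, \*STDOUT);
```

Passing the filehandles as arguments keeps the loop independent of where the input comes from, which makes it easy to try on small samples before redirecting stdin and stdout for submission.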

For example, the input:

aaa
-111
11.1E-6
--
"Don't quote \" me."
'z'
1a

should result in the output:

1 ID aaa
2 INT -111
3 FLOAT 11.1E-6
4 OP --
5 STR "Don't quote \" me."
6 CHAR 'z'
7 UNK 1a

Handing in the assignment

Instructions for submitting your work:

Your work may not be graded if these procedures are not followed exactly.

A large penalty will be assessed if the required output format is not followed exactly.