Lexing (Tokenizing): Converting a string to a list of tokens
Token: A meaningful string
Typically: keywords, identifiers, numbers, etc.
"The short wizard" \(\Rightarrow\) [Det; Adj; Noun]
type token = Int of int | Add | Sub | LParen | RParen;;
tokenize "2 + ( 4 - 5)";;
=> [Int 2; Add; LParen; Int 4; Sub; Int 5; RParen]
How to Tokenize?
One way: regular expressions (RE) and boring repetition
(* one regexp per token kind; requires the Str library *)
let re_num = Str.regexp "[0-9]+" in
let re_add = Str.regexp "\\+" in  (* + is a repetition operator in Str, so escape it *)
let re_sub = Str.regexp "-" in
let rec mklst text =
  if text = "" then [] else
  if Str.string_match re_num text 0 then
    let matched = Str.matched_string text in
    let n = String.length matched in
    (* drop the whole matched number, not just its first character *)
    Int (int_of_string matched) :: mklst (String.sub text n (String.length text - n))
  else if Str.string_match re_add text 0 then
    Add :: mklst (String.sub text 1 (String.length text - 1))
  else if Str.string_match re_sub text 0 then
    Sub :: mklst (String.sub text 1 (String.length text - 1))
  else (* skip any other character, e.g. whitespace *)
    mklst (String.sub text 1 (String.length text - 1)) in
mklst "2 + 3";;
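The regexp version handles only numbers, + and -, although the token type also has LParen and RParen. A minimal sketch of a tokenizer covering the whole token type, scanning characters by index instead of using Str. The names tokenize, is_digit, go and stop are illustrative assumptions, not from the slides, and unlike the version above it fails on an unknown character instead of skipping it:

```ocaml
type token = Int of int | Add | Sub | LParen | RParen

let is_digit c = '0' <= c && c <= '9'

(* Scan left to right, emitting one token per lexeme and skipping whitespace. *)
let tokenize (text : string) : token list =
  let len = String.length text in
  let rec go i =
    if i >= len then []
    else match text.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1)
      | '+' -> Add :: go (i + 1)
      | '-' -> Sub :: go (i + 1)
      | '(' -> LParen :: go (i + 1)
      | ')' -> RParen :: go (i + 1)
      | c when is_digit c ->
        (* find the end of the run of digits that starts at i *)
        let rec stop j =
          if j < len && is_digit text.[j] then stop (j + 1) else j in
        let j = stop i in
        Int (int_of_string (String.sub text i (j - i))) :: go j
      | c -> failwith (Printf.sprintf "unexpected character %c" c)
  in
  go 0
```

With this sketch, tokenize "2 + ( 4 - 5)" yields [Int 2; Add; LParen; Int 4; Sub; Int 5; RParen].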
Parsing: token list to AST
Can check whether the text is grammatically correct
Many types of parsers: we will use recursive descent
Recursive descent parsing (RDP) is top down; Grammar slides showed bottom up
Consider the basic grammar for Polish notation
\(E \rightarrow A\vert + A\ E \vert - A\ E\)
\(A \rightarrow 0\vert 1\vert \dots\vert 9\)
let parse_toks tokens =
  (* A -> 0 | 1 | ... | 9 : consume one single-digit number *)
  let parse_num tokens =
    match tokens with
    | Int n :: t when 0 <= n && n <= 9 -> t
    | _ -> failwith "error" in
  (* E -> A | + A E | - A E : return the tokens left over *)
  let rec parse_expr tokens =
    match tokens with
    | [] -> failwith "error"
    | Add :: t -> parse_expr (parse_num t)
    | Sub :: t -> parse_expr (parse_num t)
    | _ -> parse_num tokens
  (* the input is accepted only if every token was consumed *)
  in parse_expr tokens = [];;
Important: knowing which branch you are looking for
Backtracking vs Predictive
Predictive: what's the next symbol?
First(nt): the set of terminals that can begin a string derived from the nonterminal nt
Only goes so far: fails when the first sets of two branches conflict
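For the Polish notation grammar, First(A) = {0, ..., 9}, First(+ A E) = {+} and First(- A E) = {-}. The sets are disjoint, so one token of lookahead picks the branch. A minimal sketch of that choice follows; the branch type and the predict function are hypothetical names introduced here for illustration:

```ocaml
type token = Int of int | Add | Sub | LParen | RParen

(* The three productions of E in the Polish notation grammar. *)
type branch = JustA | PlusAE | MinusAE

(* Predict the production from one token of lookahead.
   Each pattern corresponds to the first set of one branch. *)
let predict (tokens : token list) : branch option =
  match tokens with
  | Int n :: _ when 0 <= n && n <= 9 -> Some JustA   (* First(A) *)
  | Add :: _ -> Some PlusAE                          (* First(+ A E) *)
  | Sub :: _ -> Some MinusAE                         (* First(- A E) *)
  | _ -> None                                        (* no branch applies *)
```

By contrast, a grammar such as E -> A | A + E has two branches that both start with First(A), so a single token of lookahead cannot choose between them.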
Converting to AST
Recall a Tree in OCaml
type tree = Leaf | Node of int * tree * tree;;
Node(2,Node(0,Leaf,Leaf),Leaf);;
Modify for Tokens
type expr = Num of int|Plus of expr * expr|Minus of expr * expr;;
Plus(Num 1, Num 2);;
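To build an expr instead of just accepting or rejecting, each parsing function can return the tree it recognized together with the tokens it did not consume. A minimal sketch for the Polish notation grammar, assuming that + A E maps to Plus(A, E) and - A E to Minus(A, E); the function names are illustrative:

```ocaml
type token = Int of int | Add | Sub | LParen | RParen
type expr = Num of int | Plus of expr * expr | Minus of expr * expr

(* A -> 0 | ... | 9 *)
let parse_num tokens =
  match tokens with
  | Int n :: t when 0 <= n && n <= 9 -> (Num n, t)
  | _ -> failwith "expected a digit"

(* E -> A | + A E | - A E ; returns the tree and the unconsumed tokens *)
let rec parse_expr tokens =
  match tokens with
  | Add :: t ->
    let (left, t1) = parse_num t in
    let (right, t2) = parse_expr t1 in
    (Plus (left, right), t2)
  | Sub :: t ->
    let (left, t1) = parse_num t in
    let (right, t2) = parse_expr t1 in
    (Minus (left, right), t2)
  | _ -> parse_num tokens

(* Succeed only if the whole token list was consumed. *)
let parse tokens =
  match parse_expr tokens with
  | (ast, []) -> ast
  | _ -> failwith "trailing tokens"
```

For example, parse [Add; Int 1; Int 2] gives Plus (Num 1, Num 2).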
Interpreting/Compiling: Take AST and return either code or a value (which is code)
compile (Plus(Num 1, Num 2));;
=> "mov eax,1
mov ebx,2
add eax,ebx"
interpret (Plus(Num 1, Num 2));;
=> 3
Typically some sort of recursive traversal of the AST
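As a sketch of such a traversal, an interpreter for the expr type can evaluate both subtrees recursively and combine the results. This is one possible definition of interpret, not necessarily the one intended by the slides:

```ocaml
type expr = Num of int | Plus of expr * expr | Minus of expr * expr

(* Evaluate the tree bottom-up: leaves are values, inner nodes combine them. *)
let rec interpret ast =
  match ast with
  | Num n -> n
  | Plus (l, r) -> interpret l + interpret r
  | Minus (l, r) -> interpret l - interpret r
```

Here interpret (Plus (Num 1, Num 2)) evaluates to 3.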
let rec compile ast =
  match ast with
  | Plus (y, z) ->
    (* only correct when y and z are Num leaves: a nested sum would
       return code here, not a number *)
    let num1 = compile y in
    let num2 = compile z in
    let cn1 = "mov eax," ^ num1 ^ "\n" in
    let cn2 = "mov ebx," ^ num2 ^ "\n" in
    let add = "add eax,ebx" in
    cn1 ^ cn2 ^ add
  | Num x -> string_of_int x
  | _ -> failwith "error"
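The compile function above only works when both operands are plain numbers. One way to handle nested expressions is to have every subtree leave its result in eax and to save the left result on the stack while the right side is computed. The following is a hedged sketch of that idea, not the slides' method; it emits x86-style text that has not been assembled, and binop and compile_to_string are hypothetical helper names:

```ocaml
type expr = Num of int | Plus of expr * expr | Minus of expr * expr

(* Emit instructions that leave the value of the expression in eax. *)
let rec compile ast =
  match ast with
  | Num n -> [Printf.sprintf "mov eax,%d" n]
  | Plus (l, r) -> binop "add" l r
  | Minus (l, r) -> binop "sub" l r
and binop op l r =
  compile l          (* left value ends up in eax *)
  @ ["push eax"]     (* save it while the right side is computed *)
  @ compile r        (* right value ends up in eax *)
  @ ["mov ebx,eax"; "pop eax"; op ^ " eax,ebx"]

(* Join the instruction list into one newline-separated string. *)
let compile_to_string ast = String.concat "\n" (compile ast)
```

The output is longer than the hand-written three-instruction version, which is the usual trade-off of a uniform traversal over a special case.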