8 Dodger: addressing a lack of character
There are 11 types of values...
8.1 Characters
In Dupe: a duplicity of types, we saw how to accomodate disjoint datatypes, namely integers and booleans. Let’s add yet another: the character type. Conceptually, there’s not much new here (hence we stay in the "D" family of languages); we’re simply adding a third type, which we distinguish from the other two by using more bits to tag the type.
We’ll call it Dodger.
To the syntax of expressions, we add character literals.
We will also add the following operations:
char? : Any -> Boolean: predicate for recognizing character values
integer->char : Integer -> Character: converts from integers to characters
char->integer : Character -> Integer: converts from integers to characters
Abstract syntax is modelled with the following datatype definition:
#lang racket (provide Int Bool Char Prim1 If) ;; type Expr = ;; | (Int Integer) ;; | (Bool Boolean) ;; | (Char Character) ;; | (Prim1 Op Expr) ;; | (If Expr Expr Expr) ;; type Op = 'add1 | 'sub1 | 'zero? ;; | 'char? | 'integer->char | 'char->integer (struct Int (i) #:prefab) (struct Bool (b) #:prefab) (struct Char (c) #:prefab) (struct Prim1 (p e) #:prefab) (struct If (e1 e2 e3) #:prefab)
The s-expression parser is defined as follows:
#lang racket (provide parse) (require "ast.rkt") ;; S-Expr -> Expr (define (parse s) (cond [(integer? s) (Int s)] [(boolean? s) (Bool s)] [(char? s) (Char s)] [else (match s [(list 'add1 e) (Prim1 'add1 (parse e))] [(list 'sub1 e) (Prim1 'sub1 (parse e))] [(list 'zero? e) (Prim1 'zero? (parse e))] [(list 'char? e) (Prim1 'char? (parse e))] [(list 'integer->char e) (Prim1 'integer->char (parse e))] [(list 'char->integer e) (Prim1 'char->integer (parse e))] [(list 'if e1 e2 e3) (If (parse e1) (parse e2) (parse e3))] [_ (error "Parse error")])]))
8.2 Characters in Racket
Racket has a Character data type for representing single letters. A Racket character can represent any of the 1,114,112 Unicode code points.
The way a character is most often written is an octothorp, followed by a backslash, followed by the character itself. So for example the character a is written #\a. The character λ is written #\λ. The character 文 is written #\文.
A character can be converted to an integer and vice versa:
Examples
> (char->integer #\a) 97
> (char->integer #\λ) 955
> (char->integer #\文) 25991
> (integer->char 97) #\a
> (integer->char 955) #\λ
> (integer->char 25991) #\文
However, integers in the range of valid code points are acceptable to integer->char and using any other integer will produce an error:
Examples
> (integer->char -1) integer->char: contract violation
expected: valid-unicode-scalar-value?
given: -1
> (integer->char 55296) integer->char: contract violation
expected: valid-unicode-scalar-value?
given: 55296
There are a few other ways to write characters (see the Racket Reference for the details), but you don’t have to worry much about this since read takes care of reading characters in all their different forms. The run-time system, described below, takes care of printing them.
8.3 Meaning of Dodger programs
The semantics are omitted for now (there’s really nothing new that’s interesting).
The interpeter is much like that of Dupe, except we have a new base case:
#lang racket (provide interp) (require "ast.rkt" "interp-prim.rkt") ;; type Value = ;; | Integer ;; | Boolean ;; | Character ;; Expr -> Value (define (interp e) (match e [(Int i) i] [(Bool b) b] [(Char c) c] [(Prim1 p e) (interp-prim1 p (interp e))] [(If e1 e2 e3) (if (interp e1) (interp e2) (interp e3))]))
And the interpretation of primitives is extended:
#lang racket (provide interp-prim1) ;; Op1 Value -> Value (define (interp-prim1 op v) (match op ['add1 (add1 v)] ['sub1 (sub1 v)] ['zero? (zero? v)] ['char? (char? v)] ['integer->char (integer->char v)] ['char->integer (char->integer v)]))
The meaning of characters and their operations are just lifted from Racket.
We can try out some examples:
Examples
> (interp (Char #\a)) #\a
> (interp (Char #\b)) #\b
> (interp (Prim1 'char? (Char #\a))) #t
> (interp (Prim1 'char? (Bool #t))) #f
> (interp (Prim1 'char->integer (Char #\a))) 97
> (interp (Prim1 'integer->char (Prim1 'char->integer (Char #\a)))) #\a
Just as in Dupe, type errors result in the interpreter crashing:
Examples
> (interp (Prim1 'char->integer (Bool #f))) char->integer: contract violation
expected: char?
given: #f
Also, not every integer corresponds to a character, so when integer->char is given an invalid input, it crashes (more on this in a minute):
Examples
> (interp (Prim1 'integer->char (Int -1))) integer->char: contract violation
expected: valid-unicode-scalar-value?
given: -1
8.4 Ex uno plures iterum: Out of One, Many... Again
We have exactly the same problem as in Dupe: we need to represent different kinds of values within our one primordial datatype: the 64-bit integer.
We can use the following encoding scheme:
Integers have #b0 as the last bit; the other bits describe the integer.
Character have #b01 as the last bits; the other bits describe the character.
True is #b011 and False is #b111.
Notice that each kind of value is disjoint.
We can write an interpreter that operates at the level of bits just as we did for Dupe; notice that it only ever constructs characters at the very end when converting from bits to values. Let’s first define our bit encodings:
#lang racket (provide (all-defined-out)) (define int-shift 1) (define char-shift 2) (define type-int #b0) (define type-char #b01) (define mask-char #b11) (define val-true #b011) (define val-false #b111) (define (bits->value b) (cond [(= type-int (bitwise-and b #b1)) (arithmetic-shift b (- int-shift))] [(= type-char (bitwise-and b #b11)) (integer->char (arithmetic-shift b (- char-shift)))] [(= b val-true) #t] [(= b val-false) #f] [else (error "invalid bits")])) (define (value->bits v) (cond [(integer? v) (arithmetic-shift v int-shift)] [(char? v) (bitwise-ior type-char (arithmetic-shift (char->integer v) char-shift))] [(eq? v #t) val-true] [(eq? v #f) val-false]))
And now the interpreter:
#lang racket (provide interp interp-bits) (require "ast.rkt" "types.rkt") ;; type Value = ;; | Integer ;; | Boolean ;; | Character ;; type Bits = Integer ;; Expr -> Value (define (interp e) (bits->value (interp-bits e))) ;; Expr -> Bits (define (interp-bits e) (match e [(Int i) (value->bits i)] [(Char c) (value->bits c)] [(Bool b) (value->bits b)] [(Prim1 'add1 e0) (+ (interp-bits e0) (value->bits 1))] [(Prim1 'sub1 e0) (- (interp-bits e0) (value->bits 1))] [(Prim1 'zero? e) (if (zero? (interp-bits e)) val-true val-false)] [(Prim1 'char? e0) (if (= type-char (bitwise-and (interp-bits e0) #b11)) val-true val-false)] [(Prim1 'char->integer e0) (arithmetic-shift (arithmetic-shift (interp-bits e0) (- char-shift)) int-shift)] [(Prim1 'integer->char e0) (bitwise-ior (arithmetic-shift (arithmetic-shift (interp-bits e0) (- int-shift)) char-shift) type-char)] [(If e1 e2 e3) (if (= (interp-bits e1) val-false) (interp-bits e3) (interp-bits e2))]))
8.5 A Compiler for Dodger
Compilation is pretty easy, particularly since we took the time to develop the bit-level interpreter. The compiler uses the same bit-level representation of values and uses logical operations to implement the same bit manipulating operations:
#lang racket (provide (all-defined-out)) (require "ast.rkt" "types.rkt" a86/ast) ;; Expr -> Asm (define (compile e) (prog (Label 'entry) (compile-e e) (Ret))) ;; Expr -> Asm (define (compile-e e) (match e [(Int i) (compile-value i)] [(Bool b) (compile-value b)] [(Char c) (compile-value c)] [(Prim1 p e) (compile-prim p e)] [(If e1 e2 e3) (compile-if e1 e2 e3)])) ;; Value -> Asm (define (compile-value v) (seq (Mov 'rax (value->bits v)))) ;; Op Expr -> Asm (define (compile-prim p e) (seq (compile-e e) (match p ['add1 (Add 'rax (value->bits 1))] ['sub1 (Sub 'rax (value->bits 1))] ['zero? (let ((l1 (gensym))) (seq (Cmp 'rax 0) (Mov 'rax val-true) (Je l1) (Mov 'rax val-false) (Label l1)))] ['char? (let ((l1 (gensym))) (seq (And 'rax mask-char) (Xor 'rax type-char) (Cmp 'rax 0) (Mov 'rax val-true) (Je l1) (Mov 'rax val-false) (Label l1)))] ['char->integer (seq (Sar 'rax char-shift) (Sal 'rax int-shift))] ['integer->char (seq (Sar 'rax int-shift) (Sal 'rax char-shift) (Xor 'rax type-char))]))) ;; Expr Expr Expr -> Asm (define (compile-if e1 e2 e3) (let ((l1 (gensym 'if)) (l2 (gensym 'if))) (seq (compile-e e1) (Cmp 'rax val-false) (Je l1) (compile-e e2) (Jmp l2) (Label l1) (compile-e e3) (Label l2))))
8.6 A Run-Time for Dodger
The only interesting aspect of Dodger, really, is that we need to add run-time support for printing character literals.
/* Bit layout of values Values are either: - Integers: end in #b0 - Characters: end in #b01 - True: #b11 - False: #b111 */ #define int_shift 1 #define int_type_mask ((1 << int_shift) - 1) #define int_type_tag (0 << (int_shift - 1)) #define nonint_type_tag (1 << (int_shift - 1)) #define char_shift (int_shift + 1) #define char_type_mask ((1 << char_shift) - 1) #define char_type_tag ((0 << (char_shift - 1)) | nonint_type_tag) #define nonchar_type_tag ((1 << (char_shift - 1)) | nonint_type_tag) #define val_true ((0 << char_shift) | nonchar_type_tag) #define val_false ((1 << char_shift) | nonchar_type_tag) #define val_empty ((2 << char_shift) | nonchar_type_tag)
#include <stdio.h> #include <inttypes.h> #include "types.h" int64_t entry(); void print_result(int64_t); void print_char(int64_t); int main(int argc, char** argv) { print_result(entry());return 0; } void print_result(int64_t result) { if (int_type_tag == (int_type_mask & result)) { "%" PRId64 "\n", result >> int_shift); printf(else if (char_type_tag == (char_type_mask & result)) { } print_char(result);"\n"); printf(else { } switch (result) { case val_true: "#t\n"); break; printf(case val_false: "#f\n"); break; printf( } } }
The work of printing characters is moved to it’s own file, which can be linked against the run-time:
#include <stdio.h> #include <inttypes.h> #include "types.h" void print_codepoint(int64_t); void print_char (int64_t v) { int64_t codepoint = v >> char_shift; "#\\"); printf(switch (codepoint) { case 0: "nul"); break; printf(case 8: "backspace"); break; printf(case 9: "tab"); break; printf(case 10: "newline"); break; printf(case 11: "vtab"); break; printf(case 12: "page"); break; printf(case 13: "return"); break; printf(case 32: "space"); break; printf(case 127: "rubout"); break; printf(default: print_codepoint(v); } } void print_codepoint(int64_t v) { int64_t codepoint = v >> char_shift; // Print using UTF-8 encoding of codepoint // https://en.wikipedia.org/wiki/UTF-8 if (codepoint < 128) { "%c", (char) codepoint); printf(else if (codepoint < 2048) { } "%c%c", printf(char)(codepoint >> 6) | 192, (char)codepoint & 63) | 128); ((else if (codepoint < 65536) { } "%c%c%c", printf(char)(codepoint >> 12) | 224, (char)(codepoint >> 6) & 63) | 128, ((char)codepoint & 63) | 128); ((else { } "%c%c%c%c", printf(char)(codepoint >> 18) | 240, (char)(codepoint >> 12) & 63) | 128, ((char)(codepoint >> 6) & 63) | 128, ((char)codepoint & 63) | 128); (( } }
Will these pieces in place, we can try out some examples:
Examples
> (define (run e) (bits->value (asm-interp (compile (parse e))))) > (run '#\a) #\a
> (run '(integer->char (add1 (char->integer #\a)))) #\b
> (run '(integer->char 955)) #\λ