12 Hustle: heaps and lists
A little and a little, collected together, become a great deal; the heap in the barn consists of single grains, and drop and drop makes an inundation.
12.1 Inductive data
So far all of the data we have considered can fit in a single machine word (64-bits). Well, integers can’t, but we truncated them and only consider, by fiat, those integers that fit into a register.
In the Hustle language, we will add two inductively defined data types, boxes and pairs, which will require us to relax this restriction.
Boxes are like unary pairs, they simply hold a value, which can be projected out. Pairs hold two values which each can be projected out.
The new operations include constructors (box e) and (cons e0 e1) and projections (unbox e), (car e), and (cdr e).
Usually boxes are mutable data structures, like OCaml’s ref type, but we will examine this aspect later. For now, we treat boxes as immutable data structures.
We use the following grammar for Hustle:
We can model this as an AST data type:
#lang racket (provide (all-defined-out)) ;; type Expr = ;; | Integer ;; | Boolean ;; | Variable ;; | Prim1 Expr ;; | Prim2 Expr Expr ;; | If Expr Expr Expr ;; | Let (Binding list) Expr ;; | Nil ;; type Prim1 = 'add1 | 'sub1 | 'zero? | box | unbox | car | cdr ;; type Prim2 = '+ | '- | cons ;; type Binding = Variable Expr ;; type Variable = Symbol (except 'add1 'sub1 'if, etc.) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;;;;;; The AST data structure ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; The AST can be viewed as having 'kinds' of nodes. ;; ;; * The nodes that represnt an expression themselves ;; ;; * The nodes that are part of an expression, but no an expression themselves ;; The below are the former: (struct int-e (i) #:transparent) (struct bool-e (b) #:transparent) (struct var-e (v) #:transparent) (struct prim-e (p es) #:transparent) (struct if-e (e t f) #:transparent) (struct let-e (bs b) #:transparent) (struct nil-e () #:transparent) ;; The next is the latter: ;; A binding holds a symbol representing the bound variable and ;; Expr that represents the value that will be bound to that variable (struct binding (v e) #:transparent) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;;;;;; AST utility functions (predicates) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; (define unops '(add1 sub1 zero? box unbox car cdr)) (define biops '(+ - cons)) ;; Any -> Boolean (define (prim? x) (and (symbol? x) (memq x (append unops biops)))) ;; Any -> Boolean (define (biop? x) (and (symbol? x) (memq x biops))) ;; Any -> Boolean (define (unop? x) (and (symbol? x) (memq x unops))) (define (value? v) (or (int-e? v) (bool-e? v))) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;;;;;; AST utility functions (getters) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; It will sometimes be useful to get the list of all the variables that are ;; introduced by a `let` ;; [Binding] -> [Symbol] (define (get-vars bs) (match bs ['() '()] [(cons (binding v _) bs) (cons v (get-vars bs))])) ;; Get all of the _definitions_ from a list of bindings ;; [Binding] -> [Expr] (define (get-defs bs) (match bs ['() '()] [(cons (binding _ def) bs) (cons def (get-defs bs))])) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;;;;;; AST utility functions (printers) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; We have switched to using `#:transparent` above, so this should only be ;; necessary if you're desperate when debugging :'( ;; Given an AST, construct an sexpr that has the same shape (define (ast-debug a) (match a [(int-e i) `(int-e ,i)] [(bool-e b) `(bool-e ,b)] [(var-e v) `(var-e ,v)] [(nil-e) ''()] [(prim-e p es) `(prim-e ,p ,@(map ast-debug es))] [(if-e e t f) `(if-e ,(ast-debug e) ,(ast-debug t) ,(ast-debug f))] [(let-e bs b) `(let-e ,(binding-debug bs) ,(ast-debug b))])) (define (binding-debug bnds) (match bnds ['() '()] [(cons (binding v e) bnds) `((,v ,(ast-debug e)) ,@(binding-debug bnds))]))
12.2 Meaning of Hustle programs
The meaning of Hustle programs is just a slight update to Grift programs, namely we add a few new primitives.
The update to the semantics is just an extension of the semantics of primitives:
The interpreter similarly has an update to the interp-prim function. On the relevant bits of interp.rkt are shown:
; Any -> Boolean (define (prim? x) (and (symbol? x) (memq x '(add1 sub1 + - zero? ; New box unbox cons car cdr)))) ; Prim [Listof Answer] -> Answer (define (interp-prim p as) (match (cons p as) [(list 'add1 (? integer? i0)) (+ i0 1)] [(list 'sub1 (? integer? i0)) (- i0 1)] [(list 'zero? (? integer? i0)) (zero? i0)] [(list '+ (? integer? i0) (? integer? i1)) (+ i0 i1)] [(list '- (? integer? i0) (? integer? i1)) (- i0 i1)] ; New for Hustle [(list 'box v0) (box v0)] [(list 'unbox (? box? v0)) (unbox v0)] [(list 'cons v0 v1) (cons v0 v1)] [(list 'car (cons v0 v1)) v0] [(list 'cdr (cons v0 v1)) v1] [_ 'err]))
Inductively defined data is easy to model in the semantics and interpreter because we can rely on inductively defined data at the meta-level in math or Racket, respectively.
The real trickiness comes when we want to model such data in an impoverished setting that doesn’t have such things, which of course is the case in assembly.
The problem is that a value such as (box v) has a value inside it. Pairs are even worse: (cons v0 v1) has two values inside it. If each value is represented with 64 bits, it would seem a pair takes at a minimum 128-bits to represent (plus we need some bits to indicate this value is a pair). What’s worse, those v0 and v1 may themselves be pairs or boxes. The great power of inductive data is that an arbitrarily large piece of data can be constructed. But it would seem impossible to represent each piece of data with a fixed set of bits.
The solution is to allocate such data in memory, which can in principle be arbitrarily large, and use a pointer to refer to the place in memory that contains the data.
12.3 A Compiler for Hustle
The first thing do is make another distinction in the kind of values in our language. Up until now, each value could be represented in a register. We now call such values immediate values.
We introduce a new category of values which are pointer values. We will (for now) have two types of pointer values: boxes and pairs.
So we now have a kind of hierarchy of values:
- values |
+ pointers (non-zero in last 3 bits) |
* boxes |
* pairs |
+ immediates (zero in last three bits) |
* integers |
* characters |
* booleans |
* ... |
We will represent this hierarchy by shifting all the immediates over 3 bits and using the lower 3 bits to tag things as either being immediate (tagged #b000) or a box or pair. To recover an immediate value, we just shift back to the right 3 bits.
The pointer types will be tagged in the lowest three bits. A box value is tagged #b001 and a pair is tagged #b010. The remaining 61 bits will hold a pointer, i.e. an integer denoting an address in memory.
The idea is that the values contained within a box or pair will be located in memory at this address. If the pointer is a box pointer, reading 64 bits from that location in memory will produce the boxed value. If the pointer is a pair pointer, reading the first 64 bits from that location in memory will produce one of the value in the pair and reading the next 64 bits will produce the other. In other words, constructors allocate and initialize memory. Projections dereference memory.
The representation of pointers will follow a slightly different scheme than that used for immediates. Let’s first talk a bit about memory and addresses.
A memory location is represented (of course, it’s all we have!) as a number. The number refers to some address in memory. On an x86 machine, memory is byte-addressable, which means each address refers to a 1-byte (8-bit) segment of memory. If you have an address and you add 1 to it, you are refering to memory starting 8-bits from the original address.
We will make a simplifying assumption and always store things in memory in multiples of 64-bit chunks. So to go from one memory address to the next word of memory, we need to add 8 (1-byte times 8 = 64 bits) to the address.
What is 8 in binary? #b1000
What’s nice about this is that if we start from a memory location that is “word-aligned,” i.e. it ends in #b000, then every 64-bit index also ends in #b000.
What this means is that every address we’d like to represent has #b000 in its least signficant bits. We can therefore freely uses these three bits to tag the type of the pointer without needing to shift the address around. If we have a box pointer, we can simply zero out the box type tag to obtain the address of the boxes content. Likewise with pairs.
We use a register, 'rdi, to hold the address of the next free memory location in memory. To allocate memory, we simply increment the content of 'rdi by a multiple of 8. To initialize the memory, we just write into the memory at that location. To contruct a pair or box value, we just tag the unused bits of the address.
So for example the following creates a box containing the value 7:
`((mov rax ,(arithmetic-shift 7 imm-shift)) (mov (offset rdi 0) rax) ; write '7' into address held by rdi (mov rax rdi) ; copy pointer into return register (or rax ,type-box) ; tag pointer as a box (add rdi 8)) ; advance rdi one word
If 'rax holds a box value, we can “unbox” it by erasing the box tag, leaving just the address of the box contents, then dereferencing the memory:
`((xor rax ,type-box) ; erase the box tag (mov rax (offset rax 0))) ; load memory into rax
Pairs are similar. Suppose we want to make (cons 3 4):
`((mov rax ,(arithmetic-shift 3 imm-shift)) (mov (offset rdi 0) rax) ; write '3' into address held by rdi (mov rax ,(arithmetic-shift 4 imm-shift)) (mov (offset rdi 1) rax) ; write '4' into word after address held by rdi (mov rax rdi) ; copy pointer into return register (or rax ,type-pair) ; tag pointer as a pair (add rdi 16)) ; advance rdi 2 words
If 'rax holds a pair value, we can project out the elements by erasing the pair tag, leaving just the address of the pair contents, then dereferencing either the first or second word of memory:
`((xor rax ,type-pair) ; erase the pair tag (mov rax (offset rax 0)) ; load car into rax (mov rax (offset rax 1))) ; or... load cdr into rax
From here, writing the compiler for box, unbox, cons, car, and cdr is just a matter of putting together pieces we’ve already seen such as evaluating multiple subexpressions and type tag checking before doing projections.
The complete compiler is given below.
#lang racket (provide (all-defined-out)) (require "ast.rkt") ;; An immediate is anything ending in #b000 ;; All other tags in mask #b111 are pointers (define result-shift 3) (define result-type-mask (sub1 (arithmetic-shift 1 result-shift))) (define type-imm #b000) (define type-box #b001) (define type-pair #b010) (define imm-shift (+ 3 result-shift)) (define imm-type-mask (sub1 (arithmetic-shift 1 imm-shift))) (define imm-type-int (arithmetic-shift #b000 result-shift)) (define imm-val-true (arithmetic-shift #b001 result-shift)) (define imm-val-false (arithmetic-shift #b010 result-shift)) (define imm-val-empty (arithmetic-shift #b011 result-shift)) ;; Allocate in 64-bit (8-byte) increments, so pointers ;; end in #b000 and we tag with #b001 for boxes, etc. ;; type CEnv = (Listof (Maybe Variable)) ;; type Imm = Integer | Boolean | '() ;; Expr -> Asm (define (compile e) `(entry ,@(compile-e e '()) ret err (push rbp) (call error) ret)) ;; Expr CEnv -> Asm (define (compile-e e c) (match e [(? imm? i) (compile-imm i)] [(var-e v) (compile-var v c)] [(prim-e (? prim? p) es) (compile-prim p es c)] [(if-e p t f) (compile-if p t f c)] [(let-e (list b) body) (compile-let b body c)])) ;; Any -> Boolean (define (imm? x) (or (int-e? x) (bool-e? x) (nil-e? x))) ;; Imm -> Asm (define (compile-imm i) `((mov rax ,(match i [(int-e i) (arithmetic-shift i imm-shift)] [(bool-e b) (if b imm-val-true imm-val-false)] [(nil-e) imm-val-empty])))) (define (compile-prim p es c) (match (cons p es) [`(add1 ,e0) (compile-add1 e0 c)] [`(sub1 ,e0) (compile-sub1 e0 c)] [`(zero? ,e0) (compile-zero? e0 c)] [`(+ ,e0 ,e1) (compile-+ e0 e1 c)] [`(- ,e0 ,e1) (compile-- e0 e1 c)] ;; New for Hustle [`(box ,e0) (compile-box e0 c)] [`(unbox ,e0) (compile-unbox e0 c)] [`(cons ,e0 ,e1) (compile-cons e0 e1 c)] [`(car ,e0) (compile-car e0 c)] [`(cdr ,e0) (compile-cdr e0 c)] [_ (error (format "prim applied to wrong number of args: ~a ~a" p es))])) ;; Variable CEnv -> Asm (define (compile-var x c) (let ((i (lookup x c))) `((mov rax (offset rsp ,(- (add1 i))))))) ;; Expr CEnv -> Asm (define (compile-box e0 c) (let ((c0 (compile-e e0 c))) `(,@c0 (mov (offset rdi 0) rax) (mov rax rdi) (or rax ,type-box) (add rdi 8)))) ; allocate 8 bytes ;; Expr CEnv -> Asm (define (compile-unbox e0 c) (let ((c0 (compile-e e0 c))) `(,@c0 ,@assert-box (xor rax ,type-box) (mov rax (offset rax 0))))) ;; Expr Expr CEnv -> Asm (define (compile-cons e0 e1 c) (let ((c0 (compile-e e0 c)) (c1 (compile-e e1 (cons #f c)))) `(,@c0 (mov (offset rsp ,(- (add1 (length c)))) rax) ,@c1 (mov (offset rdi 0) rax) (mov rax (offset rsp ,(- (add1 (length c))))) (mov (offset rdi 1) rax) (mov rax rdi) (or rax ,type-pair) (add rdi 16)))) ;; Expr CEnv -> Asm (define (compile-car e0 c) (let ((c0 (compile-e e0 c))) `(,@c0 ,@assert-pair (xor rax ,type-pair) ; untag (mov rax (offset rax 1))))) (define (zero?-asm l0 l1) `(,@assert-integer (cmp rax 0) (mov rax ,imm-val-false) (jne ,l0) (mov rax ,imm-val-true) ,l0)) ;; Expr CEnv -> Asm (define (compile-cdr e0 c) (let ((c0 (compile-e e0 c))) `(,@c0 ,@assert-pair (xor rax ,type-pair) ; untag (mov rax (offset rax 0))))) ;; Expr CEnv -> Asm (define (compile-add1 e0 c) (let ((c0 (compile-e e0 c))) `(,@c0 ,@assert-integer (add rax ,(arithmetic-shift 1 imm-shift))))) ;; Expr CEnv -> Asm (define (compile-sub1 e0 c) (let ((c0 (compile-e e0 c))) `(,@c0 ,@assert-integer (sub rax ,(arithmetic-shift 1 imm-shift))))) ;; Expr CEnv -> Asm (define (compile-zero? e0 c) (let ((c0 (compile-e e0 c)) (l0 (gensym)) (l1 (gensym))) `(,@c0 ,@(zero?-asm l0 l1)))) ;; Expr Expr Expr CEnv -> Asm (define (compile-if e0 e1 e2 c) (let ((c0 (compile-e e0 c)) (c1 (compile-e e1 c)) (c2 (compile-e e2 c)) (l0 (gensym)) (l1 (gensym))) `(,@c0 (cmp rax ,imm-val-false) (je ,l0) ,@c1 (jmp ,l1) ,l0 ,@c2 ,l1))) ;; Variable Expr Expr CEnv -> Asm (define (compile-let b e1 c) (match b [(binding v def) (let ((c0 (compile-e def c)) (c1 (compile-e e1 (cons v c)))) `(,@c0 (mov (offset rsp ,(- (add1 (length c)))) rax) ,@c1))] [_ (error "Compile-let can only handle bindings")])) ;; Expr Expr CEnv -> Asm (define (compile-+ e0 e1 c) (let ((c1 (compile-e e1 c)) (c0 (compile-e e0 (cons #f c)))) `(,@c1 ,@assert-integer (mov (offset rsp ,(- (add1 (length c)))) rax) ,@c0 ,@assert-integer (add rax (offset rsp ,(- (add1 (length c)))))))) ;; Expr Expr CEnv -> Asm (define (compile-- e0 e1 c) (let ((c1 (compile-e e1 c)) (c0 (compile-e e0 (cons #f c)))) `(,@c1 ,@assert-integer (mov (offset rsp ,(- (add1 (length c)))) rax) ,@c0 ,@assert-integer (sub rax (offset rsp ,(- (add1 (length c)))))))) ;; Variable CEnv -> Natural (define (lookup x cenv) (match cenv ['() (error "undefined variable:" x)] [(cons y cenv) (match (eq? x y) [#t (length cenv)] [#f (lookup x cenv)])])) (define (assert-type mask type) `((mov rbx rax) (and rbx ,mask) (cmp rbx ,type) (jne err))) (define assert-integer (assert-type imm-type-mask imm-type-int)) (define assert-box (assert-type result-type-mask type-box)) (define assert-pair (assert-type result-type-mask type-pair)) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;;;;; You don't need to know this, it's bonus fun ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; (define (compile-map-zero? e0 c) (let ((c0 (compile-e e0 c)) (lend (gensym "end")) (lz1 (gensym "nz")) (lz2 (gensym "z")) (lempty (gensym "empty")) (loop (gensym "loop_start"))) `(,@c0 (cmp rax ,imm-val-empty) ; if the input list is empty, we're done (je ,lend) ,@assert-pair ; otherwise, it should be a pair (cons) (mov (offset rsp ,(- (add1 (length c)))) rdi) ; store initial heap pointer ,loop ; Start of our loop (xor rax ,type-pair) ; untag the pair (mov rcx rax) ; store pointer in rcx (mov rax (offset rax 1)) ; check that `car` is an integer ,@assert-integer ,@(zero?-asm lz1 lz2) ; rax is zero?, leave the bool in rax (mov (offset rdi 1) rax) ; put the boolean in the car of the new list cell (mov rax (offset rcx 0)) ; move cdr of input into rax (cmp rax ,imm-val-empty) ; if the cdr is the empty list, finish up (je ,lempty) ,@assert-pair ; if it's not empty, it better be a pair (mov rdx rdi) ; move a copy of the heap pointer into rdx (add rdx 16) ; figure out where the next cons is going to be (or rdx ,type-pair) ; tag the next cons as a pair (mov (offset rdi 0) rdx) ; the cdr of our new cons is where the next one goes (add rdi 16) ; bump the heap pointer ; since rax wasn't the empty list ; it's a pointer to out next cons (jmp ,loop) ; iterate ,lempty (mov rax ,imm-val-empty) (mov (offset rdi 0) rax) (mov rax rdi) (add rdi 16) (mov rax (offset rsp ,(- (add1 (length c))))) ; Get initial heap pointer back (or rax ,type-pair) ; tag the initial thing as a pair ,lend)))
12.4 A Run-Time for Hustle
The run-time system for Hustle is more involved for two main reasons:
The first is that the compiler relies on a pointer to free memory residing in 'rdi. The run-time system will be responsible for allocating this memory and initializing the 'rdi register. To allocate memory, it uses malloc. It passes the pointer returned by malloc to the entry function. The protocol for calling functions in C says that the first argument will be passed in the 'rdi register. Since malloc produces 16-byte aligned addresses on 64-bit machines, 'rdi is initialized with an address that ends in #b000, satisfying our assumption about addresses.
The second complication comes from printing. Now that values include inductively defined data, the printer must recursively traverse these values to print them.
The complete run-time system is below.
#include <stdio.h> #include <stdlib.h> #include <inttypes.h> #define result_shift 3 #define result_type_mask ((1 << result_shift) - 1) #define type_imm 0 #define type_box 1 #define type_pair 2 #define imm_shift (3 + result_shift) #define imm_type_mask ((1 << imm_shift) - 1) #define imm_type_int (0 << result_shift) #define imm_val_true (1 << result_shift) #define imm_val_false (2 << result_shift) #define imm_val_empty (3 << result_shift) // in bytes #define heap_size 1000000 int64_t entry(void *); void print_result(int64_t); void print_pair(int64_t); void print_immediate(int64_t); int main(int argc, char** argv) { void * heap = malloc(heap_size); int64_t result = entry(heap); print_result(result); printf("\n"); free(heap); return 0; } void error() { printf("err"); exit(1); } void internal_error() { printf("rts-error"); exit(1); } void print_result(int64_t a) { switch (result_type_mask & a) { case type_imm: print_immediate(a); break; case type_box: printf("#&"); print_result (*((int64_t *)(a ^ type_box))); break; case type_pair: printf("("); print_pair(a); printf(")"); break; default: internal_error(); } } void print_immediate(int64_t a) { switch (imm_type_mask & a) { case imm_type_int: printf("%" PRId64, a >> imm_shift); break; case imm_val_true: printf("#t"); break; case imm_val_false: printf("#f"); break; case imm_val_empty: printf("()"); break; default: break; internal_error(); } } void print_pair(int64_t a) { int64_t car = *((int64_t *)((a + 8) ^ type_pair)); int64_t cdr = *((int64_t *)((a + 0) ^ type_pair)); print_result(car); if ((imm_type_mask & cdr) == imm_val_empty) { // nothing } else if ((result_type_mask & cdr) == type_pair) { printf(" "); print_pair(cdr); } else { printf(" . "); print_result(cdr); } }