Given a structure for the data we wish to encode (Section 4), we need a way of encoding a reference to an object we may have seen before. For primitives (ints, doubles, ...), I just encode the value of the object, without bothering to check if I have seen the object previously.
Otherwise, we need an encoding that either says we have never seen the object before, or identifies the previously seen object. If we have never seen the object before, then at that point we encode all of the fields/components of the object.
I consider a number of approaches to encoding references. The basic approach that worked best was to use a move-to-front encoding. In a move-to-front encoding, we maintain an ordered list of all of the objects seen. Whenever a previously seen object is to be transmitted, we transmit the position of the object in the list (1 for the first object in the list) and move the object to the front of the list. To transmit an object not seen previously, we transmit the value 0 and insert the object at the front of the list.
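To make the scheme concrete, the following is a minimal sketch of such a move-to-front queue, using a plain linked list rather than the Skiplist structure described below; the class and method names (MTFQueue, encodeRef, decodeRef) are illustrative, not taken from the actual implementation.

```java
import java.util.LinkedList;

// Minimal sketch of a move-to-front reference encoding (names are hypothetical).
// 0 means "object not seen before"; a positive value k means "the k-th most
// recently used object in the queue".
class MTFQueue<T> {
    private final LinkedList<T> queue = new LinkedList<>();

    // Compressor side: turn an object reference into an index.
    int encodeRef(T obj) {
        int pos = queue.indexOf(obj);      // O(n) here; the Skiplist makes this cheaper
        if (pos < 0) {
            queue.addFirst(obj);           // first occurrence: transmit 0, then the object's fields
            return 0;
        }
        queue.remove(pos);                 // previously seen: transmit its 1-based position...
        queue.addFirst(obj);               // ...and move it to the front
        return pos + 1;
    }

    // Decompressor side: turn an index back into an object reference.
    // For index 0 the caller decodes the object's fields and passes the result as newObj.
    T decodeRef(int index, T newObj) {
        if (index == 0) {
            queue.addFirst(newObj);
            return newObj;
        }
        T obj = queue.remove(index - 1);   // 1-based position in the queue
        queue.addFirst(obj);
        return obj;
    }
}
```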
I implemented move-to-front queues using a modified form of a Skiplist [Pug90] (the Skiplist structure was modified so that each link recorded the distance it travels forward in the list). By starting the search for an element at the bottom level of the Skiplist, increasing the level to the appropriate level for traversing the Skiplist, and then using a normal Skiplist traversal, I was able to achieve an expected time bound of O(log k) to do a move-to-front operation on element k of the queue, regardless of the total number of elements in the queue.
This was all that was needed in the decompressor. In the compressor, we also need a way, given an element we may have seen before, to determine if we have seen the element before and, if so, where the element is now in the queue. This was implemented by a hashtable from elements to the Skiplist nodes that store them. Once we are at the Skiplist node for an element, we can walk forward to the end of the list (at each node, follow the highest link out of that node, keeping track of the distance traversed by each link). Knowing the distance to the end of the list and the total size of the list allows us to calculate the distance of the element from the front of the list. These operations can all be done in expected time O(log n), where n is the number of elements in the queue.
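The sketch below illustrates this compressor-side bookkeeping: a hashtable maps each element to the node storing it, and the element's position is recovered from its distance to the end of the list together with the list's size. For brevity, a plain singly linked list stands in for the distance-annotated Skiplist, so the walk to the end here is linear rather than the expected O(log n); all names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the compressor-side MTF queue (hypothetical names).
// A plain linked list stands in for the distance-annotated Skiplist, so the
// walk to the end of the list is O(n) here instead of O(log n).
class CompressorMTFQueue<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private Node<T> head;                                    // front of the queue (most recently used)
    private int size;
    private final Map<T, Node<T>> index = new HashMap<>();   // element -> node that stores it

    // Returns 0 for a first occurrence, otherwise the element's 1-based position,
    // and in both cases leaves the element at the front of the queue.
    int encodeRef(T obj) {
        Node<T> node = index.get(obj);
        if (node == null) {
            node = new Node<>(obj);
            node.next = head;
            head = node;
            size++;
            index.put(obj, node);
            return 0;
        }
        // Distance from this node to the end of the list; with the Skiplist we
        // would follow the highest link out of each node and sum the recorded distances.
        int distToEnd = 0;
        for (Node<T> n = node.next; n != null; n = n.next) distToEnd++;
        int position = size - distToEnd;   // 1-based distance from the front

        unlink(node);                      // remove and reinsert at the front
        node.next = head;
        head = node;
        return position;
    }

    private void unlink(Node<T> node) {
        if (head == node) { head = node.next; return; }
        Node<T> prev = head;
        while (prev.next != node) prev = prev.next;
        prev.next = node.next;
    }
}
```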
A move-to-front encoding generally does an excellent job of producing lots of references with small encodings, which can often be encoded in a single byte and compress well with a Huffman encoding. However, a move-to-front encoding pretty much destroys any patterns in the object stream (e.g., an aload_0 instruction is often followed by a getfield instruction). I tried using a move-to-front encoding for JVM opcodes, then using zlib on the result, and got much worse compression than using zlib on the original JVM opcodes. The zlib compression scheme both finds repeating patterns and uses a Huffman-like encoding to efficiently transmit a stream of bytes with a nonuniform distribution. Thus, a move-to-front encoding may do an excellent job when zlib cannot find significant repeating patterns to exploit, but do poorly when such patterns exist.
I compared using zlib on the byte stream generated by a move-to-front encoding with using an arithmetic encoding on the indices generated by a move-to-front scheme. In an arithmetic encoding, encoding an index that occurs with probability p requires log2(1/p) bits. Given the hypothesis that a move-to-front encoding destroys reference patterns and produces only a skewed probability distribution, we would expect the arithmetic encoding to do better. In the cases I examined, this expectation was fulfilled. For example, for references to virtual methods in rt.jar, using zlib gave results that were 2% bigger than an arithmetic encoding.
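As an illustration of how such a comparison can be made, the following sketch computes the ideal arithmetic-coded size of a stream of move-to-front indices, charging log2(1/p) bits for an index with empirical frequency p; this is a hypothetical estimate of the entropy bound, not the exact experimental setup used for the rt.jar measurement.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: ideal arithmetic-coded size of a stream of MTF indices,
// charging log2(1/p) bits per index, where p is the index's empirical frequency.
class ArithmeticSizeEstimate {
    static double idealBits(int[] indices) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int ix : indices) counts.merge(ix, 1, Integer::sum);

        double total = indices.length;
        double bits = 0;
        for (int ix : indices) {
            double p = counts.get(ix) / total;      // empirical probability of this index
            bits += -(Math.log(p) / Math.log(2));   // log2(1/p) bits for this occurrence
        }
        return bits;                                // compare against 8 * (zlib output bytes)
    }
}
```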
However, these results do not include the size of the dictionaries for the arithmetic encoding, and arithmetic encoding is rather expensive for both compression and decompression. The size of the dictionary would be larger than the savings unless it were fitted to a curve and just the parameters for the curve were encoded. Given the negligible or non-existent benefits and the performance cost of replacing the built-in zlib decoder with an arithmetic decoder, I decided that this option wasn't worth pursuing.
Except where noted, separate pools were used for virtual, interface, static, and special method references, and for static and instance field references. The resulting indices are encoded as a byte stream and compressed as described in Section 6.
I do use a common pool for method names across all method types (virtual, static,...), particularly to avoid creating duplicate constant pool entries. However, under this option, we maintain different MTF queues for each method type.
One complication here is that when a method reference is seen for the first time, it must be inserted into all of the MTF queues where it might be seen later.
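A hypothetical sketch of how these pools could fit together is shown below, reusing the MTFQueue sketch from above; the class MethodReferencePools and its methods are illustrative only, and in a real implementation the extra insertions would not emit anything, they would only keep the queues consistent.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of the per-kind method reference pools described above.
// A single name pool is shared across all method kinds, but each kind keeps its
// own MTF queue; a reference seen for the first time is inserted into every
// queue in which it might later be referenced.
class MethodReferencePools {
    enum Kind { VIRTUAL, INTERFACE, STATIC, SPECIAL }

    private final MTFQueue<String> namePool = new MTFQueue<>();            // shared method-name pool
    private final Map<Kind, MTFQueue<String>> refQueues = new EnumMap<>(Kind.class);

    MethodReferencePools() {
        for (Kind k : Kind.values()) refQueues.put(k, new MTFQueue<>());
    }

    // Encode a method reference of the given kind; ref is a canonical
    // "class.name(descriptor)" string standing in for a real reference object.
    int encodeMethodRef(Kind kind, String ref, String name) {
        int index = refQueues.get(kind).encodeRef(ref);
        if (index == 0) {
            // First occurrence anywhere: share the name through the common name
            // pool, and insert the reference into the other kinds' queues so it
            // can be found there later (these insertions emit nothing).
            namePool.encodeRef(name);
            for (Kind k : Kind.values())
                if (k != kind) refQueues.get(k).encodeRef(ref);
        }
        return index;
    }
}
```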