Given a structure for the data we wish to encode (Section 4), we need a way of encoding a reference to an object we may have seen before. For primitives (ints, doubles, ...), I just encode the value of the object, without bothering to check if I have seen the object previously.
Otherwise, we need an encoding that either says we have never seen the object before, or identifies the previously seen object. If we have never seen the object before, then at that point we encode all of the fields/components of the object.
I consider a number of approaches to encoding references. The basic approach that worked best was to use a move-to-front encoding. In a move-to-front encoding, we maintain an ordered list of all of the objects seen. Whenever a previously seen object is to be transmitted, we transmit the position of the object in the list (1 for the first object in the list) and move the object to the front of the list. To transmit an object not seen previously, we transmit the value 0 and insert the object at the front of the list.
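To make the scheme concrete, the following is a minimal sketch of such a move-to-front queue, using a plain linked list rather than the Skiplist structure described below; the class and method names (MTFQueue, encodeRef, decodeRef) are illustrative, not taken from the actual implementation.

```java
import java.util.LinkedList;

// Minimal sketch of a move-to-front reference encoding (names are hypothetical).
// 0 means "object not seen before"; a positive value k means "the k-th most
// recently used object in the queue".
class MTFQueue<T> {
    private final LinkedList<T> queue = new LinkedList<>();

    // Compressor side: turn an object reference into an index.
    int encodeRef(T obj) {
        int pos = queue.indexOf(obj);      // O(n) here; the Skiplist makes this cheaper
        if (pos < 0) {
            queue.addFirst(obj);           // first occurrence: transmit 0, then the object's fields
            return 0;
        }
        queue.remove(pos);                 // previously seen: transmit its 1-based position...
        queue.addFirst(obj);               // ...and move it to the front
        return pos + 1;
    }

    // Decompressor side: turn an index back into an object reference.
    // For index 0 the caller decodes the object's fields and passes the result as newObj.
    T decodeRef(int index, T newObj) {
        if (index == 0) {
            queue.addFirst(newObj);
            return newObj;
        }
        T obj = queue.remove(index - 1);   // 1-based position in the queue
        queue.addFirst(obj);
        return obj;
    }
}
```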
I implemented move-to-front queues using a modified form of a Skiplist [Pug90] (the Skiplist structure was modified so that each link recorded the distance it travels forward in the list). By starting the search for an element at the bottom level of the Skiplist, increasing the level to the appropriate level for traversing the Skiplist, and then using a normal Skiplist traversal, I was able to achieve an expected time bound of O(log k) to do a move-to-front operation on element k of the queue, regardless of the total number of elements in the queue.
This was all that was needed in the decompressor. In the compressor, we also need a way, given an element we may have seen before, to determine if we have seen the element before and, if so, where the element is now in the queue. This was implemented by a hashtable from elements to the Skiplist nodes that store them. Once we are at the Skiplist node for an element, we can walk forward to the end of the list (at each node, follow the highest link out of that node, keeping track of the distance traversed by each link). Knowing the distance to the end of the list and the total size of the list allows us to calculate the distance of the element from the front of the list. These operations can all be done in expected time O(log n), where n is the number of elements in the queue.
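The sketch below illustrates this compressor-side bookkeeping: a hashtable maps each element to the node storing it, and the element's position is recovered from its distance to the end of the list together with the list's size. For brevity, a plain singly linked list stands in for the distance-annotated Skiplist, so the walk to the end here is linear rather than the expected O(log n); all names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the compressor-side MTF queue (hypothetical names).
// A plain linked list stands in for the distance-annotated Skiplist, so the
// walk to the end of the list is O(n) here instead of O(log n).
class CompressorMTFQueue<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private Node<T> head;                                    // front of the queue (most recently used)
    private int size;
    private final Map<T, Node<T>> index = new HashMap<>();   // element -> node that stores it

    // Returns 0 for a first occurrence, otherwise the element's 1-based position,
    // and in both cases leaves the element at the front of the queue.
    int encodeRef(T obj) {
        Node<T> node = index.get(obj);
        if (node == null) {
            node = new Node<>(obj);
            node.next = head;
            head = node;
            size++;
            index.put(obj, node);
            return 0;
        }
        // Distance from this node to the end of the list; with the Skiplist we
        // would follow the highest link out of each node and sum the recorded distances.
        int distToEnd = 0;
        for (Node<T> n = node.next; n != null; n = n.next) distToEnd++;
        int position = size - distToEnd;   // 1-based distance from the front

        unlink(node);                      // remove and reinsert at the front
        node.next = head;
        head = node;
        return position;
    }

    private void unlink(Node<T> node) {
        if (head == node) { head = node.next; return; }
        Node<T> prev = head;
        while (prev.next != node) prev = prev.next;
        prev.next = node.next;
    }
}
```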
A move-to-front encoding generally does an excellent job of producing lots of references with small encodings, which can often be encoded in a single byte and compress well with a Huffman encoding. However, a move-to-front encoding pretty much destroys any patterns in the object stream (e.g., an aload_0 instruction is often followed by a getfield instruction). I tried using a move-to-front encoding for JVM opcodes, then using zlib on the result, and got much worse compression than using zlib on the original JVM opcodes. The zlib compression scheme both finds repeating patterns and uses a Huffman-like encoding to efficiently transmit a stream of bytes with a nonuniform distribution. Thus, a move-to-front encoding may do an excellent job when zlib cannot find significant repeating patterns to exploit, but do poorly when such patterns exist.
I compared using zlib on the byte stream generated by a move-to-front encoding with using an arithmetic encoding on the indices generated by a move-to-front scheme. In an arithmetic encoding, encoding an index that occurs with probability p requires log2(1/p) bits. Given the hypothesis that a move-to-front encoding destroys reference patterns and produces only a skewed probability distribution, we would expect the arithmetic encoding to do better. In the cases I examined, this expectation was fulfilled. For example, for references to virtual methods in rt.jar, using zlib gave results that were 2% bigger than an arithmetic encoding.
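As an illustration of how such a comparison can be made, the following sketch computes the ideal arithmetic-coded size of a stream of move-to-front indices, charging log2(1/p) bits for an index with empirical frequency p; this is a hypothetical estimate of the entropy bound, not the exact experimental setup used for the rt.jar measurement.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: ideal arithmetic-coded size of a stream of MTF indices,
// charging log2(1/p) bits per index, where p is the index's empirical frequency.
class ArithmeticSizeEstimate {
    static double idealBits(int[] indices) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int ix : indices) counts.merge(ix, 1, Integer::sum);

        double total = indices.length;
        double bits = 0;
        for (int ix : indices) {
            double p = counts.get(ix) / total;      // empirical probability of this index
            bits += -(Math.log(p) / Math.log(2));   // log2(1/p) bits for this occurrence
        }
        return bits;                                // compare against 8 * (zlib output bytes)
    }
}
```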
However, these results do not include the size of the dictionaries for the arithmetic encoding, and arithmetic encoding is rather expensive for both compression and decompression. The size of the dictionary would be larger than the savings unless it were fitted to a curve and just the parameters for the curve were encoded. Given the negligible or non-existent benefits and the performance cost of replacing the built-in zlib decoder with an arithmetic decoder, I decided that this option wasn't worth pursuing.
Except where noted, separate pools were used for virtual, interface, static, and special method references, and for static and instance field references. The resulting indices are encoded as a byte stream and compressed as described in Section 6.
I do use a common pool for method names across all method types (virtual, static,...), particularly to avoid creating duplicate constant pool entries. However, under this option, we maintain different MTF queues for each method type.
One complication here is that when a method reference is seen for the first time, it must be inserted into all of the MTF queues where it might be seen later.
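A hypothetical sketch of how these pools could fit together is shown below, reusing the MTFQueue sketch from above; the class MethodReferencePools and its methods are illustrative only, and in a real implementation the extra insertions would not emit anything, they would only keep the queues consistent.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of the per-kind method reference pools described above.
// A single name pool is shared across all method kinds, but each kind keeps its
// own MTF queue; a reference seen for the first time is inserted into every
// queue in which it might later be referenced.
class MethodReferencePools {
    enum Kind { VIRTUAL, INTERFACE, STATIC, SPECIAL }

    private final MTFQueue<String> namePool = new MTFQueue<>();            // shared method-name pool
    private final Map<Kind, MTFQueue<String>> refQueues = new EnumMap<>(Kind.class);

    MethodReferencePools() {
        for (Kind k : Kind.values()) refQueues.put(k, new MTFQueue<>());
    }

    // Encode a method reference of the given kind; ref is a canonical
    // "class.name(descriptor)" string standing in for a real reference object.
    int encodeMethodRef(Kind kind, String ref, String name) {
        int index = refQueues.get(kind).encodeRef(ref);
        if (index == 0) {
            // First occurrence anywhere: share the name through the common name
            // pool, and insert the reference into the other kinds' queues so it
            // can be found there later (these insertions emit nothing).
            namePool.encodeRef(name);
            for (Kind k : Kind.values())
                if (k != kind) refQueues.get(k).encodeRef(ref);
        }
        return index;
    }
}
```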