If you’ve had a computer architecture course, you’ve definitely heard of CISC and RISC systems. New VLIW systems seem to take this to the next step, a Post-RISC era. The step is easier to see if you first compare CISC and RISC to one another. Here is a quick and easy review for those of you who have only heard the buzzwords:
CISC – Complex Instruction Set Computer. A CISC CPU has an instruction for just about everything. The idea is to move all of the complexity into the hardware and make the software much less demanding. The purpose of this is to decrease code size, even though the CPI (cycles per instruction) may increase.
RISC – Reduced Instruction Set Computer. A RISC computer takes much of the complexity away from the hardware and focuses instead on the software. The instruction set itself is smaller and simpler, so programs need more instructions to do the same work. Unlike CISC users, RISC users accept those longer code sequences in order to decrease the total CPI.
Taking a closer look, we see that support for High Level Languages (HLLs) is found in the hardware on CISC machines and in the software on RISC machines. The comparison is easy to grasp, but since both approaches support HLLs it can be hard to tell which category most computers today fall into.
So, “What kind of computer do I have?” you may ask. Most CPUs on the market today, such as the AMD Athlon, Pentium III, and Pentium 4, cannot really be classified as simply CISC or RISC. Over the years, the distinction between these two architectural ideals has grown smaller and smaller. Most of these newer processors support a large set of instructions like a CISC machine, but they also use strategies such as pipelined execution, which is a RISC idea. Obviously there are pros and cons to both, so you might as well use as many of the pros as you can.
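To make that tradeoff concrete, here is a quick back-of-the-envelope sketch in Python of the classic performance equation, time = instructions × CPI × clock period. The instruction counts and CPI values below are made up purely for illustration; they are not measurements of any real machine.

    # Hypothetical numbers for the same task compiled two ways,
    # chosen only to illustrate the code-size vs. CPI tradeoff.
    clock_period_ns = 2.0   # assume both machines run the same clock

    machines = {
        "CISC": {"instructions": 50,  "cpi": 6.0},   # fewer instructions, higher CPI
        "RISC": {"instructions": 120, "cpi": 1.5},   # more instructions, lower CPI
    }

    for name, m in machines.items():
        time_ns = m["instructions"] * m["cpi"] * clock_period_ns
        print(name, "->", time_ns, "ns")

With these made-up numbers the RISC version runs in 120 × 1.5 × 2 = 360 ns against CISC’s 50 × 6 × 2 = 600 ns, even though its code is more than twice as long: that is the bet RISC makes.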
When looking at the evolution of the RISC machine, you can see a striking similarity in philosophy with machines that use VLIW. The basic idea of VLIW (which can be found in earlier pages) is to pack many instructions into one Very Long Instruction Word. This is a Post-RISC idea that follows the pattern of ISA evolution naturally: it reduces the number of instructions issued even further. Each instruction word becomes denser, and after the bugs of the 80’s were worked out this became a valid idea. Since superscalar implementations are becoming harder to debug, and RISC ISAs are tied to an older style of instruction set, VLIW is likely to become more popular. For example, IBM considered canceling their PPC 620 (which uses a RISC ISA) after they couldn’t get it to work correctly through a full year of debugging.
In the 1960s it became practical to build CPUs with additional hardware and functional units (adders, multipliers, etc.), since transistors had become cheap. If this additional hardware was included and utilized, it was possible to execute multiple instructions in parallel. An
architecture that handles one instruction at a time is called a scalar
architecture. A scalar architecture’s
goal is to complete one instruction per clock cycle. If an architecture can handle multiple instructions at a time it
is called superscalar, and the goal of a superscalar architecture is to complete
as many instructions as possible per clock cycle. For instructions to be handled simultaneously, they need to be independent of each other; that is, there must be no dependences between them, so that they are dealing with different registers. VLIW is a type of
superscalar architecture. The key idea
in the superscalar paradigm, and thus VLIW, is instruction level parallelism.
So naturally, the beginning of VLIW processors can be traced back to the
discovery of ways to exploit instruction level parallelism in computer code.
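Since independence is what makes parallel issue possible, it is worth seeing what the check looks like. Here is a minimal sketch in Python; the encoding of an instruction as a destination register plus source registers is invented just for this example.

    # An instruction is modeled as (dest_register, (source_registers, ...)).
    def independent(a, b):
        # True if a and b can safely execute in the same cycle.
        # Checks the three classic hazards: read-after-write,
        # write-after-read, and write-after-write.
        a_dest, a_srcs = a
        b_dest, b_srcs = b
        raw = a_dest in b_srcs      # one reads what the other writes
        war = b_dest in a_srcs      # one writes what the other reads
        waw = a_dest == b_dest      # both write the same register
        return not (raw or war or waw)

    add = ("r1", ("r2", "r3"))      # r1 = r2 + r3
    mul = ("r4", ("r5", "r6"))      # r4 = r5 * r6, shares nothing with add
    use = ("r7", ("r1", "r4"))      # r7 = r1 + r4, needs both results

    print(independent(add, mul))    # True: these could issue together
    print(independent(add, use))    # False: read-after-write on r1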
In
1963 the Control Data Corporation built the CDC 6600, a notable computer
system. It was the first computer to
make use of a scoreboard, which was a hardware solution used to schedule
instructions in an efficient manner. In addition, it included 10 functional units, several more than strictly necessary. These additional units, an incrementer and a multiplier among them, could be used to do two of those operations at once, speeding up the processor.
However, the way it achieved this instruction level parallelism was slightly different from that of a VLIW processor. At its core, both the CDC and a VLIW machine use multiple functional units to execute multiple instructions. However, the CDC used a scoreboard to designate which discrete instructions could be executed simultaneously. In a VLIW processor there is no such hardware; the instructions are issued one by one, but each is composed of multiple operations. So how does the VLIW hardware know which instruction to process? It simply picks the next one. All scheduling and optimization are done at compile time, not at run time like the scoreboard. In a VLIW processor the hardware’s job is to run the code and little else; in other superscalar architectures, the hardware plays a more significant role.
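Here is a minimal sketch of what that compile-time job looks like, reusing the independent() test and example instructions from the sketch above. It greedily packs a straight-line sequence into fixed-width words and pads the leftover slots with no-ops; the three-slot width and the no-op marker are invented for illustration.

    NOP = ("--", ())                # empty slot filler

    def pack(instrs, width=3):
        # Greedily pack instructions, in order, into VLIW words.
        # An instruction joins the current word only if it fits and is
        # independent of everything already in the word. All of this
        # happens in the compiler, so the hardware just issues one
        # finished word per cycle.
        words, current = [], []
        for ins in instrs:
            if len(current) < width and all(independent(ins, x) for x in current):
                current.append(ins)
            else:
                words.append(current + [NOP] * (width - len(current)))
                current = [ins]
        if current:
            words.append(current + [NOP] * (width - len(current)))
        return words

    for word in pack([add, mul, use]):
        print(word)                 # (add, mul, nop) then (use, nop, nop)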
The
precursor to VLIW processors predates the CDC 6600, and is called horizontal
microcode. Initially, CPUs were built with very complicated instruction sets, primarily to ease the task of writing assembly language programs for the architecture and to reduce static code size. Instead of being implemented directly in hardware, most of these instructions were aliases for sequences of simpler primitive instructions that were actually executed by the CPU. The translations between the assembly language and the primitive instructions were stored in a separate ROM that sat between the code and the CPU.
This system was called microprogramming and was proposed by Maurice
Wilkes in 1951. This idea can also be
seen as a precursor to the code morphing technology of the Crusoe, which will
be discussed later.
In 1970, several researchers studied instruction level parallelism and its potential to improve computing performance. The work of Tjaden and Flynn concluded that the potential speedup from ILP was only a factor of two or three, because of the lack of independent instructions within a basic block of code. This was dubbed Flynn’s bottleneck, and it stifled research into instruction level parallelism and VLIW for close to a decade. However, while their methods were sound,
their conclusions were not, since they had neglected the possibility of code
motion, moving instructions between basic blocks separated by branches.
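A toy example makes the oversight easier to see. Inside one basic block there may be little to run in parallel, but an instruction sitting after a branch can be hoisted above it, provided it does not touch anything the earlier block (including the branch test) reads or writes, and has no side effects if the branch goes the other way. This reuses the instruction encoding and independent() test from the sketches above; all register names are invented.

    # block1 computes r1 and then branches on it; block2 follows the branch.
    block1 = [("r1", ("r2", "r3"))]         # r1 = r2 + r3, branch tests r1
    block2 = [("r9", ("r8", "r8")),         # r9 = r8 + r8: touches neither
              ("r5", ("r1", "r4"))]         # r5 = r1 + r4: needs the branch value

    def hoistable(ins, earlier_block):
        # Safe to move above the branch only if independent of every
        # instruction in the earlier block.
        return all(independent(ins, earlier) for earlier in earlier_block)

    for ins in block2:
        print(ins, "hoistable:", hoistable(ins, block1))

The first instruction of block2 can move up and run alongside the add, which is exactly the parallelism a basic-block-only analysis cannot see.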
Much of the renewed belief that VLIW and ILP could yield significant performance gains can be attributed to Josh Fisher, who in 1979 created trace scheduling. Trace
scheduling in the basic sense scans through code and looks for blocks of code
that are independent. If such blocks of
code exist, then these sets of instructions can be run in parallel, as if they
were two separate processes. The
process is actually more complicated, and is broken down into two steps. The first step, trace selection, combines
basic blocks of code into a straight-line sequence. This sequence can include portions of code from loops or
branches. After this code is generated,
it then enters the next stage of trace scheduling, called trace
compaction. In this stage the sequence
is flattened out into a smaller sequence of wider (parallel) instructions. This is important as one of the concerns
with parallel computing is keeping all the additional hardware busy. VLIW machines are “dumb processors” and lack
much of the sophisticated scheduling hardware that conventional superscalar
processors have. The capability to generate parallelism statically was an
important step toward making VLIW a viable design strategy. It’s no wonder that Fisher titled one of his papers “Parallel Processing: A Smart Compiler and a Dumb Machine.” This advance led to the development of the
ELI-512 processor and the Bulldog trace-scheduling compiler.
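A minimal sketch of the trace-selection half, with an invented control-flow graph: each block names its successors and how likely each is to be taken (in practice estimated from profiling), and the trace simply follows the most probable path. Compaction would then pack the resulting straight-line sequence into wide words, much like the pack() sketch earlier.

    # Control-flow graph: block -> list of (successor, probability).
    cfg = {
        "A": [("B", 0.9), ("C", 0.1)],
        "B": [("D", 1.0)],
        "C": [("D", 1.0)],
        "D": [],
    }

    def select_trace(start, cfg):
        # Follow the most probable successor from block to block.
        # Off-trace paths later get compensation code so the program
        # remains correct when the unlikely branch is taken.
        trace, block = [start], start
        while cfg[block]:
            block = max(cfg[block], key=lambda edge: edge[1])[0]
            trace.append(block)
        return trace

    print(select_trace("A", cfg))   # ['A', 'B', 'D']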
Most early VLIW machines should really be called LIW machines, as the number of instructions that were formed into a word was usually no more than two. The Floating Point Systems AP-120B, developed by Allen Charlesworth and Glen Culler, was an example of such a computer.
In 1984 several companies made VLIW processors more advanced than anything seen before on the market. Josh Fisher, the inventor of trace scheduling, founded his own computer company, Multiflow. Multiflow produced the Trace/200, Trace/300
and finally Trace/500. The Trace computers handled instruction words that were
7, 14 or 28 operations wide and used trace scheduling during compilation to
extract instruction level parallelism.
Another VLIW innovator, Bob Rau, formed the company Cydrome and produced
the Cydra 5.
In 1986
Culler Systems developed the Culler 7, headed by the owner of the company, Glen
Culler. The Culler 7 had two types of instructions: A instructions, which were basic ALU operations, and X instructions, which were microcode programs that could consist of multiple operations. The Culler 7 had separate registers and memories for the A and X instruction types, and thus achieved parallelism by concatenating an A and an X instruction into a single word and executing them together.
Other VLIW processors were introduced as well. Intel experimented with VLIW in the i860. In 1988 Apollo Systems released the Apollo PRISM, which was a 3-wide LIW processor. The PRISM was capable of combining a floating point add, a floating point multiply, and an integer instruction into one word.
All of these companies eventually failed; however, the new hardware and compiler techniques were released into the mainstream and provided a starting point for several other chipmakers.
In 1996 Philips introduced a “media processor” that implemented VLIW, called the Trimedia chip. The chip possessed hardware for audio and video I/O, compression, and image coprocessing. This was a more advanced VLIW architecture: it had 128 general purpose registers and 27 functional units. With multiple functional units, instruction words can be made wider to occupy them. The Trimedia chip also used speculative execution, which is a common technique in newer VLIW processors. Instead of stalling to wait for the correct outcome of a branch, it takes advantage of VLIW’s raw processing power to execute both outcomes and commit the correct one once it is known (a toy sketch of this appears after this paragraph). Texas Instruments released a similar DSP
called the TMS320C6x in 1997. It was never
designed to be a general purpose processor, but performed well for digital
signal applications. The TMS320C6x supported variable
length instruction words. It also made
extensive use of static compiler optimizations.
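A toy sketch of that style of speculation: both arms of the branch are computed, as spare VLIW slots would compute them, and only the result selected by the condition is committed. Everything here is invented for illustration.

    def speculative_select(cond, then_arm, else_arm):
        # Evaluate both arms eagerly (in hardware these would fill
        # otherwise idle functional units), then commit only the one
        # the branch condition actually selects.
        taken = then_arm()
        not_taken = else_arm()
        return taken if cond else not_taken

    x = 7
    print(speculative_select(x > 5, lambda: x * 2, lambda: x + 100))  # 14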
Started in 1995 and released in 2000 by Transmeta, the Crusoe takes a somewhat different approach to VLIW. It is not targeted at the pure performance market and as such does not implement as complicated a processor as previous VLIW architectures. While some VLIW processors support out-of-order issue and dynamic scheduling, the Crusoe does not. The processor consequently takes a performance hit, but an acceptable one. The goal of the Crusoe was to make a processor that was power efficient while maintaining 80x86 compatibility. The processor was targeted specifically at the mobile PC market, where trading performance for battery life is acceptable and even desirable. Additionally, because of the simplified hardware and straightforward execution, the Crusoe allows for a smaller die size, lower manufacturing costs, and less heat output. These qualities are beneficial in any target market.

In order to achieve some of the performance gains of more complicated VLIW processors, the Crusoe uses a dynamic compilation technique called Code Morphing, which translates and optimizes x86 instructions into native Crusoe VLIW instructions at run-time. Note that this is a large and important departure. Every previous VLIW processor could only make use of static code, which had been compiled and optimized for that specific processor. Some optimizations can only be performed at runtime, primarily optimizations dealing with memory references, so VLIW processors had to make do without them.

Code Morphing also circumvents a large hurdle: generation compatibility. With a conventional VLIW, the compiler does all the optimization for one specific hardware organization, so code optimized for one chip may not suit a later generation with a different organization. Since, with Code Morphing, the code is compiled on the fly by the Code Morpher on the chip, it will always be compiled correctly for that processor. Code Morphing is a huge advance; it essentially allows a VLIW processor to retain its simplified hardware while still behaving as if it had dynamic scheduling hardware for optimizations.
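At its heart, the translate-and-cache loop is simple. Here is a greatly simplified sketch; the block addresses, the fetch and compile stand-ins, and the cache structure are all invented for illustration, and real Code Morphing additionally profiles the code and re-optimizes the hot parts.

    translation_cache = {}          # x86 block address -> native VLIW code

    def translate(block_addr, fetch_x86_block, compile_to_vliw):
        # Return native code for the block at block_addr, translating
        # it only the first time it is encountered.
        if block_addr not in translation_cache:
            x86_block = fetch_x86_block(block_addr)
            translation_cache[block_addr] = compile_to_vliw(x86_block)
        return translation_cache[block_addr]

    # Toy stand-ins: "translation" here just tags the block.
    fetch = lambda addr: "x86-block@" + hex(addr)
    to_vliw = lambda block: "vliw(" + block + ")"

    print(translate(0x1000, fetch, to_vliw))   # translated, then cached
    print(translate(0x1000, fetch, to_vliw))   # served from the cache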
Two
other processors on the horizon make use of VLIW techniques. The first is the SUN MAJC. The SUN MAJC (Microprocessor Architecture
for Java Computing) is a VLIW processor designed to run Java code. It is an ambitious design that employs some interesting technologies. First, it has four functional units, but the units are data type agnostic, meaning that any instruction can execute on any of the four units, allowing a maximum of four instructions to be executed in parallel regardless of instruction type. The registers in the MAJC are also data type agnostic, so there is no division into floating point registers versus general registers versus vector registers. It also has variable length instructions, so the final instruction word can contain one, two, three, or four operations. This granularity reduces static code size, which can otherwise increase when part of the instruction word’s length goes unused. Where the Crusoe has Code Morphing, which dynamically translates x86 instructions into native VLIW code, the MAJC uses JIT, Just-in-Time compilation, which fulfills the same function for Java code. The MAJC will actually have four VLIW processors on die, each of which can handle a separate execution thread. To scale the processor, more processors will be added to handle more execution threads simultaneously. The MAJC can also pipeline and speculatively execute threads, something quite original.
The
second soon-to-be-released VLIW processor is the IA-64. The IA-64, also called the Itanium, is a new
64-bit processor developed by Intel and HP.
It executes VLIW instructions that are subject to significant hardware
and software optimization. The IA-64 is
statically optimized in a compiler, but there is no software layer like JIT or
Code Morphing that runs during runtime.
Instead, the benefits of these features are provided by advanced dynamic scheduling algorithms implemented in hardware. This makes it somewhat of a hybrid: VLIW machines aren’t supposed to have hardware handle scheduling and optimization, but the IA-64 does. However, it still utilizes a long instruction word and a compiler intended for VLIW processors. The IA-64 also supports variable length instruction words and, interestingly, places no limit on their length.
So if a processor had enough functional units to handle ten independent
instructions, they could be assembled into a ten-operation long instruction
word, a very long instruction word indeed.
The IA-64 avoids empty operations, no-ops, at all costs, so it implements features like speculative execution and predication, which keep the processor busy with work that may prove useful, in hopes of faster execution.
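Predication is worth a small sketch of its own, since it is how the IA-64 turns a branch into straight-line code. Each operation carries a predicate register, both arms of the old branch are issued, and an operation whose predicate is false simply has no effect. The encoding below is invented for illustration.

    def run_predicated(ops, regs):
        # Execute (predicate_reg, dest_reg, compute_fn) ops in order.
        # Every op occupies an issue slot, but only those whose
        # predicate register holds True write their result, so the
        # branch has been replaced by data.
        for pred, dest, fn in ops:
            if regs[pred]:
                regs[dest] = fn(regs)
        return regs

    regs = {"p": True, "np": False, "a": 3, "b": 0}   # p and np are complementary
    ops = [
        ("p",  "b", lambda r: r["a"] * 2),    # then-arm: predicate is True
        ("np", "b", lambda r: r["a"] + 100),  # else-arm: squashed
    ]
    print(run_predicated(ops, regs)["b"])     # 6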
Just
as RISC was able to achieve performance gains by learning from the mistakes of
CISC and creating a simpler fundamental architecture, superscalar
architectures seem to be another step forward.
If increasing amounts of optimization can be offloaded to software and increasing amounts of parallelism can be found, the raw speed and simplicity of
VLIW will distinguish it from other superscalar architectures.
Code Morphing, in some ways, is not a new idea. Ever since the IBM System/360 family, the idea of providing binary compatibility through the abstraction of an instruction set architecture has existed, though in the 1960s it was achieved via microcode, an idea developed back in the 1950s and 1960s. Other parts of Code Morphing are slightly younger: it performs dynamic scheduling, speculation, loop unrolling, and other optimizations that have been seen before. However, some features of Code Morphing are fairly revolutionary in processor design.
The first is translation. The idea behind Code Morphing is to emulate another instruction set, converting those instructions into native code. There are several benefits to this. The first is that it solves the problem of legacy software: legacy code can be translated to take advantage of the modern processor. It also provides another layer of abstraction, which makes the Code Morphing software and the processor upgradeable without the need to recompile older binaries. Another benefit is that once code has been translated and optimized, the result is stored in a translation cache, which saves the overhead of reconverting common instructions.
The second is the fact that the Code Morphing software does this translation on the fly. This technique is called dynamic compilation and is relatively new. A common example of it is Just-in-Time compilation, which accelerates Java bytecode execution. The advantage of dynamic compilation is that the compilation, and hence the optimization, occurs at runtime, when more information is known about the behavior of the program. That means the optimizations are more effective. It allows the software to perform tricks that were once reserved for hardware dynamic scheduling. When this can be done in software, the result is a smaller, simpler, cooler chip, and it goes a long way toward the streamlined, less-is-more paradigm of RISC computing.
While Code Morphing may not have many precursors, it does have company. Both HP’s Dynamo and SUN’s JIT are similar technologies. HP’s Dynamo runs at the user level of the system, as opposed to Code Morphing, which runs under the OS. Dynamo converts HP-8000 code into, well, HP-8000 code. While this might initially seem strange, dynamic compilation and translation appear to have noticeable performance benefits, even with the overhead of the Dynamo translator running. Some programs experienced a 20% speedup, though more testing is needed before conclusive performance gains can be known.
SUN’s MAJC is to Transmeta’s Crusoe as Just-in-Time compilation is to Code Morphing. JIT is another method of dynamic compilation; however, instead of translating 80x86 instructions, it translates Java code, which the MAJC would primarily be running.
Finally, IBM has been developing a scheme called DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY draws heavily on the idea of an execution tree, which maps how the program can execute. The program uses condition registers to track the status of execution; only instructions on the path of execution are translated and compacted into VLIW instructions.
Dynamic compilation provides a powerful ability: to optimize code on the fly in software, and potentially to translate code from one ISA to another. This makes it a technique that many future processors may make use of.