CISC vs. RISC Introduction

 

            If you’ve had a computer architecture course, you’ve certainly heard of CISC and RISC systems.  New VLIW systems seem to take this a step further, into a Post-RISC era.  The step is easier to see if you first compare CISC and RISC to one another.  Here is a quick review for those of you who have only heard the buzzwords:

 

CISC – Complex Instruction Set Computer.  A CISC CPU has an instruction for just about everything.  The idea is to move the complexity into the hardware and make the software less demanding.  The purpose is to decrease code size, even though the cycles per instruction (CPI) may increase.

 

RISC – Reduced Instruction Set Computer.  A RISC computer takes much of the complexity away from the hardware and puts it into the software.  The instruction set is smaller and simpler.  Unlike the CISC machine, a RISC machine accepts longer code sequences in exchange for a lower total CPI.
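
Both definitions are trading off terms in the classic performance equation for a processor:

    CPU time  =  instruction count  x  CPI  x  clock cycle time

CISC shrinks the instruction count at the cost of a higher CPI; RISC accepts a larger instruction count to lower the CPI (and often the clock cycle time).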

 

            Taking a closer look, we see that support for Higher Level Languages (HLLs) is found in the hardware on CISC machines and in the software on RISC machines.  The comparison is easy to understand, but since both approaches support HLLs, it is hard to tell which one most computers now use.  So, “What kind of computer do I have?” you may ask.  Most CPUs on the market today, such as the AMD Athlon, Pentium III, and Pentium 4, cannot really be classified as simply CISC or RISC.  Over the years the distinction between these two architectural ideals has grown smaller and smaller.  Most newer processors use a rich instruction set like a CISC machine, but they also use strategies such as pipelined execution, a RISC idea.  There are pros and cons to both, so you might as well take as many of the pros as you can.

 

            Looking at the evolution of the RISC machine, you can see a striking similarity in philosophy with a machine that uses VLIW.  The basic idea of VLIW (described in earlier pages) is to put many operations into one Very Long Instruction Word.  This is a Post-RISC idea that follows the pattern of ISA evolution well, reducing the number of instructions issued even further.  Your code may grow somewhat, but after the implementation problems of the 1980s were worked out this became a valid approach.  Since superscalar implementations are becoming harder to debug, and RISC ISAs are tied to an older style of instruction set, VLIW is likely to become more popular.  For example, IBM considered canceling its PPC 620 (which uses a RISC ISA) when a full year of debugging still had not produced a correctly working chip.

 

The Dawn of Superscalar

 

            In the 1960s it became practical to build CPUs with additional hardware and functional units (adders, multipliers, etc.), since transistors were cheap.  If this additional hardware was included and utilized, it became possible to execute multiple instructions in parallel.  An architecture that handles one instruction at a time is called a scalar architecture, and its goal is to complete one instruction per clock cycle.  An architecture that can handle multiple instructions at a time is called superscalar, and its goal is to complete as many instructions as possible per clock cycle.  For instructions to be handled simultaneously, they need to be independent of each other; that is, there must be no dependences between them, such as one instruction needing a register another writes.  VLIW is a type of superscalar architecture.  The key idea in the superscalar paradigm, and thus in VLIW, is instruction level parallelism.  So naturally, the beginning of VLIW processors can be traced back to the discovery of ways to exploit instruction level parallelism in computer code.
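
            As a minimal sketch of what “independent” means here (using a hypothetical three-register instruction format, not any real ISA), two instructions can issue together only when neither one reads or writes a register the other writes:

    from dataclasses import dataclass

    @dataclass
    class Instr:
        op: str            # e.g. "add" or "mul"
        dest: str          # register written
        srcs: tuple        # registers read

    def independent(a, b):
        """True if a and b can issue in the same cycle (no RAW/WAR/WAW hazard)."""
        raw = a.dest in b.srcs       # b reads what a writes
        war = b.dest in a.srcs       # b writes what a reads
        waw = a.dest == b.dest       # both write the same register
        return not (raw or war or waw)

    # add r1, r2, r3 and mul r4, r5, r6 touch disjoint registers: independent
    print(independent(Instr("add", "r1", ("r2", "r3")),
                      Instr("mul", "r4", ("r5", "r6"))))    # prints True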

 

Exploiting ILP

 

            In 1963 the Control Data Corporation built the CDC 6600, a notable computer system.  It was the first computer to make use of a scoreboard, a hardware mechanism used to schedule instructions efficiently.  In addition, it included 10 functional units, several more than strictly necessary.  These additional units, such as an incrementer and a multiplier, could be used to perform two of those operations at once, speeding up the processor.  However, the way it achieved this instruction level parallelism was slightly different from a VLIW processor.  At its core, both the CDC 6600 and a VLIW machine use multiple functional units to execute multiple instructions.  But the CDC used a scoreboard to decide which discrete instructions could be executed simultaneously.  In a VLIW processor there is no such hardware; the instructions are issued one by one, but each is composed of multiple operations.  So how does the VLIW hardware know which instruction to process?  It picks the next one.  All scheduling and optimization are done at compile time, not at run time as with the scoreboard.  In a VLIW processor, the hardware's job is to run the code and little else; in other superscalar architectures, the hardware plays a much more significant role.
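
            A minimal sketch of the scoreboard idea (hypothetical and greatly simplified; the real CDC 6600 scoreboard tracked considerably more state): the hardware records which functional units are busy and which registers have results pending, and holds an instruction back until it is hazard-free:

    class Scoreboard:
        def __init__(self, units):
            self.free_units = set(units)    # e.g. {"adder", "multiplier"}
            self.pending = set()            # registers still awaiting a result

        def can_issue(self, unit, dest, srcs):
            if unit not in self.free_units:           # functional unit busy
                return False
            if any(r in self.pending for r in srcs):  # a source isn't ready (RAW)
                return False
            if dest in self.pending:                  # destination in flight (WAW)
                return False
            return True

        def issue(self, unit, dest):
            self.free_units.discard(unit)
            self.pending.add(dest)

        def complete(self, unit, dest):
            self.free_units.add(unit)
            self.pending.discard(dest)

A VLIW machine needs none of this bookkeeping at run time; the compiler has already guaranteed that the operations packed into one word are hazard-free.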

            The precursor to VLIW processors predates the CDC 6600 and is called horizontal microcode.  Initially, CPUs were built with very complicated instruction sets, primarily to ease the task of writing assembly language programs for the architecture and to reduce static code size.  Instead of being implemented directly in hardware, most of these instructions were aliases for a set of simpler primitive operations that were actually executed by the CPU.  The translations between the ISA instructions and the primitive operations were stored in a separate ROM that sat between the code and the execution hardware.  This system was called microprogramming and was proposed by Maurice Wilkes in 1951.  The idea can also be seen as a precursor to the code morphing technology of the Crusoe, which will be discussed later.
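
            A minimal sketch of the microprogramming idea (the instruction names and micro-operations below are made up for illustration, not any real machine's microcode): each complex ISA instruction is looked up in a control ROM and expanded into the primitive operations the hardware actually runs:

    # the "ROM": complex ISA instructions mapped to primitive micro-operations
    MICROCODE_ROM = {
        "ADD_MEM": ["load temp, [addr]", "add acc, temp"],
        "INC_MEM": ["load temp, [addr]", "add temp, 1", "store [addr], temp"],
    }

    def execute(instruction):
        # the hardware never sees the complex instruction, only its expansion
        for micro_op in MICROCODE_ROM[instruction]:
            print("executing:", micro_op)

    execute("INC_MEM")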

 

The End of Superscalar?

 

            In 1970, several researchers studied instruction level parallelism and its potential to improve computing performance.  The work of Tjaden and Flynn concluded that the potential speedup from ILP was only a factor of two or three, because of the lack of independent instructions within a basic block of code.  This result was dubbed Flynn’s bottleneck, and it stifled research into instruction level parallelism and VLIW for close to a decade.  However, while their methods were sound, their conclusions were not: they had neglected the possibility of code motion, moving instructions between basic blocks separated by branches.

 

The Possibility of VLIW

 

Much of the renewed hope that VLIW and ILP could yield significant performance gains can be attributed to Joseph Fisher, who created trace scheduling in 1979.  Trace scheduling, in the basic sense, scans through code looking for blocks of code that are independent.  If such blocks exist, those sets of instructions can be run in parallel, as if they were two separate processes.  The process is actually more complicated and is broken into two steps.  The first step, trace selection, combines basic blocks of code into a straight-line sequence.  This sequence can include portions of code from loops or branches.  After this sequence is generated, it enters the next stage, called trace compaction.  In this stage the sequence is flattened into a smaller sequence of wider (parallel) instructions.  This is important, as one of the concerns with parallel computing is keeping all the additional hardware busy.  VLIW machines are “dumb processors” that lack much of the sophisticated scheduling hardware conventional superscalar processors have.  The capability to extract parallelism statically was an important step toward making VLIW a viable design strategy.  It is no wonder that Fisher titled one of his papers “Parallel Processing: A Smart Compiler and a Dumb Machine.”  This advance led to the development of the ELI-512 processor and the Bulldog trace-scheduling compiler.
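
A minimal sketch of the compaction step (a hypothetical representation; real trace compaction also inserts compensation code when operations move across block boundaries, which is omitted here): walk a straight-line trace of operations and greedily pack each one into the current wide instruction word if it is independent of everything already there:

    def compact(trace, independent, width):
        """trace: list of operations in trace order;
        independent(a, b): dependence test between two operations;
        width: number of operation slots per long instruction word."""
        words, current = [], []
        for op in trace:
            if len(current) < width and all(independent(op, o) for o in current):
                current.append(op)        # fits alongside the ops already chosen
            else:
                words.append(current)     # emit the word, start a new one
                current = [op]
        if current:
            words.append(current)
        return words                      # each entry is one wide instruction

The fewer words that come out relative to the operations that went in, the more parallelism the trace contained.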

 

First Attempts

    

            Most early VLIW machines should really be called LIW machines, as the number of operations formed into a word usually wasn’t more than two.  The Floating Point Systems AP-120B, developed by Alan Charlesworth and Glen Culler, was an example of such a computer.

In 1984 several companies introduced VLIW processors more advanced than anything seen on the market before.  Joseph Fisher, the inventor of trace scheduling, created his own computer company, Multiflow.  Multiflow produced the Trace/200, Trace/300, and finally Trace/500.  The Trace computers handled instruction words that were 7, 14, or 28 operations wide and used trace scheduling during compilation to extract instruction level parallelism.  Another VLIW innovator, Bob Rau, formed the company Cydrome and produced the Cydra 5.

In 1986 Culler Systems developed the Culler 7, a project headed by the owner of the company, Glen Culler.  The Culler 7 had two types of instructions: A instructions, which were basic ALU operations, and X instructions, which were microcode programs that could consist of multiple operations.  The Culler 7 had separate registers and memories for A and X type instructions, and thus achieved parallelism by concatenating an A and an X instruction into a single word and executing them together.

Other VLIW processors were introduced as well.  Intel experimented with VLIW in the i860.  In 1988 Apollo Computer released the Apollo PRISM, a 3-wide LIW processor.  The PRISM was capable of combining a floating-point add, a floating-point multiply, and an integer instruction into one word.  All of these companies eventually failed, but the new hardware and compiler techniques were released into the mainstream and provided a starting point for several other chipmakers.

 

Specialized Processors

 

            In 1996 Philips introduced a “media processor” that implemented VLIW, called the TriMedia chip.  The chip possessed hardware for audio and video I/O, compression, and image coprocessing.  This was a more advanced VLIW architecture: it had 128 general purpose registers and 27 functional units.  With that many functional units, instruction words can be made wider to keep them occupied.  The TriMedia chip also used speculative execution, a common technique in newer VLIW processors.  Instead of stalling until the outcome of a branch is known, it takes advantage of VLIW’s raw processing power to execute both outcomes and commit the correct one once it is known.  Texas Instruments released a similar DSP, the TMS320C6x, in 1997.  It was never designed to be a general purpose processor, but it performed well in digital signal applications.  The TMS320C6x supported variable length instruction words and made extensive use of static compiler optimizations.
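
            A minimal sketch of this kind of branch speculation (hypothetical operations, ignoring side effects and resource limits): with spare functional units, both candidate results are computed before the condition resolves, and the commit step simply keeps the right one:

    def speculative_branch(cond, a, b):
        taken_result = a + b       # work for the taken path, started early
        fallthru_result = a - b    # work for the not-taken path, started early
        # once the condition is known, commit one result and discard the other;
        # no cycles were spent stalled waiting for the branch to resolve
        return taken_result if cond else fallthru_result

    print(speculative_branch(True, 5, 2))    # prints 7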

 

Present and Future – Embedded and General Purpose Processors

 

Started in 1995 and released in 2000 by Transmeta, the Crusoe takes a somewhat different approach to VLIW.  It is not targeted at the pure performance market, and as such it is not as complicated a processor as previous VLIW architectures.  While some VLIW processors support out of order issue and dynamic scheduling, the Crusoe does not.  The processor consequently takes a performance hit, but an acceptable one.  The goal of the Crusoe was to make a processor that was power efficient while maintaining 80x86 compatibility.  The processor was targeted specifically at the mobile PC market, where trading performance for battery life is acceptable and even desirable.  Additionally, because of the simplified hardware and straightforward execution, the Crusoe allows for a smaller die size, lower manufacturing costs, and less heat output.  These qualities are beneficial in any target market.  In order to achieve some of the performance of more complicated VLIW processors, the Crusoe uses a dynamic compilation technique called Code Morphing, which translates and optimizes x86 instructions into native Crusoe VLIW instructions at run time.  Note that this is a large and important departure.  Every previous VLIW processor could only make use of static code, which had been compiled and optimized for that specific processor.  Some optimizations can only be performed at runtime, primarily optimizations dealing with memory references, so VLIW processors had to make do without them.  Code Morphing also circumvents a large hurdle: generation compatibility.  Since the compiler normally does all the optimization, code compiled for one chip may not run well on a later generation of Crusoe, which may have a different hardware organization.  But because the code is compiled on the fly by the Code Morpher on the chip, it will always be compiled correctly for that processor.  Code Morphing is a huge advance; it essentially allows a VLIW processor to retain its simplified hardware yet still behave as if it had dynamic scheduling hardware for optimizations.

Two other processors on the horizon make use of VLIW techniques.  The first is the SUN MAJC (Microprocessor Architecture for Java Computing), a VLIW processor designed to run Java code.  It is an ambitious design that employs some interesting technologies.  First, it has four functional units, but the units are data type agnostic, meaning that any instruction can execute on any of the four, allowing a maximum of four instructions to execute in parallel regardless of the instruction types.  The registers in the MAJC are also data type agnostic, so there are no separate floating point, general purpose, or vector register sets.  It also has variable length instruction words, so the final instruction word can contain one, two, three, or four operations.  This granularity reduces static code size, which would otherwise grow whenever part of a fixed-width instruction word went unused.  Where the Crusoe has Code Morphing, which dynamically translates x86 instructions into native VLIW code, the MAJC uses JIT (Just-in-Time) compilation, which fulfills the same function for Java code.  The MAJC will actually have four VLIW processors on the die, each of which can handle a separate execution thread.  To scale the design, more processors will be added to handle more execution threads simultaneously.  The MAJC can also pipeline and speculatively execute threads, something quite original.

            The second soon-to-be-released VLIW processor is the IA-64.  The IA-64, also called the Itanium, is a new 64-bit processor developed by Intel and HP.  It executes VLIW instructions that are subject to significant hardware and software optimization.  The IA-64 is statically optimized by a compiler, but there is no software layer like JIT or Code Morphing running at run time.  Instead, the benefits of those features, advanced dynamic scheduling algorithms, are implemented in hardware.  This makes it somewhat of a hybrid: VLIW machines aren't supposed to have hardware handle scheduling and optimization, but the IA-64 does.  However, it still utilizes a long instruction word and a compiler designed for VLIW processors.  The IA-64 also supports variable length instruction words and, interestingly, places no limit on their length.  So if a processor had enough functional units to handle ten independent instructions, they could be assembled into a ten-operation long instruction word, a very long instruction word indeed.  The IA-64 avoids empty operations, no-ops, at all costs, so it implements features like speculative execution and predication, which occupy the processor's time in hopes of faster execution.
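
            A minimal sketch of predication (a hypothetical encoding, much simpler than the IA-64's actual predicate registers): every operation carries a predicate, all of them execute, and only the results whose predicates are true are committed, eliminating the branch entirely:

    def run_predicated(ops, state):
        """ops: list of (predicate_reg, dest_reg, fn) tuples; state: register file."""
        for pred, dest, fn in ops:
            result = fn(state)          # every slot executes; no branch, no stall
            if state[pred]:             # ...but only true-predicate results commit
                state[dest] = result
        return state

    # if (p) r1 = r2 + r3 else r1 = r2 - r3, with no branch at all:
    regs = {"p": True, "np": False, "r1": 0, "r2": 5, "r3": 2}
    run_predicated([("p",  "r1", lambda s: s["r2"] + s["r3"]),
                    ("np", "r1", lambda s: s["r2"] - s["r3"])], regs)
    print(regs["r1"])    # prints 7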

            Just as RISC was able to achieve performance gains by learning from the mistakes of CISC and creating a simpler fundamental architecture, superscalar architectures seem to be another step forward.  If increasing amounts of optimization can be offloaded to software and increasing amounts of parallelism can be found, the raw speed and simplicity of VLIW will distinguish it from other superscalar architectures.

 

Code Morphing

 

            Code Morphing, in some ways, is not a new idea.  Ever since the IBM System/360 family, the idea of achieving binary compatibility through the abstraction of an instruction set architecture has existed; in the 1960s this was accomplished via microcode, an idea developed in the 1950s.  Other parts of code morphing are slightly younger.  Code morphing does dynamic scheduling, speculation, loop unrolling, and other optimizations that have been seen before.  However, some features of Code Morphing are fairly revolutionary in processor design.

            The first is translation.  The idea behind code morphing is to emulate another instruction set, converting those instructions into native code.  There are several benefits to this.  The first is that it helps solve the problem of legacy software: legacy code can be translated to take advantage of the modern processor.  It also provides another layer of abstraction, which makes both the code morphing software and the processor upgradeable without the need to recompile older binaries.  Another benefit is that once code has been translated and optimized, it is stored in a translation cache, which saves the overhead of reconverting commonly executed instructions.
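
            A minimal sketch of translation with a cache (the translate and execute functions here are stand-ins, not Transmeta's actual software): each block of x86 code is translated to native VLIW code at most once, and every later execution of the block runs straight from the cache:

    translation_cache = {}

    def run_block(addr, guest_blocks, translate, execute):
        native = translation_cache.get(addr)
        if native is None:                          # slow path: first execution
            native = translate(guest_blocks[addr])  # translate + optimize x86 -> VLIW
            translation_cache[addr] = native        # remember the result
        execute(native)                             # fast path on every later run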

            The second is the fact that the Code Morphing software does this translation on the fly.  This technique is called dynamic compilation and is relatively new.  A common example of it is Just-in-Time compilation, which accelerates Java bytecode execution.  The advantage of dynamic compilation is that the compilation, and hence the optimization, occurs at runtime, when more is known about the behavior of the program.  That makes the optimizations more effective.  It allows the software to perform tricks that were once reserved for hardware dynamic scheduling.  When these can be done in software, the result is a smaller, simpler, cooler chip, and it goes a long way toward the streamlined, less-is-more paradigm of RISC computing.

            While Code Morphing may not have many precursors, it does have company.  Both HP’s Dynamo and SUN’s JIT are similar technologies.  HP’s Dynamo runs at the user level of the system, as opposed to Code Morphing, which runs under the OS.  Dynamo converts PA-8000 code into, well, PA-8000 code.  While this might initially seem strange, dynamic compilation and translation appear to have noticeable performance benefits, even with the overhead of the Dynamo translator running.  Some programs experienced a 20% speedup, though more testing is needed before the performance gains can be known conclusively.

            SUN’s MAJC is to Transmeta’s Crusoe as Just-in-Time compilation is to Code Morphing.  JIT is another method of dynamic compilation; however, instead of translating 80x86 instructions, JIT translates the Java code that the MAJC would primarily be running.

            Finally, IBM has been developing a scheme called DAISY (Dynamically Architected Instruction Set from Yorktown).  DAISY draws heavily on the idea of an execution tree, which maps how the program can execute.  It uses condition registers to track the status of execution, and only instructions on the actual path of execution are translated and compacted into VLIW instructions.

 

Dynamic compilation provides a powerful ability: to optimize code on the fly in software, and potentially to translate code from one ISA to another.  That makes it a technique that many future processors may put to use.