
What is DLX?
An Implementation of DLX
The Basic DLX Pipeline

Pipelining

Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The computer pipeline is divided into stages. Each stage completes one part of an instruction, and the stages work in parallel on different instructions. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.

Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput. The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline.

Because the pipe stages are hooked together, all the stages must be ready to proceed at the same time. We call the time required to move an instruction one step further in the pipeline a machine cycle. The length of the machine cycle is determined by the time required for the slowest pipe stage. The pipeline designer's goal is to balance the length of each pipeline stage. If the stages are perfectly balanced, then the time per instruction on the pipelined machine is equal to the time per instruction on the unpipelined machine divided by the number of pipe stages.
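As a rough sketch of the cycle-time argument above: the machine cycle is the latency of the slowest stage, and with perfectly balanced stages the ideal time per instruction is the unpipelined time divided by the number of stages. The stage latencies below are made-up numbers for illustration only, and the sketch ignores any overhead added by pipeline registers.

```python
def machine_cycle(stage_times_ns):
    """The machine cycle is set by the slowest pipe stage."""
    return max(stage_times_ns)


def ideal_speedup(stage_times_ns):
    """Unpipelined time per instruction divided by the pipelined machine cycle."""
    unpipelined = sum(stage_times_ns)  # one instruction passes through every stage
    return unpipelined / machine_cycle(stage_times_ns)


# Hypothetical stage latencies in nanoseconds (not taken from any real DLX design).
balanced = [10, 10, 10, 10, 10]
unbalanced = [10, 8, 12, 10, 10]

for stages in (balanced, unbalanced):
    print(stages, "cycle =", machine_cycle(stages), "ns,",
          "speedup =", round(ideal_speedup(stages), 2))
```

With the balanced latencies the speedup equals the number of stages (5). With the unbalanced ones the slowest stage stretches the cycle to 12 ns and the speedup falls below 5.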

What is DLX?

DLX is a simple pipelined CPU architecture. It is mostly used in universities as a model for studying pipelining techniques.

The architecture of DLX was chosen based on observations about the most frequently used primitives in programs. DLX provides a good architectural model for study, not only because of the recent popularity of this type of machine, but also because it is easy to understand.
Like most recent load/store machines, DLX emphasizes:

A simple load/store instruction set
Design for pipelining efficiency
An easily decoded instruction set
Efficiency as a compiler target

Operations

There are four classes of instructions:

Load/Store
Any of the GPRs or FPRs may be loaded and stored except that loading R0 has no effect.
ALU Operations
All ALU instructions are register-register instructions.
The operations are:
- add
- subtract
- AND
- OR
- XOR
- shifts
Compare instructions compare two registers (=, !=, <, >, <=, >=).
If the condition is true, these instructions place a 1 in the destination register, otherwise they place a 0.
Branches/Jumps
All branches are conditional. The branch condition is specified by the instruction, which may test the source register for zero or nonzero.
Floating-Point Operations
- add
- subtract
- multiply
- divide

An Implementation of DLX

Implementing the instruction set requires the introduction of several temporary registers that are not part of the architecture.
Every DLX instruction can be implemented in at most five clock cycles. The five clock cycles are:

Instruction fetch cycle (IF)
Instruction decode/register fetch (ID)
Execution/Effective address cycle (EX)
Memory access/branch completion cycle (MEM)
Write-back cycle (WB)

A detailed description of each cycle follows:

Instruction fetch cycle (IF):

IR <- MEM[PC]
NPC <- PC + 4

Operation:
- Send out the PC and fetch the instruction from memory into the instruction register (IR)
- increment the PC by 4 to address the next sequential instruction
- the IR is used to hold the instruction that will be needed on subsequent clock cycles
- the NPC is used to hold the next sequential PC (program counter)
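The two register transfers of the IF cycle can be sketched in a few lines. This is a minimal model, assuming instruction memory is a dictionary keyed by byte address. The function name and the instruction words are invented for the example.

```python
def instruction_fetch(pc, imem):
    """IF cycle: IR <- MEM[PC]; NPC <- PC + 4."""
    ir = imem[pc]   # fetch the instruction word addressed by the PC
    npc = pc + 4    # address of the next sequential instruction
    return ir, npc


# Hypothetical instruction memory: byte address -> 32-bit word (arbitrary values).
imem = {0: 0x20010005, 4: 0x20020007}

ir, npc = instruction_fetch(0, imem)
print(hex(ir), npc)
```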

Instruction decode/register fetch (ID):

A <- Regs[IR_6..10]
B <- Regs[IR_11..15]
Imm <- ((IR_16)^16 ## IR_16..31)

Operation:
- Decode the instruction and access the register file to read the registers.
- the outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles.
- the lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm, for use in the next cycle.
- decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the DLX instruction format. This technique is known as fixed-field decoding.
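Fixed-field decoding can be illustrated with a small sketch. It assumes that bit 0 is the most significant bit of the 32-bit instruction word, which is what the sign-extension expression above implies (IR_16 is the sign bit of the low 16 bits). The helper names and the sample register contents are made up for the example, and the immediate is returned as an ordinary signed Python integer rather than a 32-bit pattern.

```python
def field(ir, first, last):
    """Extract bits first..last of a 32-bit word, counting bit 0 as the most significant."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement number."""
    return value - 0x10000 if value & 0x8000 else value


def instruction_decode(ir, regs):
    """ID cycle: read both register operands and sign-extend the immediate."""
    a = regs[field(ir, 6, 10)]               # A   <- Regs[IR_6..10]
    b = regs[field(ir, 11, 15)]              # B   <- Regs[IR_11..15]
    imm = sign_extend16(field(ir, 16, 31))   # Imm <- sign-extended IR_16..31
    return a, b, imm


# Hypothetical word: opcode bits left at zero, register fields 2 and 3, immediate -4.
ir = (2 << 21) | (3 << 16) | 0xFFFC
regs = [10 * i for i in range(32)]   # illustrative contents: register i holds 10*i

print(instruction_decode(ir, regs))
```

Because the register fields sit at the same bit positions in every format, all three values can be read before the opcode is known, which is what lets decoding and register fetch overlap.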

Execution/Effective address cycle (EX):
The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the DLX instruction type:

Memory reference:
ALUOutput <- A + Imm
Operation: The ALU adds the operands to form the effective address and places the result into the register ALUOutput.
Register-Register ALU instruction:
ALUOutput <- A op B
Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register B. The result is placed in the register ALUOutput.
Register-Immediate ALU instruction:
ALUOutput <- A op Imm
Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register Imm. The result is placed in the register ALUOutput.
Branch:
ALUOutput <- NPC + Imm
Cond <- ( A op 0 )

Operation:
- The ALU adds the NPC to the sign-extended immediate value in Imm to compute the address of the branch target.
- Register A, which has been read in the prior cycle, is checked to determine whether the branch is taken.
- The comparison operation op is the relational operator determined by the branch opcode (e.g. op is "==" for the instruction BEQZ).
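The four EX-cycle functions can be sketched as a single dispatch on the instruction type. The type labels ("memory", "reg_reg", "reg_imm", "branch") and the function signature are assumptions made for this example. The sketch uses unbounded Python integers and does not model 32-bit wraparound.

```python
import operator


def execute(kind, a, b, imm, npc, op=None):
    """EX cycle: compute ALUOutput (and Cond for branches) for one instruction type."""
    if kind == "memory":      # effective address of a load or store
        return {"ALUOutput": a + imm}
    if kind == "reg_reg":     # ALUOutput <- A op B
        return {"ALUOutput": op(a, b)}
    if kind == "reg_imm":     # ALUOutput <- A op Imm
        return {"ALUOutput": op(a, imm)}
    if kind == "branch":      # branch target plus the condition test on A
        return {"ALUOutput": npc + imm, "Cond": op(a, 0)}
    raise ValueError(f"unknown instruction type: {kind}")


print(execute("memory", a=100, b=0, imm=8, npc=4))
print(execute("reg_reg", a=6, b=3, imm=0, npc=4, op=operator.sub))
print(execute("branch", a=0, b=0, imm=16, npc=104, op=operator.eq))  # BEQZ-style test
```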

Memory access/branch completion cycle (MEM):
The only DLX instructions active in this cycle are loads, stores, and branches.

Memory reference:
LMD <- Mem[ALUOutput] or Mem[ALUOutput] <- B
Operation:
- Access memory if needed.
- If the instruction is a load, data returns from memory and is placed in the LMD (load memory data) register.
- If the instruction is a store, data from the B register is written into memory.
- In either case the address used is the one computed during the prior cycle and stored in the register ALUOutput.
Branch:
if (cond) PC <- ALUOutput
else PC <- NPC
Operation:
- If the branch is taken, the PC is replaced with the branch destination address in the register ALUOutput.
- Otherwise, the PC is replaced with the incremented PC in the register NPC.
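The MEM-cycle behaviour described above can be sketched as follows. Data memory is modelled as a dictionary keyed by address. The type labels and the function signature are assumptions for illustration, and only the three cases named in the text are modelled.

```python
def memory_access(kind, alu_output, b, cond, npc, dmem):
    """MEM cycle: perform the load, store, or branch completion for one instruction."""
    result = {}
    if kind == "load":
        result["LMD"] = dmem[alu_output]              # LMD <- Mem[ALUOutput]
    elif kind == "store":
        dmem[alu_output] = b                          # Mem[ALUOutput] <- B
    elif kind == "branch":
        result["PC"] = alu_output if cond else npc    # taken: target, else NPC
    return result


dmem = {108: 7}   # hypothetical data memory contents

print(memory_access("load", 108, None, False, 4, dmem))
memory_access("store", 112, 99, False, 4, dmem)
print(dmem)
print(memory_access("branch", 120, None, True, 104, dmem))
```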

Write-back cycle (WB):

Register-Register ALU instruction:
Regs[IR_16..20] <- ALUOutput
Register-Immediate ALU instruction:
Regs[IR_11..15] <- ALUOutput
Load instruction:
Regs[IR_11..15] <- LMD
Operation:
- Write the result into the register file, whether it comes from memory (LMD) or from the ALU (ALUOutput)
- the register destination field is in one of two positions depending on the opcode
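The choice between the two destination-field positions can be sketched like this. Bit 0 is again taken to be the most significant bit of the instruction word, and the type labels and helper names are hypothetical. Skipping every write to R0 is an assumption: the text above only states that loading R0 has no effect.

```python
def field(ir, first, last):
    """Extract bits first..last of a 32-bit word, counting bit 0 as the most significant."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def write_back(kind, ir, alu_output, lmd, regs):
    """WB cycle: select the destination field and the value source from the instruction type."""
    if kind == "reg_reg":
        dest, value = field(ir, 16, 20), alu_output   # Regs[IR_16..20] <- ALUOutput
    elif kind == "reg_imm":
        dest, value = field(ir, 11, 15), alu_output   # Regs[IR_11..15] <- ALUOutput
    elif kind == "load":
        dest, value = field(ir, 11, 15), lmd          # Regs[IR_11..15] <- LMD
    else:
        return regs                                   # stores and branches write no register
    if dest != 0:                                     # assumption: R0 is never modified
        regs[dest] = value
    return regs


regs = [0] * 32
rr_ir = (1 << 21) | (2 << 16) | (3 << 11)   # hypothetical word with bits 16..20 = 3
ld_ir = (1 << 21) | (4 << 16)               # hypothetical word with bits 11..15 = 4

write_back("reg_reg", rr_ir, 42, None, regs)
write_back("load", ld_ir, None, 7, regs)
print(regs[3], regs[4])
```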

The Basic DLX Pipeline

We can pipeline the DLX datapath with almost no changes by starting a new instruction on each clock cycle. Each of the clock cycles of the DLX datapath now becomes a pipe stage: a cycle in the pipeline.
While each instruction takes five clock cycles to complete, during each clock cycle the hardware will initiate a new instruction and will execute some part of five different instructions. The typical way to show what is going on is:

Instruction / Clock cycle	1	2	3	4	5	6	7	8	9
instr i	IF	ID	EX	MEM	WB
instr i+1		IF	ID	EX	MEM	WB
instr i+2			IF	ID	EX	MEM	WB
instr i+3				IF	ID	EX	MEM	WB
instr i+4					IF	ID	EX	MEM	WB
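The diagram above follows a simple rule: instruction i+k occupies stage number s during clock cycle k+s+1. A short sketch that rebuilds the same table from that rule is shown below. The function name is invented for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions):
    """Return one row per instruction; column c holds the stage occupied in clock cycle c+1."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for k in range(n_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[k + s] = stage   # each instruction starts one cycle after the previous one
        rows.append(row)
    return rows


for k, row in enumerate(pipeline_diagram(5)):
    label = "instr i" if k == 0 else f"instr i+{k}"
    print(label + "\t" + "\t".join(row))
```

Five instructions finish in 9 clock cycles, and in clock cycle 5 all five stages are busy with different instructions.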

Let's check again what happens on every clock cycle of the machine and make sure it does not perform two different operations with the same datapath resource on the same clock cycle. For example, a single ALU cannot compute an effective address and perform a subtract operation at the same time.
Fortunately, the simplicity of the DLX instruction set makes resource evaluation relatively easy. The major functional units are used in different cycles and hence overlapping the execution of multiple instructions introduces relatively few conflicts.

There are three observations on which this fact rests:

The basic datapath uses separate instruction and data memories. This eliminates a conflict for a single memory that would arise between instruction fetch and data memory access.
The register file is used in two stages: for reading in ID and for writing in WB. This means that we need to perform two reads and one write on every clock cycle. Question for you: What if a read and a write are to the same register?
To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. A problem arises when we consider the effect of branches, which also change the PC, but not until the MEM stage.
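The resource check described above can be sketched as a small script: for every clock cycle, collect the resources used by the instructions in flight and report any resource claimed more than once. The stage-to-resource mapping is a simplification made for this example. In particular, it pessimistically assumes that every instruction uses the memory in its MEM stage.

```python
from collections import Counter


def conflicts(stage_resources, n_instructions=5):
    """Return the set of resources used by two or more instructions in the same clock cycle."""
    stages = list(stage_resources)
    found = set()
    for cycle in range(n_instructions + len(stages) - 1):
        used = Counter()
        for instr in range(n_instructions):
            s = cycle - instr            # stage index of this instruction in this cycle
            if 0 <= s < len(stages):
                used.update(stage_resources[stages[s]])
        found.update(r for r, count in used.items() if count > 1)
    return found


# Hypothetical resource usage per stage, with separate instruction and data memories.
SPLIT = {
    "IF": ["instruction memory"],
    "ID": ["register read ports"],
    "EX": ["ALU"],
    "MEM": ["data memory"],
    "WB": ["register write port"],
}
# The same pipeline with a single memory shared by IF and MEM.
UNIFIED = dict(SPLIT, IF=["memory"], MEM=["memory"])

print(conflicts(SPLIT))
print(conflicts(UNIFIED))
```

With separate memories no resource is claimed twice in any cycle. With a single shared memory the instruction fetch of one instruction collides with the data access of an earlier one, which is the conflict the first observation avoids.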