Chapter 3
3.1 Instruction-Level Parallelism: Concepts and Challenges (page 172)
-
Instruction-level parallelism - the potential overlap among instructions;
pipelining exploits it by overlapping the execution of instructions to
improve performance, since the instructions can be evaluated in parallel.
-
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data
hazard stalls + Control stalls
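-
A quick worked instance of this equation (the stall components below are
made-up numbers for illustration, not figures from the text):

    # Hypothetical per-instruction stall averages plugged into the CPI equation.
    ideal_cpi = 1.0
    structural_stalls = 0.1   # avg structural-hazard stall cycles per instruction
    data_hazard_stalls = 0.3  # avg data-hazard stall cycles per instruction
    control_stalls = 0.2      # avg control-hazard stall cycles per instruction

    pipeline_cpi = ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls
    print(pipeline_cpi)              # 1.6
    print(ideal_cpi / pipeline_cpi)  # 0.625: running at 62.5% of ideal throughput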
-
Data Dependencies and Hazards (page 174)
-
Determining how one instruction depends on another is critical to determining
how much parallelism exists in a program and how that parallelism can be
exploited.
-
If two instructions are parallel, they can execute simultaneously
in a pipeline without causing any stalls, assuming the pipeline has sufficient
resources (and hence no structural hazards exist).
-
Data Dependencies (page 175)
-
An instruction j is data dependent on instruction i if either of
the following is true:
-
instruction i produces a result that may be used by instruction j, or
-
instruction j is data dependent on instruction k, and instruction k is
data dependent on instruction i (for transitivity, i.e. chains of instructions)
-
Dependencies that flow through memory locations are more difficult to detect
since two addresses may refer to the same location but look different:
For example, 100(R4) and 20(R6) may be identical (if R4 = 20 and R6 = 100,
both name address 120).
-
Name Dependencies (page 177)
-
Occurs when two instructions use the same register or memory location,
called a name, but there is no flow of data between the instructions
associated with that name. There are two types of name dependencies
between an instruction i that precedes an instruction j in program
order:
-
An antidependence between instruction i and instruction j occurs
when instruction j writes to a register or memory location that instruction
i reads. The original ordering must be preserved to ensure that i
reads the correct value.
-
An output dependence occurs when instruction i and instruction j
write the same register or memory location. The ordering between
the instructions must be preserved to ensure that the value finally written
corresponds to instruction j.
-
Both antidependencies and output dependencies are name dependencies, as
opposed to true data dependencies, since there is no value being transmitted
between the instructions.
-
Data Hazards (assume instruction i comes before instruction j in program
order) (page 177)
-
RAW (read after write) - j tries to read a source before i writes it, so
j incorrectly gets the old value. For example: a load instruction
followed by an integer ALU instruction that directly uses the load result
will lead to a RAW hazard.
-
WAW (write after write) - j tries to write an operand before it is written
by i. The writes end up being performed in the wrong order, leaving
the value written by i rather than the value written by j in the destination.
This hazard corresponds to an output dependence.
-
WAR (write after read) - j tries to write a destination before it is read
by i, so i incorrectly gets the new value. This hazard arises from
an antidependence.
-
RAR (read after read) - This is not a hazard.
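-
Putting the cases above together, here is a minimal sketch (not from the
text) that classifies the hazards between an earlier instruction i and a
later instruction j from the register sets each one reads and writes:

    # Classify data hazards between instruction i (earlier) and j (later),
    # given the sets of registers each instruction reads and writes.
    def hazards(i_reads, i_writes, j_reads, j_writes):
        found = []
        if i_writes & j_reads:
            found.append("RAW")  # j reads what i writes (true data dependence)
        if i_reads & j_writes:
            found.append("WAR")  # j writes what i reads (antidependence)
        if i_writes & j_writes:
            found.append("WAW")  # both write the same name (output dependence)
        return found             # shared reads alone (RAR) are not a hazard

    # e.g. i: R1 <- R2 + R3, then j: R4 <- R1 + R5
    print(hazards({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"}))  # ['RAW']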
-
Control Dependencies (page 178)
-
A control dependence determines the ordering of an instruction, i, with
respect to a branch instruction so that instruction i is executed in correct
program order and only when it should be.
-
Control dependence is preserved by implementing control hazard detection
that causes stalls.
3.2 Overcoming Data Hazards with Dynamic Scheduling (page 181)
-
Dynamic Scheduling - the hardware rearranges the instruction execution
to reduce the stalls while maintaining data flow and exception behavior.
It is good because:
-
It enables handling some cases when dependencies are unknown at compile
time (e.g. because they may involve a memory reference), and that simplifies
the compiler.
-
Allows code that was compiled with one pipeline in mind to run efficiently
on a different pipeline
-
Dynamic Scheduling: The Idea (page 182)
-
A major limitation of the simple pipelining techniques we discuss in Appendix
A is that they all use in-order instruction issue and execution: Instructions
are issued in program order, and if an instruction is stalled in the pipeline,
no later instructions can proceed. If there are multiple functional
units, these units could lie idle.
-
But we want an instruction to begin execution as soon as its operands are
available. Thus, this pipeline does out-of-order execution,
which implies out-of-order completion.
-
Out-of-order execution introduces the possibility of WAR and WAW hazards,
which do not exist in the five-stage integer pipeline and its logical extension
to an in-order floating-point pipeline.
-
WAW and WAR hazards are avoided by the use of register renaming.
-
An exception is imprecise if the processor state when the exception
is raised does not look exactly as if the instructions were executed sequentially
in strict program order. Imprecise exceptions can occur because of
two possibilities:
-
The pipeline may have already completed instructions that are later
in program order than the instruction causing the exception
-
The pipeline may have not yet completed some instructions that are
earlier in program order than the instruction causing the exception
-
To allow out-of-order execution, we essentially split the ID pipe stage
of our simple five-stage pipeline into two stages:
-
Issue - decode instructions, check for structural hazards
-
Read operands - wait until no data hazards, then read operands
-
Scoreboarding is a technique for allowing instructions to execute out of
order when there are sufficient resources and no data dependencies.
-
Tomasulo's algorithm has several major enhancements over scoreboarding.
-
Dynamic Scheduling Using Tomasulo's Approach (page 184)
-
Tracks when operands for instructions are available, to minimize RAW hazards,
and introduces register renaming, to minimize WAR and WAW hazards.
-
Designed to overcome long floating point delays.
-
RAW hazards are avoided by executing an instruction only when its operands
are available. WAR and WAW hazards, which arise from name dependencies,
are eliminated by register renaming.
-
Register renaming eliminates these hazards by renaming all destination
registers, including those with a pending read or write for an earlier
instruction, so that the out-of-order write does not affect any instructions
that depend on an earlier value of an operand.
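-
A minimal sketch of the renaming idea (the (op, dest, src1, src2) instruction
format and the P-register names are hypothetical): every write gets a fresh
physical register and later reads follow the newest name, so WAR and WAW
hazards disappear while true RAW dependencies are preserved.

    # Rename architectural registers to a fresh physical register per write.
    def rename(instrs):
        latest, out, next_p = {}, [], 0
        for op, dest, s1, s2 in instrs:
            s1, s2 = latest.get(s1, s1), latest.get(s2, s2)  # reads see newest name
            preg = f"P{next_p}"; next_p += 1
            latest[dest] = preg                              # fresh name per write
            out.append((op, preg, s1, s2))
        return out

    code = [("add", "R1", "R2", "R3"),
            ("sub", "R4", "R1", "R2"),   # RAW on R1 (kept)
            ("mul", "R1", "R5", "R6")]   # WAW/WAR on R1 (removed by renaming)
    print(rename(code))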
-
Register renaming is provided by the reservation stations,
which buffer the operands of instructions waiting to issue, and by the
issue logic.
-
The basic idea is that a reservation station fetches and buffers an operand
as soon as it is available, eliminating the need to get the operand from
the register.
-
The Common Data Bus (CDB) is used to bypass the registers and pass results
from the functional units directly to the reservation stations waiting for
them.
-
The steps an instruction goes through:
-
Issue - Get the next instruction from the head of the instruction queue.
If there is an empty reservation station, stick it in there. If not,
there is a structural hazard and the instruction stalls until a station
or buffer is freed. This step renames registers, eliminating WAR
and WAW hazards.
-
Execute - Watch the CDB until all the operands needed are in reservation
stations. When all the operands are ready, the operation can be executed.
By delaying instruction execution until the operands are available, RAW
hazards are avoided.
-
Several instructions could become ready in the same clock cycle for the
same functional unit. The unit will have to choose among them.
For the floating-point reservation stations, this choice may be made arbitrarily.
Loads and stores are different.
-
Loads and stores require a two-step execution process.
-
1) Compute the effective address when the base register is available;
the effective address is then placed in the load or store
buffer.
-
2) Loads in the load buffer execute as soon as the memory unit is
available. Stores in the store buffer wait for the value to be stored
before being sent to the memory unit.
-
To preserve exception behavior, no instruction is allowed to initiate execution
until all the branches that precede the instruction in program order have
completed.
-
Write Result - When the result is available, write it on the CDB and from
there into the registers and into any reservation stations (including store
buffers) waiting for this result. Stores also write data to memory
during this step: When both the address and data value are available, they
are sent to the memory unit and the store completes.
-
Once an instruction has issued and is waiting for a source operand, it
refers to the operand by the number of the reservation station to which
the instruction that will produce the value has been assigned.
-
Each reservation station has seven fields:
-
1) Op - The operation to perform on source operands S1 and S2.
-
2, 3) Qj, Qk - The reservation stations that will
produce the corresponding source operand.
-
4, 5) Vj, Vk - The value of the source operands.
-
6) A - Used to hold information for the memory address calculation for
a load or store.
-
7) Busy - Indicates that this reservation station and its accompanying
functional unit are occupied.
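-
As a minimal sketch, the seven fields map naturally onto a record like the
following (field names from the text; the types and the ready() helper are
illustrative assumptions):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReservationStation:
        op: Optional[str] = None   # Op: operation to perform on S1 and S2
        qj: Optional[str] = None   # Qj: station producing source 1 (None = ready)
        qk: Optional[str] = None   # Qk: station producing source 2 (None = ready)
        vj: Optional[int] = None   # Vj: value of source operand 1
        vk: Optional[int] = None   # Vk: value of source operand 2
        a: Optional[int] = None    # A: memory-address info for a load or store
        busy: bool = False         # Busy: station and functional unit occupied

        def ready(self):
            # Execution may begin once neither source operand is pending.
            return self.busy and self.qj is None and self.qk is None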
3.3 Dynamic Scheduling: Examples and the Algorithm (page 189)
-
Tomasulo's Algorithm: The Details (page 191)
-
Loads and stores go through a functional unit for effective address computation
before proceeding to an independent load or store buffer.
-
Loads take a second execution step to access memory and then to go to Write
Result to send the value from memory to the register file and/or any waiting
reservation stations.
-
Stores complete their execution in the Write Result stage, which writes
the result to memory.
-
Notice that all writes occur in Write Result, whether the destination is
a register or memory.
-
Tomasulo's Algorithm: A Loop-Based Example (page 192)
-
A load and store can safely be done in different order provided that they
access different addresses. If a load and a store access the same
address, then either:
-
the load is before the store in program order and interchanging them results
in a WAR hazard, or
-
the store is before the load in program order and interchanging them results
in a RAW hazard.
-
Similarly, interchanging two stores to the same address results in a WAW
hazard.
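-
As a sketch (illustrative only, not the book's hardware), the rule above
reduces to a simple address-and-kind check on two memory operations:

    # Two memory operations may be safely reordered only if their addresses
    # differ or both operations are loads.
    def can_swap(kind1, addr1, kind2, addr2):
        if addr1 != addr2:
            return True             # different locations: always safe
        if kind1 == "load" and kind2 == "load":
            return True             # RAR to the same address is not a hazard
        return False                # same address with a store involved

    print(can_swap("load", 120, "store", 124))   # True: addresses differ
    print(can_swap("load", 120, "store", 120))   # False: WAR if interchanged
    print(can_swap("store", 120, "store", 120))  # False: WAW if interchanged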
-
A major benefit of this scheme is that a dynamically scheduled pipeline
can yield very high performance, provided that branches are predicted accurately.
-
The major drawback to the Tomasulo scheme is that because it is so complex,
it requires a large amount of hardware.
-
In Tomasulo's scheme two different techniques are combined:
-
1) the renaming of the architectural registers to a larger set of registers
-
2) the buffering of source operands from the register file.
-
Tomasulo's scheme is particularly appealing if the designer is forced to
pipeline an architecture:
-
for which it is difficult to schedule code,
-
that has a shortage of registers,
-
or for which the designer wishes to obtain high performance without
pipeline-specific compilation.
-
The key components for enhancing ILP in Tomasulo's algorithm are dynamic
scheduling, register renaming, and dynamic memory disambiguation.
3.4 Reducing Branch Costs with Dynamic Hardware Prediction
-
As the amount of ILP we attempt to exploit grows, control dependencies
rapidly become the limiting factor. Although schemes in this section
are helpful in processors that try to maintain one instruction per clock,
they are critical to any processor that tries to issue more than one instruction
per clock for two reasons:
-
1) Branches will arrive up to n times faster in an n-issue processor, and
providing an instruction stream to the processor will probably require
that we predict the outcome of the branches
-
2) Amdahl's Law reminds us that the relative impact of the control stalls
will be larger with the lower potential CPI in such machines
-
The goal of all these mechanisms is to allow the processor to resolve the
outcome of a branch early, thus preventing control dependencies from causing
stalls.
-
Basic Branch Prediction and Branch-Prediction Buffers (page 197)
-
The simplest dynamic branch-prediction scheme is a branch-prediction buffer
or branch history table
-
Definition: A branch-prediction buffer is a small memory indexed
by the lower portion of the address of the branch instruction. The
memory contains a bit that says whether the branch was recently taken or
not.
-
This simple 1-bit prediction scheme has a performance shortcoming: Even
if a branch is almost always taken, we will likely predict incorrectly
twice, rather than once, when it is not taken.
-
To remedy this, 2-bit prediction schemes are often used. A prediction
must miss twice before it is changed.
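-
A minimal sketch contrasting the two schemes (illustrative, not the book's
hardware) on a loop branch that is taken four times and then falls through:
in steady state the 1-bit scheme mispredicts twice per loop execution, the
2-bit scheme only once.

    # Run a predictor over a branch-outcome trace and count mispredictions.
    def run(predict, update, state, outcomes):
        misses = 0
        for taken in outcomes:
            if predict(state) != taken:
                misses += 1
            state = update(state, taken)
        return misses

    outcomes = ([True] * 4 + [False]) * 3  # three executions of a short loop

    # 1-bit scheme: simply remember the last outcome.
    one_bit = run(lambda s: s, lambda s, t: t, True, outcomes)

    # 2-bit scheme: saturating counter 0..3, predict taken when >= 2.
    two_bit = run(lambda s: s >= 2,
                  lambda s, t: min(s + 1, 3) if t else max(s - 1, 0),
                  3, outcomes)

    print(one_bit, two_bit)  # 5 3: the 2-bit counter must miss twice to flip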
-
There are also n-bit predictors, but they perform only slightly better than
2-bit predictors, so 2-bit predictors are usually preferred because they
are simpler and require less hardware.
-
Correlating Branch Predictors (page 200)
-
It may be possible to improve the prediction accuracy if we also look at
the recent behavior of other branches rather than just the branch we are
trying to predict.
-
Definition: Branch predictors that use the behavior of other branches to
make a prediction are called correlating predictors (or two-level
predictors).
-
In the general case an (m,n) predictor uses the behavior of the last m
branches to choose from 2^m branch predictors, each of which
is an n-bit predictor for a single branch. The attraction of this
type of correlating branch predictor is that it can yield higher prediction
rates than the 2-bit scheme and requires only a trivial amount of extra
hardware (the global history of the most recent m branches can be kept in
an m-bit shift register).
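-
A minimal sketch of an (m, n) predictor with n = 2 (illustrative; a real
table would also be indexed by the branch address, which is omitted here):

    # (m, 2) correlating predictor for one branch: an m-bit global history
    # shift register selects one of 2^m independent 2-bit counters.
    class CorrelatingPredictor:
        def __init__(self, m):
            self.m = m
            self.history = 0                  # m-bit global history register
            self.counters = [1] * (2 ** m)    # one 2-bit counter per history

        def predict(self):
            return self.counters[self.history] >= 2

        def update(self, taken):
            c = self.counters[self.history]
            self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
            # Shift the outcome into the history, keeping only the low m bits.
            self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    p = CorrelatingPredictor(m=2)
    for taken in [True, False, True, False, True, False]:
        print(p.predict(), taken)   # it quickly learns the alternating pattern
        p.update(taken)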
-
Tournament Predictors: Adaptively Combining Local and Global Predictors
(page 206)
-
The primary motivation for correlating branch predictors came from the
observation that the standard 2-bit predictor using only local information
failed on some important branches and that, by adding global information,
the performance could be improved.
-
Tournament predictors take this insight to the next level, by using multiple
predictors, usually one based on global information and one based on local
information, and combining them with a selector.
-
Tournament predictors are the most popular form of multilevel branch predictors.
-
Definition: A multilevel branch predictor uses several levels of
branch-prediction tables together with an algorithm for choosing among
the multiple predictors.
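-
A minimal sketch of the selector idea (illustrative: the two components here
are simple stand-ins, a last-outcome predictor and a 2-bit counter, rather
than real global- and local-history predictors): a 2-bit chooser shifts
toward whichever component has been more accurate recently.

    class LastOutcome:
        # Stand-in "global" component: predicts the most recent outcome.
        def __init__(self):
            self.last = True
        def predict(self):
            return self.last
        def update(self, taken):
            self.last = taken

    class TwoBit:
        # Stand-in "local" component: a single 2-bit saturating counter.
        def __init__(self):
            self.c = 1
        def predict(self):
            return self.c >= 2
        def update(self, taken):
            self.c = min(self.c + 1, 3) if taken else max(self.c - 1, 0)

    class Tournament:
        def __init__(self, global_pred, local_pred):
            self.g, self.l = global_pred, local_pred
            self.sel = 1   # 2-bit chooser: 0-1 favors global, 2-3 favors local

        def predict(self):
            return self.l.predict() if self.sel >= 2 else self.g.predict()

        def update(self, taken):
            g_ok = self.g.predict() == taken
            l_ok = self.l.predict() == taken
            if l_ok and not g_ok:
                self.sel = min(self.sel + 1, 3)   # local was right: favor it
            elif g_ok and not l_ok:
                self.sel = max(self.sel - 1, 0)   # global was right: favor it
            self.g.update(taken)
            self.l.update(taken)

    t = Tournament(LastOutcome(), TwoBit())
    for taken in [True, True, False, True]:
        print(t.predict())
        t.update(taken)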
Alex Baglione @UMCP
CMSC 411 Summer 2002
Hennessy and Patterson, Computer Architecture: A Quantitative
Approach; Third Edition
Chapter 3 Notes - for educational use only