Chapter 3
3.1 Instruction-Level Parallelism: Concepts and Challenges (page 172)
-
Instruction-level parallelism - the potential overlap among instructions;
pipelining exploits it by overlapping the execution of instructions to
improve performance, since the instructions can be evaluated in parallel.
-
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data
hazard stalls + Control stalls
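-
A quick worked instance of this equation (the stall components below are
made-up numbers for illustration, not figures from the text):

    # Hypothetical per-instruction stall averages plugged into the CPI equation.
    ideal_cpi = 1.0
    structural_stalls = 0.1   # avg structural-hazard stall cycles per instruction
    data_hazard_stalls = 0.3  # avg data-hazard stall cycles per instruction
    control_stalls = 0.2      # avg control-hazard stall cycles per instruction

    pipeline_cpi = ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls
    print(pipeline_cpi)              # 1.6
    print(ideal_cpi / pipeline_cpi)  # 0.625: running at 62.5% of ideal throughput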
-
Data Dependencies and Hazards (page 174)
-
Determining how one instruction depends on another is critical to determining
how much parallelism exists in a program and how that parallelism can be
exploited.
-
If two instructions are parallel, they can execute simultaneously
in a pipeline without causing any stalls, assuming the pipeline has sufficient
resources (and hence no structural hazards exist).
-
Data Dependencies (page 175)
-
An instruction j is data dependent on instruction i if either of
the following is true:
-
instruction i produces a result that may be used by instruction j, or
-
instruction j is data dependent on instruction k, and instruction k is
data dependent on instruction i (for transitivity, i.e. chains of instructions)
-
Dependencies that flow through memory locations are more difficult to detect
since two addresses may refer to the same location but look different:
For example, 100(R4) and 20(R6) may be identical (if R4 = 20 and R6 = 100,
both name address 120).
-
Name Dependencies (page 177)
-
Occurs when two instructions use the same register or memory location,
called a name, but there is no flow of data between the instructions
associated with that name. There are two types of name dependencies
between an instruction i that precedes an instruction j in program
order:
-
An antidependence between instruction i and instruction j occurs
when instruction j writes to a register or memory location that instruction
i reads. The original ordering must be preserved to ensure that i
reads the correct value.
-
An output dependence occurs when instruction i and instruction j
write the same register or memory location. The ordering between
the instructions must be preserved to ensure that the value finally written
corresponds to instruction j.
-
Both antidependencies and output dependencies are name dependencies, as
opposed to true data dependencies, since there is no value being transmitted
between the instructions.
-
Data Hazards (assume instruction i comes before instruction j in program
order) (page 177)
-
RAW (read after write) - j tries to read a source before i writes it, so
j incorrectly gets the old value. For example: a load instruction
followed by an integer ALU instruction that directly uses the load result
will lead to a RAW hazard.
-
WAW (write after write) - j tries to write an operand before it is written
by i. The writes end up being performed in the wrong order, leaving
the value written by i rather than the value written by j in the destination.
This hazard corresponds to an output dependence.
-
WAR (write after read) - j tries to write a destination before it is read
by i, so i incorrectly gets the new value. This hazard arises from
an antidependence.
-
RAR (read after read) - This is not a hazard.
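-
Putting the cases above together, here is a minimal sketch (not from the
text) that classifies the hazards between an earlier instruction i and a
later instruction j from the register sets each one reads and writes:

    # Classify data hazards between instruction i (earlier) and j (later),
    # given the sets of registers each instruction reads and writes.
    def hazards(i_reads, i_writes, j_reads, j_writes):
        found = []
        if i_writes & j_reads:
            found.append("RAW")  # j reads what i writes (true data dependence)
        if i_reads & j_writes:
            found.append("WAR")  # j writes what i reads (antidependence)
        if i_writes & j_writes:
            found.append("WAW")  # both write the same name (output dependence)
        return found             # shared reads alone (RAR) are not a hazard

    # e.g. i: R1 <- R2 + R3, then j: R4 <- R1 + R5
    print(hazards({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"}))  # ['RAW']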
-
Control Dependencies (page 178)
-
A control dependence determines the ordering of an instruction, i, with
respect to a branch instruction so that instruction i is executed in correct
program order and only when it should be.
-
Control dependence is preserved by implementing control hazard detection
that causes stalls.
3.2 Overcoming Data Hazards with Dynamic Scheduling (page 181)
-
Dynamic Scheduling - the hardware rearranges the instruction execution
to reduce the stalls while maintaining data flow and exception behavior.
It is good because:
-
It enables handling some cases when dependencies are unknown at compile
time (e.g. because they may involve a memory reference), and that simplifies
the compiler.
-
Allows code that was compiled with one pipeline in mind to run efficiently
on a different pipeline
-
Dynamic Scheduling: The Idea (page 182)
-
A major limitation of the simple pipelining techniques we discuss in Appendix
A is that they all use in-order instruction issue and execution: Instructions
are issued in program order, and if an instruction is stalled in the pipeline,
no later instructions can proceed. If there are multiple functional
units, these units could lie idle.
-
But we want an instruction to begin execution as soon as its operands are
available. Thus, this pipeline does out-of-order execution,
which implies out-of-order completion.
-
Out-of-order execution introduces the possibility of WAR and WAW hazards,
which do not exist in the five-stage integer pipeline and its logical extension
to an in-order floating-point pipeline.
-
WAW and WAR hazards are avoided by the use of register renaming.
-
An exception is imprecise if the processor state when the exception
is raised does not look exactly as if the instructions were executed sequentially
in strict program order. Imprecise exceptions can occur because of
two possibilities:
-
The pipeline may have already completed instructions that are later
in program order than the instruction causing the exception
-
The pipeline may have not yet completed some instructions that are
earlier in program order than the instruction causing the exception
-
To allow out-of-order execution, we essentially split the ID pipe stage
of our simple five-stage pipeline into two stages:
-
Issue - decode instructions, check for structural hazards
-
Read operands - wait until no data hazards, then read operands
-
Scoreboarding is a technique for allowing instructions to execute out of
order when there are sufficient resources and no data dependencies.
-
Tomasulo's algorithm has several major enhancements over scoreboarding.
-
Dynamic Scheduling Using Tomasulo's Approach (page 184)
-
Tracks when operands for instructions are available, to minimize RAW hazards,
and introduces register renaming, to minimize WAR and WAW hazards.
-
Designed to overcome long floating point delays.
-
RAW hazards are avoided by executing an instruction only when its operands
are available. WAR and WAW hazards, which arise from name dependencies,
are eliminated by register renaming.
-
Register renaming eliminates these hazards by renaming all destination
registers, including those with a pending read or write for an earlier
instruction, so that the out-of-order write does not affect any instructions
that depend on an earlier value of an operand.
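-
A minimal sketch of the renaming idea (the (op, dest, src1, src2) instruction
format and the P-register names are hypothetical): every write gets a fresh
physical register and later reads follow the newest name, so WAR and WAW
hazards disappear while true RAW dependencies are preserved.

    # Rename architectural registers to a fresh physical register per write.
    def rename(instrs):
        latest, out, next_p = {}, [], 0
        for op, dest, s1, s2 in instrs:
            s1, s2 = latest.get(s1, s1), latest.get(s2, s2)  # reads see newest name
            preg = f"P{next_p}"; next_p += 1
            latest[dest] = preg                              # fresh name per write
            out.append((op, preg, s1, s2))
        return out

    code = [("add", "R1", "R2", "R3"),
            ("sub", "R4", "R1", "R2"),   # RAW on R1 (kept)
            ("mul", "R1", "R5", "R6")]   # WAW/WAR on R1 (removed by renaming)
    print(rename(code))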
-
Register renaming is provided by the reservation stations,
which buffer the operands of instructions waiting to issue, and by the
issue logic.
-
The basic idea is that a reservation station fetches and buffers an operand
as soon as it is available, eliminating the need to get the operand from
the register.
-
The Common Data Bus (CDB) is used to bypass the registers and pass results
from the functional units directly to the reservation stations waiting for
them.
-
The steps an instruction goes through:
-
Issue - Get the next instruction from the head of the instruction queue.
If there is an empty reservation station, stick it in there. If not,
there is a structural hazard and the instruction stalls until a station
or buffer is freed. This step renames registers, eliminating WAR
and WAW hazards.
-
Execute - Watch the CDB until all the operands needed are in reservation
stations. When all the operands are ready, the operation can be executed.
By delaying instruction execution until the operands are available, RAW
hazards are avoided.
-
Several instructions could become ready in the same clock cycle for the
same functional unit. The unit will have to choose among them.
For the floating-point reservation stations, this choice may be made arbitrarily.
Loads and stores are different.
-
Loads and stores require a two-step execution process.
-
1) Compute the effective address when the base register is available;
the effective address is then placed in the load or store
buffer.
-
2) Loads in the load buffer execute as soon as the memory unit is
available. Stores in the store buffer wait for the value to be stored
before being sent to the memory unit.
-
To preserve exception behavior, no instruction is allowed to initiate execution
until all the branches that precede the instruction in program order have
completed.
-
Write Result - When the result is available, write it on the CDB and from
there into the registers and into any reservation stations (including store
buffers) waiting for this result. Stores also write data to memory
during this step: When both the address and data value are available, they
are sent to the memory unit and the store completes.
-
Once an instruction has issued and is waiting for a source operand, it
refers to the operand by the number of the reservation station to which
the instruction that will produce the value has been assigned.
-
Each reservation station has seven fields:
-
1) Op - The operation to perform on source operands S1 and S2.
-
2, 3) Qj, Qk - The reservation stations that will
produce the corresponding source operand.
-
4, 5) Vj, Vk - The value of the source operands.
-
6) A - Used to hold information for the memory address calculation for
a load or store.
-
7) Busy - Indicates that this reservation station and its accompanying
functional unit are occupied.
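-
As a minimal sketch, the seven fields map naturally onto a record like the
following (field names from the text; the types and the ready() helper are
illustrative assumptions):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReservationStation:
        op: Optional[str] = None   # Op: operation to perform on S1 and S2
        qj: Optional[str] = None   # Qj: station producing source 1 (None = ready)
        qk: Optional[str] = None   # Qk: station producing source 2 (None = ready)
        vj: Optional[int] = None   # Vj: value of source operand 1
        vk: Optional[int] = None   # Vk: value of source operand 2
        a: Optional[int] = None    # A: memory-address info for a load or store
        busy: bool = False         # Busy: station and functional unit occupied

        def ready(self):
            # Execution may begin once neither source operand is pending.
            return self.busy and self.qj is None and self.qk is None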
3.3 Dynamic Scheduling: Examples and the Algorithm (page 189)
-
Tomasulo's Algorithm: The Details (page 191)
-
Loads and stores go through a functional unit for effective address computation
before proceeding to an independent load or store buffer.
-
Loads take a second execution step to access memory and then to go to Write
Result to send the value from memory to the register file and/or any waiting
reservation stations.
-
Stores complete their execution in the Write Result stage, which writes
the result to memory.
-
Notice that all writes occur in Write Result, whether the destination is
a register or memory.
-
Tomasulo's Algorithm: A Loop-Based Example (page 192)
-
A load and store can safely be done in different order provided that they
access different addresses. If a load and a store access the same
address, then either:
-
the load is before the store in program order and interchanging them results
in a WAR hazard, or
-
the store is before the load in program order and interchanging them results
in a RAW hazard.
-
Similarly, interchanging two stores to the same address results in a WAW
hazard.
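-
As a sketch (illustrative only, not the book's hardware), the rule above
reduces to a simple address-and-kind check on two memory operations:

    # Two memory operations may be safely reordered only if their addresses
    # differ or both operations are loads.
    def can_swap(kind1, addr1, kind2, addr2):
        if addr1 != addr2:
            return True             # different locations: always safe
        if kind1 == "load" and kind2 == "load":
            return True             # RAR to the same address is not a hazard
        return False                # same address with a store involved

    print(can_swap("load", 120, "store", 124))   # True: addresses differ
    print(can_swap("load", 120, "store", 120))   # False: WAR if interchanged
    print(can_swap("store", 120, "store", 120))  # False: WAW if interchanged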
-
A major benefit of this scheme is that a dynamically scheduled pipeline
can yield very high performance, provided that branches are predicted accurately.
-
The major drawback to the Tomasulo scheme is that because it is so complex,
it requires a large amount of hardware.
-
In Tomasulo's scheme two different techniques are combined:
-
1) the renaming of the architectural registers to a larger set of registers
-
2) the buffering of source operands from the register file.
-
Tomasulo's scheme is particularly appealing if the designer is forced to
pipeline an architecture:
-
for which it is difficult to schedule code,
-
that has a shortage of registers,
-
or for which the designer wishes to obtain high performance without
pipeline-specific compilation.
-
The key components for enhancing ILP in Tomasulo's algorithm are dynamic
scheduling, register renaming, and dynamic memory disambiguation.
3.4 Reducing Branch Costs with Dynamic Hardware Prediction
-
As the amount of ILP we attempt to exploit grows, control dependencies
rapidly become the limiting factor. Although schemes in this section
are helpful in processors that try to maintain one instruction per clock,
they are critical to any processor that tries to issue more than one instruction
per clock for two reasons:
-
1) Branches will arrive up to n times faster in an n-issue processor, and
providing an instruction stream to the processor will probably require
that we predict the outcome of the branches
-
2) Amdahl's Law reminds us that the relative impact of the control stalls
will be larger with the lower potential CPI in such machines
-
The goal of all these mechanisms is to allow the processor to resolve the
outcome of a branch early, thus preventing control dependencies from causing
stalls.
-
Basic Branch Prediction and Branch-Prediction Buffers (page 197)
-
The simplest dynamic branch-prediction scheme is a branch-prediction buffer
or branch history table
-
Definition: A branch-prediction buffer is a small memory indexed
by the lower portion of the address of the branch instruction. The
memory contains a bit that says whether the branch was recently taken or
not.
-
This simple 1-bit prediction scheme has a performance shortcoming: Even
if a branch is almost always taken, we will likely predict incorrectly
twice, rather than once, when it is not taken.
-
To remedy this, 2-bit prediction schemes are often used. A prediction
must miss twice before it is changed.
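-
A minimal sketch contrasting the two schemes (illustrative, not the book's
hardware) on a loop branch that is taken four times and then falls through:
in steady state the 1-bit scheme mispredicts twice per loop execution, the
2-bit scheme only once.

    # Run a predictor over a branch-outcome trace and count mispredictions.
    def run(predict, update, state, outcomes):
        misses = 0
        for taken in outcomes:
            if predict(state) != taken:
                misses += 1
            state = update(state, taken)
        return misses

    outcomes = ([True] * 4 + [False]) * 3  # three executions of a short loop

    # 1-bit scheme: simply remember the last outcome.
    one_bit = run(lambda s: s, lambda s, t: t, True, outcomes)

    # 2-bit scheme: saturating counter 0..3, predict taken when >= 2.
    two_bit = run(lambda s: s >= 2,
                  lambda s, t: min(s + 1, 3) if t else max(s - 1, 0),
                  3, outcomes)

    print(one_bit, two_bit)  # 5 3: the 2-bit counter must miss twice to flip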
-
There are also n-bit predictors, but they perform only slightly better than
2-bit predictors, so 2-bit predictors are usually preferred because they
are simpler and require less hardware.
-
Correlating Branch Predictors (page 200)
-
It may be possible to improve the prediction accuracy if we also look at
the recent behavior of other branches rather than just the branch we are
trying to predict.
-
Definition: Branch predictors that use the behavior of other branches to
make a prediction are called correlating predictors (or two-level
predictors).
-
In the general case an (m,n) predictor uses the behavior of the last m
branches to choose from 2^m branch predictors, each of which
is an n-bit predictor for a single branch. The attraction of this
type of correlating branch predictor is that it can yield higher prediction
rates than the 2-bit scheme and requires only a trivial amount of extra
hardware (the global history of the most recent m branches can be kept in
an m-bit shift register).
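-
A minimal sketch of an (m, n) predictor with n = 2 (illustrative; a real
table would also be indexed by the branch address, which is omitted here):

    # (m, 2) correlating predictor for one branch: an m-bit global history
    # shift register selects one of 2^m independent 2-bit counters.
    class CorrelatingPredictor:
        def __init__(self, m):
            self.m = m
            self.history = 0                  # m-bit global history register
            self.counters = [1] * (2 ** m)    # one 2-bit counter per history

        def predict(self):
            return self.counters[self.history] >= 2

        def update(self, taken):
            c = self.counters[self.history]
            self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
            # Shift the outcome into the history, keeping only the low m bits.
            self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    p = CorrelatingPredictor(m=2)
    for taken in [True, False, True, False, True, False]:
        print(p.predict(), taken)   # it quickly learns the alternating pattern
        p.update(taken)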
-
Tournament Predictors: Adaptively Combining Local and Global Predictors
(page 206)
-
The primary motivation for correlating branch predictors came from the
observation that the standard 2-bit predictor using only local information
failed on some important branches and that, by adding global information,
the performance could be improved.
-
Tournament predictors take this insight to the next level, by using multiple
predictors, usually one based on global information and one based on local
information, and combining them with a selector.
-
Tournament predictors are the most popular form of multilevel branch predictors.
-
Definition: A multilevel branch predictor uses several levels of
branch-prediction tables together with an algorithm for choosing among
the multiple predictors.
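-
A minimal sketch of the selector idea (illustrative: the two components here
are simple stand-ins, a last-outcome predictor and a 2-bit counter, rather
than real global- and local-history predictors): a 2-bit chooser shifts
toward whichever component has been more accurate recently.

    class LastOutcome:
        # Stand-in "global" component: predicts the most recent outcome.
        def __init__(self):
            self.last = True
        def predict(self):
            return self.last
        def update(self, taken):
            self.last = taken

    class TwoBit:
        # Stand-in "local" component: a single 2-bit saturating counter.
        def __init__(self):
            self.c = 1
        def predict(self):
            return self.c >= 2
        def update(self, taken):
            self.c = min(self.c + 1, 3) if taken else max(self.c - 1, 0)

    class Tournament:
        def __init__(self, global_pred, local_pred):
            self.g, self.l = global_pred, local_pred
            self.sel = 1   # 2-bit chooser: 0-1 favors global, 2-3 favors local

        def predict(self):
            return self.l.predict() if self.sel >= 2 else self.g.predict()

        def update(self, taken):
            g_ok = self.g.predict() == taken
            l_ok = self.l.predict() == taken
            if l_ok and not g_ok:
                self.sel = min(self.sel + 1, 3)   # local was right: favor it
            elif g_ok and not l_ok:
                self.sel = max(self.sel - 1, 0)   # global was right: favor it
            self.g.update(taken)
            self.l.update(taken)

    t = Tournament(LastOutcome(), TwoBit())
    for taken in [True, True, False, True]:
        print(t.predict())
        t.update(taken)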
Alex Baglione @UMCP
CMSC 411 Summer 2002
Hennessy and Patterson, Computer Architecture: A Quantitative
Approach; Third Edition
Chapter 3 Notes - for educational use only