Problems arise when we deal with multiple arrays, with some arrays accessed by rows
and some by columns. Storing the arrays row-by-row or column-by-column does
not solve the problem, because both rows and columns are used in every
iteration of the loop. If the cache is not large enough to hold all the
data, we must bring the same data into the cache again and again, which is
wasteful. We will use matrix multiplication (C = A.B, where A, B, and C are
m x p, p x n, and m x n matrices, respectively) as an example to show how to
exploit locality to improve cache performance.
1. Principle of Locality
Since code is generally executed sequentially, virtually all programs
repeat sections of code and repeatedly access the same or nearby data. This
characteristic is embodied in the Principle of Locality, which has been
found empirically to hold for most programs. It applies to both
instruction references and data references, though it is more pronounced in
instruction references. It has two main aspects:
- Temporal locality (locality in time) -- individual locations, once
referenced, are likely to be referenced again in the near future.
- Spatial locality (locality in space) -- references are likely to be near
the last reference, including the very next location.
Temporal locality is found in instruction loops, data stacks, and
variable accesses. Spatial locality describes the characteristic that
programs access a number of distinct but nearby regions. Sequential locality,
in which sequential locations are referenced, is a main attribute of program
construction. It can also be seen in data accesses, as data items are often
stored in sequential locations.
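As a concrete illustration, the short C fragment below (a minimal sketch;
the function name and types are arbitrary) exhibits both kinds of locality:
    /* s and i are re-referenced on every iteration (temporal locality),
       while a[0], a[1], ... are consecutive locations (spatial locality). */
    double sum(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }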
- Taking advantage of temporal locality
When instructions are formed into loops that are executed many times,
the length of a loop is usually quite small. Therefore, once a cache is
loaded with a loop of instructions from the main memory, the instructions
are used more than once before new instructions are required from the main
memory. The same situation applies to data: data is repeatedly accessed.
Suppose a reference is repeated n times in all during a program loop and,
after the first reference, the location is always found in the cache; then
the average access time would be:
t_a = (n*t_cache + t_main)/n = t_cache + t_main/n
where n is the number of references, t_cache is the cache access time, and
t_main is the main memory access time. As n increases, the average access
time decreases. The increase in speed will, of course, depend upon the
program: some programs have a large amount of temporal locality, while
others have less, and we can optimize with this in mind.
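For example, with assumed illustrative values t_cache = 1 ns and
t_main = 100 ns, a single reference costs 101 ns, but after n = 100
repetitions the average falls to t_a = 1 + 100/100 = 2 ns, close to the
speed of the cache itself.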
- Taking advantage of spatial locality
To take advantage of spatial locality, we transfer not just one
byte or word from the main memory to the cache (and vice versa) but a
series of sequential locations called a block. We have assumed that it is
necessary to reference the cache before a reference is made to the main
memory to fetch a word, and it is usual to look into the cache first to see
if the information is held there.
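A rough C sketch of the effect (assuming row-major arrays and a cache block
holding several words; the array size N is arbitrary):
    #define N 512  /* assumed array size for illustration */

    double sum_elements(double a[N][N], int by_rows) {
        double total = 0.0;
        if (by_rows) {
            /* Row order: a[i][j] and a[i][j+1] are adjacent in memory,
               so each block brought into the cache serves several accesses. */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    total += a[i][j];
        } else {
            /* Column order: successive accesses are N doubles apart,
               so nearly every access can fetch a fresh block. */
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    total += a[i][j];
        }
        return total;
    }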
2. Data Blocking
For the matrix multiplication C = A.B, suppose we write the code as below:
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double r = 0;
            for (int k = 0; k < p; k++)
                r += A[i][k] * B[k][j];  /* dot product of row i of A and column j of B */
            C[i][j] = r;
        }
The two inner loops read all p x n elements of B, access the same p
elements in a row of A repeatedly, and write one row of n elements of C.
The number of capacity misses clearly depends on the dimension parameters
m, n, p and on the size of the cache. If the cache can hold all three
matrices, then all is well, provided there are no cache conflicts. In the
worst case, there would be (2*m*n*p + m*n) words read from memory for
m*n*p operations.
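For example, with assumed dimensions m = n = p = 1000, the worst case is
2*10^9 + 10^6, i.e. about 2 x 10^9 words of memory traffic for 10^9
multiply-add operations: roughly two memory words per operation.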
To enhance cache performance when the cache is not big enough, we use an
optimization technique: blocking. The block method for this matrix product
consists of:
- Splitting the result matrix C into blocks C_IJ of size Nb x Nb; each
block is computed in a contiguous array Cb, which is then copied back
into the corresponding C_IJ.
- Splitting matrices A and B into panels A_I and B_J of size Nb x p and
p x Nb; each panel is copied into the contiguous arrays Ab and Bb. The
choice of Nb must ensure that Cb, Ab and Bb together fit into one level
of cache, usually the L2 cache (see the sizing example after this list).
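For example, with assumed values Nb = 64, p = 500, and 8-byte words, Cb
occupies 64*64*8 = 32 KB and each panel Ab, Bb occupies 64*500*8 = 256,000
bytes (about 250 KB), so all three together (roughly 532 KB) fit
comfortably in a 1 MB L2 cache.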
Then we rewrite the code as (here "=" denotes matrix assignment):
    for (I = 0; I < m/Nb; I++) {
        Ab = A_I;                        /* copy panel A_I into Ab */
        for (J = 0; J < n/Nb; J++) {
            Bb = B_J; Cb = 0;            /* copy panel B_J into Bb, clear the block */
            for (K = 0; K < p/Nb; K++)
                Cb = Cb + Ab_K * Bb_K;   /* product of Nb x Nb sub-blocks */
            C_IJ = Cb;                   /* copy the finished block back into C */
        }
    }
We suppose for simplicity
that Nb divides m, n and p. The figure below may help you in understanding
the operations performed on blocks. With this algorithm, matrix A
is loaded into cache only once, compared with the n accesses of the
original version, while matrix B is still loaded once per block row of C,
i.e. m/Nb times. This simple block
method greatly reduces memory accesses, and real codes may choose, based on
the matrix sizes, which loop structure (ijk vs. jik) is most appropriate and
whether some matrix operand fits entirely into cache.
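A minimal C sketch of this blocking scheme is given below, assuming
row-major storage and that Nb divides m, n and p; for brevity it works on
the blocks of A, B and C in place rather than copying the panels into the
contiguous buffers Ab, Bb, Cb described above:
    #define Nb 64  /* block size, chosen so the working blocks fit in cache */

    void matmul_blocked(int m, int n, int p,
                        double A[m][p], double B[p][n], double C[m][n])
    {
        for (int i = 0; i < m; i++)              /* clear C */
            for (int j = 0; j < n; j++)
                C[i][j] = 0.0;

        for (int I = 0; I < m; I += Nb)          /* block row of C */
            for (int J = 0; J < n; J += Nb)      /* block column of C */
                for (int K = 0; K < p; K += Nb)  /* Cb = Cb + Ab_K * Bb_K */
                    for (int i = I; i < I + Nb; i++)
                        for (int k = K; k < K + Nb; k++) {
                            double a = A[i][k];  /* reused across the j loop */
                            for (int j = J; j < J + Nb; j++)
                                C[i][j] += a * B[k][j];
                        }
    }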
So far we have not discussed use of the L1 cache. In fact, L1 will
generally be too small to hold a C_IJ block plus one panel each of A and B,
but remember that the operation Cb = Cb + Ab_K * Bb_K is itself a
matrix-matrix product, so each operand Ab_K and Bb_K is accessed Nb times:
this part could also use a block method. Since Nb is relatively small, the
implementation may load only one of Cb, Ab_K, and Bb_K into the L1 cache
and work with the others from L2.
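For example, with an assumed Nb = 64 and 8-byte words, each Nb x Nb
sub-block occupies 64*64*8 = 32 KB, so a single operand already fills a
typical 32 KB L1 data cache; this is why only one of Cb, Ab_K, Bb_K is kept
there while the others stream from L2.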