It would be helpful to put together a "JSR-133 Cookbook for Compiler
Writers" as an unofficial accompaniment to the JSR-133 specs. The
idea is to provide concrete guidance that compiler writers can use
without having to figure most things out from the underlying models
and specs. The main goal is to help make sure compilers and JVMs
implement JSR-133 correctly. (Actually, my motivation for writing this
is that I've been dealing with interactions of volatiles, locks, and
atomic instructions in preliminary implementation of new JSR-166
features, and wished I had something like this around.)
Here's a draft. Comments, suggestions, and corrections would be very
welcome. After collecting feedback, I'll probably put this up as a
web page.
Contents
1. Reorderings
2. Memory Barriers
3. Multiprocessors
4. Barrier Instructions
5. Interactions with Synchronization
6. Barrier Recipes
1. REORDERINGS
For a compiler writer, the JMM mainly consists of rules disallowing
certain reorderings of instructions that access fields (where "fields"
also refers to array elements). The main JMM "Roach Motel" rules can
be viewed as a matrix:
Can Reorder        2nd insn
1st insn          NL    NS    VL or ME    VS or MX
NL                -     -     -           N
NS                -     -     -           N
VL or ME          N     N     N           N
VS or MX          -     -     N           N
where
NL = normal (non-volatile) getfield, getstatic, array load
NS = normal (non-volatile) putfield, putstatic, array store
VL = volatile getfield, getstatic
VS = volatile putfield, putstatic
ME = MonitorEnter, entry to sync method.
MX = Monitor Exit, exit from sync method
The cells for volatile loads are the same as those for MonitorEnter,
and the cells for volatile stores are the same as those for
MonitorExit, so they are collapsed together here.
Any number of other operations might be present between the indicated
1st and 2nd instructions. So, for example, the N in cell [NS, VS] says
that a non-volatile store cannot be reordered with ANY subsequent
volatile store.
Blank cells in the table mean that the reordering is allowed if the
accesses aren't otherwise dependent with respect to basic Java
semantics. For example, even though the table doesn't say so, you can't
reorder a load with a subsequent store to the same location. But you
can if the locations are guaranteed to be distinct, and may wish to do
so in the course of various compiler transformations and
optimizations. However, in all cases reorderings must maintain
minimal Java safety properties even when accesses are incorrectly
synchronized by programmers: All observed field values must be either
the default zero/null "pre-initialization" values, or those written by
some thread. This usually entails zeroing all heap memory holding
objects before it is used in constructors and never reordering other
loads with the zeroing stores.
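As a Java-level illustration of the Roach Motel rules, consider the classic flag-based handoff (class and field names here are just for illustration):

```java
// Per cell [NS, VS], the normal store to data cannot be reordered
// below the volatile store to ready. Per cell [VL, NL], the volatile
// load of ready cannot be reordered below the subsequent normal load
// of data. So a reader that sees ready == true must also see data == 42.
class Handoff {
    int data;               // normal (non-volatile) field
    volatile boolean ready; // volatile field

    void writer() {
        data = 42;     // normal store (NS)
        ready = true;  // volatile store (VS); the NS above may not pass it
    }

    int reader() {
        while (!ready) { } // spin on volatile load (VL)
        return data;       // normal load (NL); may not move above the VL
    }
}
```

The blank cells permit a compiler to reorder, say, two normal stores to distinct locations in writer-like code; only the volatile accesses pin things down.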
Loads and Stores of final fields act as "normal" accesses with respect to
locks and volatiles above, but impose two additional reordering rules.
1. A store of a final field (inside a constructor) cannot be reordered
with a subsequent store (outside that constructor) of the reference
to the object holding that field into a variable accessible to
other threads.
That is, you cannot reorder
x.finalField = v; ... ; sharedRef = x;
This comes into play for example when inlining constructors, where you
cannot move stores of finals within constructors down below a store
outside of the constructor that might make the object visible to other
threads. (As seen below, this may also require issuing a barrier).
2. The initial load of a final field cannot be reordered with the
initial load of the reference to the object containing the
final field.
This comes into play in
x = sharedRef; ... ; i = x.finalField;
A compiler would never reorder these since they are dependent (also,
they don't require a barrier). The rule is listed here anyway
to illustrate that reliable use of final fields by Java programmers
requires that the load of sharedRef itself be synchronized, volatile,
or final, or derived from such a load, thus ultimately ordering the
initial stores in constructors with subsequent uses outside
constructors.
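At the source level, the two final-field rules correspond to the standard publication idiom (the class and field names here are illustrative):

```java
// Rule 1: the store to x inside the constructor may not be reordered
// past the later store that publishes the object through sharedRef
// (conceptually: x store; StoreStore; sharedRef store).
// Rule 2: a reader's load of sharedRef may not be reordered with its
// subsequent load of sharedRef.x.
class Point {
    final int x;
    Point(int x) { this.x = x; } // store of a final field
}

class Publication {
    static Point sharedRef; // deliberately non-volatile

    static void publish() {
        sharedRef = new Point(17); // publication store
    }
}
```

Note that the guarantee covers only the final field x itself; non-final fields published this way get no such ordering.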
2. MEMORY BARRIERS
Compilers and processors must both obey reordering rules. No
particular effort is required to ensure that uniprocessors maintain
proper ordering, since they all guarantee "as-if-sequential"
consistency. But on multiprocessors, conformance requires emitting
barrier instructions. Even if a compiler optimizes away a field
access (for example because a loaded value is not used), barriers must
still be generated as if the access were still present. (Although see
below about independently optimizing away barriers.)
Memory barriers are only indirectly related to higher-level notions
described in JMM specs such as Acquire and Release. And memory
barriers are not themselves "synchronization barriers". And memory
barriers are unrelated to the kinds of "write barriers" used in some
garbage collectors.
Memory barrier instructions directly control only the interaction of a
CPU with its cache, with its write-buffer that holds writes waiting to
be flushed to memory, and/or its buffer of waiting loads or
speculatively executed instructions. (However, these effects may lead
to further interaction among caches, main memory and processors.)
A property of memory barriers that takes some getting used to is that
barrier instructions are usually designed to be placed BETWEEN memory
accesses. Despite the names given for barrier instructions on some
processors, the right/best barrier to use depends on the kinds of
accesses they separate. Here's a common categorization of barrier
types that maps pretty well to specific instructions (sometimes
no-ops) on existing processors:
LoadLoad
"Load1; LoadLoad; Load2" ensures that Load1's data are loaded before
data accessed by Load2 and all subsequent load instructions are
loaded. In general, LoadLoad barriers are needed on processors that
perform speculative loads and/or out-of-order processing.
StoreStore
"Store1; StoreStore; Store2" ensures that Store1's data are flushed
to memory before the data associated with Store2 and all subsequent
store instructions are flushed. In general, StoreStore barriers are
needed on processors that do not otherwise guarantee strict ordering of
flushes from write buffers and/or caches to memory.
StoreLoad
"Store1; StoreLoad; Load2" ensures that Store1's data are flushed to
main memory before data accessed by Load2 and all subsequent load
instructions are loaded. In particular, StoreLoad barriers protect
against a subsequent load incorrectly using Store1's data value
rather than that from a more recent store performed by a different
processor. Because of this, a StoreLoad is strictly necessary only
for separating stores from subsequent loads of the SAME
location(s). StoreLoad barriers are needed on all recent
multiprocessors, and are usually the most expensive kind. Part of
the reason they are expensive is that they must disable mechanisms
that ordinarily bypass cache to satisfy loads from write-buffers.
This may be implemented by letting the write buffer fully drain, or
by other equally time-consuming operations.
LoadStore
"Load1; LoadStore; Store2" ensures that Load1's data are loaded
before all data associated with Store2 and subsequent store
instructions are flushed. No processors appear to need LoadStore
barriers, presumably because they must deal with Load/Store
dependencies for cache control anyway.
Since LoadStore is apparently never needed, and so is always a no-op,
the following table summarizes how barriers can be used to implement
the JSR-133 ordering rules:
          NL         NS         VL/ME       VS/MX
NL        -          -          -           (no-op)
NS        -          -          -           StoreStore
VL/ME     LoadLoad   (no-op)    LoadLoad    (no-op)
VS/MX     -          -          StoreLoad   StoreStore
Plus the special final-field rule requiring a StoreStore barrier in
x.finalField = v; StoreStore; sharedRef = x;
On all processors, it turns out that instructions that perform
StoreLoad also obtain the effects of both StoreStore and LoadLoad. So
StoreLoad can serve as a sort of universal (but usually expensive)
barrier. Note though that the opposite doesn't usually hold. It is NOT
in general the case that issuing both a StoreStore and LoadLoad gives
the equivalent of a StoreLoad.
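The need for StoreLoad shows up in Dekker-style code, where each thread stores one volatile and then loads a different one. This sketch just shows the pattern; the interleaving that requires the barrier only arises with two concurrent threads:

```java
// Per cell [VS, VL], a StoreLoad barrier must separate each volatile
// store from the following volatile load of a different location.
// Without it, both threads' stores could be sitting in write buffers
// while their loads read the old values, so both could observe false.
class Dekker {
    volatile boolean a, b;

    boolean thread1() {
        a = true;   // volatile store
        // StoreLoad barrier conceptually issued here
        return b;   // volatile load of a DIFFERENT location
    }

    boolean thread2() {
        b = true;
        // StoreLoad barrier conceptually issued here
        return a;
    }
}
```

Run sequentially the results are trivially determined; run concurrently, JSR-133 guarantees that at least one of the two calls returns true, and that guarantee is precisely what the StoreLoad barrier buys.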
3. MULTIPROCESSORS
Here's a listing of processors that are commonly used in MPs, along
with URLs where I found most of the information (some require some
clicking around from the listed site). This isn't an exhaustive list,
but it includes those used in all current and near-future
multiprocessor Java implementations I know of. The list and the
properties of processors described below are not definitive. In some
cases I'm just reporting what I read, and could have misread. Please
help make it definitive.
sparc-TSO
Ultrasparc 1, 2, 3 (sparcv9) in TSO (Total Store Order) mode
(Ultra3 only supports TSO. RMO in Ultra1/2 is never used so can be ignored.)
UltraSPARC(R) III Cu User's Manual
http://www.sun.com/processors/manuals/index.html
x86-PO
486, Pentium, P2, P3, Athlon and others that do NOT support
"SSE2" instructions.
Intel calls consistency properties for these "Processor Ordering" (PO)
The IA-32 Intel(R) Architecture Software
Developers Manual, Volume 3: System Programming Guide
http://developer.intel.com/design/pentium4/manuals/245472.htm
x86-SPO
P4, P4 with hyperthreading, Xeon, Opteron, and others that
DO support "SSE2" instructions.
Intel calls consistency properties for these "Speculative
Processor Ordering" (SPO)
See above, plus:
AMD x86-64 Architecture Programmer's Manual Volume 2: System Programming
http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_875_7044,00.html
ia64
Itanium
Intel(R) Itanium(R) Architecture Software Developer's Manual, Volume 2: System Architecture, Revision 2.1
http://developer.intel.com/design/itanium/downloads/245318.htm
alpha
21264x and I think all others
Alpha Architecture Handbook
http://www.support.compaq.com/alpha-tools/documentation/current/chip-docs.html
ppc
6xx, 7xx, 7xxx
MPC603e RISC Microprocessor Users Manual
http://www.motorola.com/PowerPC/
MPC7410/MPC7400 RISC Microprocessor Users Manual
http://www.motorola.com/PowerPC/
PowerPC Microprocessor Family Software Reference Manual
http://www-3.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600719DF2
ppc64
64bit POWER4 systems
I can't locate an online reference manual. For discussion of barriers see:
http://www-1.ibm.com/servers/esdd/articles/power4_mem.html
http://www-1.ibm.com/servers/esdd/articles/powerpc.html
eppc
"Book-E" enhanced powerpc / PowerPC-440 / Motorola-e500 "G5"
Book E- Enhanced PowerPC Architecture
http://www-3.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600682CC7
EREF: A Reference for Motorola Book E and the e500 Core
http://e-www.motorola.com/webapp/sps/site/overview.jsp?nodeId=03M943030450467M0ys3k3KQ
4. BARRIER INSTRUCTIONS
Here's how JMM reordering rules translate to barrier instructions on
different processors:
Processor LoadLoad StoreStore StoreLoad
sparc-TSO no-op no-op membar(StoreLoad)
x86-PO no-op no-op lock-prefixed-op/cpuid
x86-SPO lfence no-op mfence
ia64 ld.acq st.rel mf
alpha mb wmb mb
ppc isync eieio/sync sync
ppc64 isync eieio/lwsync sync
eppc isync mbar/msync msync
Notes:
* Some of the listed instructions have stronger barrier properties
than actually needed in the indicated cells, but seem to be the
cheapest way to get desired effects.
* The listed instructions are those designed for use with normal
program memory, but not necessarily other special forms/modes of
caching and memory used for IO and system tasks. For example, on
x86-SPO, StoreStore barriers ("sfence") are needed with
WriteCombining (WC) caching mode. But this mode is designed for
use in system-level bulk transfers etc. OSes use Writeback mode
for programs and data, which doesn't require StoreStore barriers.
* Some forms of some barriers on some processors have side effects
on registers or status flags that may have the effect here of
increasing their effective cost.
* On x86 (both PO and SPO), any lock-prefixed instruction can be
used as a StoreLoad barrier. The cpuid instruction also works but
is slower. On x86-SPO, mfence is usually better than either unless
a lock-prefixed instruction like CAS is needed anyway.
* On ia64, LoadLoad and StoreStore barriers are folded into special
forms of load and store instructions -- there aren't separate
LoadLoad or StoreStore instructions. ld.acq acts as (load;
LoadLoadBarrier) and st.rel acts as (StoreStoreBarrier; store).
Note that neither of these provide a StoreLoad barrier -- you
need a separate mf instruction for that.
* All ppcs have the same basic (weak) consistency properties. Newer
ones seem to differ mainly in providing somewhat cheaper barrier
instructions (as well as further options for special caching modes
that aren't used for program data). I don't know when (if ever)
each of the options listed for StoreStore in ppcs is best, but I
think it is the first one listed in each cell.
5. INTERACTIONS WITH SYNCHRONIZATION
The kinds of barriers needed on different processors further interact
with the implementation of MonitorEnter and MonitorExit. Locking and
unlocking usually entails the use of atomic operations compareAndSwap
(CAS) or LoadLinked/StoreConditional (LL/SC) that have the semantics
of performing a volatile load followed by a volatile store. On all
processors, atomic operations protect against read-after-write
problems for the locations being conditionally updated. (Otherwise
standard loop-until-success constructions wouldn't work in the desired
way.) However, they differ in whether they provide more general
barrier properties as well. On some processors these instructions also
intrinsically perform barriers that would otherwise be needed for
MonitorEnter/Exit; on others some or all of these barriers must be
specifically issued.
Volatiles and Monitors have to be separated to disentangle these
effects, giving:
        NL         NS        VL         VS          ME          MX
NL      -          -         -          (no-op)     -           (no-op)
NS      -          -         -          StoreStore  -           StoreExit
VL      LoadLoad   (no-op)   LoadLoad   (no-op)     LoadEnter   (no-op)
VS      -          -         StoreLoad  StoreStore  StoreEnter  StoreExit
ME      EnterLoad  (no-op)   EnterLoad  (no-op)     EnterEnter  (no-op)
MX      -          -         ExitLoad   ExitStore   ExitEnter   ExitExit
Plus the special final-field rule requiring a StoreStore barrier in
x.finalField = v; StoreStore; sharedRef = x;
In this table, "Enter" is the same as "Load" and "Exit" is the same as
"Store", unless overridden by the use and nature of atomic
instructions. In particular:
EnterLoad is the same as LoadLoad unless an atomic instruction is
used in MonitorEnter and itself provides a barrier with at least
the properties of LoadLoad, in which case it is a no-op.
StoreExit is the same as StoreStore unless an atomic instruction
is used in MonitorExit and itself provides a barrier with at least
the properties of StoreStore, in which case it is a no-op.
The types other than EnterLoad and StoreExit are unlikely to play a role
in compilation (see below) but are listed for completeness.
Java-level access to atomic conditional update operations will be
available in JDK1.5 via JSR-166 (see
http://gee.cs.oswego.edu/dl/concurrency-interest/) so compilers will
need to issue associated code, using a variant of the above table that
collapses ME and MX, and requires that they be implemented using
atomic instructions.
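For example, using the java.util.concurrent.atomic classes arriving with JSR-166, a loop-until-success conditional update looks like the following sketch (the helper class is illustrative, not part of the API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Loop-until-success conditional update via CAS. Each get() acts as a
// volatile load, and each compareAndSet() acts as a volatile load
// followed by a volatile store, so compilers must issue the barriers
// called for by the collapsed-ME/MX variant of the table above.
class CasLoop {
    static int incrementAndGet(AtomicInteger v) {
        for (;;) {
            int cur = v.get();                  // volatile load
            if (v.compareAndSet(cur, cur + 1))  // atomic conditional store
                return cur + 1;                 // success: done
            // failure: another thread won the race; retry
        }
    }
}
```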
Here's how atomic operations are supported on multiprocessors:
Processor Flavor Instructions Barrier?
sparc-TSO CAS casa full
x86-PO CAS cmpxchg full
x86-SPO CAS cmpxchg full
ia64 CAS cmpxchg target + acq or rel
alpha LL/SC ldx_l/stx_c target only
ppc LL/SC ldarx/stwcx target only
ppc64 LL/SC ldarx/stwcx target only
eppc LL/SC ldarx/stwcx target only
Notes:
* On sparc and x86, CAS has implicit preceding and trailing full
StoreLoad barriers. The sparcv9 architecture manual says CAS need
not have post-StoreLoad barrier property, but the chip manuals
indicate that it does on ultrasparcs.
* On ppc and alpha, LL/SC have implicit barriers only with respect
to the locations being loaded/stored, but don't have more general
barrier properties.
* The ia64 cmpxchg instruction also has implicit barriers with
respect to the locations being loaded/stored, but additionally
takes an optional .acq (post-LoadLoad) or .rel (pre-StoreStore)
modifier. For example, cmpxchg.acq can be used for MonitorEnter,
and cmpxchg.rel for MonitorExit. There is not a form providing
StoreLoad.
* Some processors also have additional atomic instructions, for
example fetch-and-add on x86 and ia64, that typically have the
same memory properties as CAS.
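Fetch-and-add as surfaced through JSR-166 can be sketched as below (on x86 this can map to a lock-prefixed xadd, with the same memory properties as CAS; the demo class is purely illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// getAndAdd atomically returns the old value and stores old + delta,
// as a single read-modify-write with CAS-like memory semantics.
class FetchAndAdd {
    static int[] demo() {
        AtomicInteger counter = new AtomicInteger(5);
        int prev = counter.getAndAdd(3);        // atomic read-modify-write
        return new int[] { prev, counter.get() };
    }
}
```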
JSR-133 also addresses a few other issues that may entail barriers in
more specialized cases involving synchronization:
* Thread.start() and Thread exit require barriers that are
normally generated by the synchronization entailed in
implementations of these constructs.
* Static final initialization may require barriers that are
normally needed anyway to obey Java class loading and
initialization rules.
* Ensuring default zero/null initial field values normally
entails barriers, synchronization, and/or low-level
cache control within garbage collectors.
* Calls to and returns from JNI routines may require barriers,
although this seems to be a quality of implementation issue.
* Most processors have other synchronizing instructions designed
primarily for use with IO and OS actions. These don't impact JMM
issues directly, but may be involved in IO, class loading, and
dynamic code generation.
6. BARRIER RECIPES
Since barrier instructions apply between different kinds of accesses,
finding an "optimal" placement that minimizes the total number of
executed barriers is all but impossible. Compilers often cannot tell
if a given load or store will be preceded or followed by another that
requires a barrier; for example, when a volatile store is followed by
a return. The easiest conservative strategy is to assume, when
generating code for any given load or store, that the kind of access
requiring the "heaviest" kind of barrier will occur:
0. If on a uniprocessor:
* Don't generate barrier instructions, unless object memory is
somehow shared with asynchronously accessible IO memory (or,
conceivably, if context switching doesn't entail sufficient
synchronization).
1. On all processors (since all require StoreLoad barriers):
* Issue a StoreLoad barrier after each volatile store.
(Note that you could instead issue one before each volatile
load, but this would be much slower for typical programs
using volatiles in which reads greatly outnumber writes.)
* ALTERNATIVELY, implement volatile store as CAS or LL/SC
(looping until success) and omit the barrier. This may be more
efficient if these instructions are cheaper than StoreLoad
barriers.
2. If the processor requires LoadLoad barriers:
* Issue a LoadLoad barrier after each volatile load
(Here and in remaining steps, on ia64 you need to instead fold
barriers into corresponding load or store instructions.)
* Issue an EnterLoad (i.e., LoadLoad) barrier after each MonitorEnter
unless the implementation of MonitorEnter uses an atomic
instruction that supplies such a barrier.
3. If the processor requires StoreStore barriers:
* Issue a StoreStore barrier before each volatile store
* Issue a StoreExit (i.e. StoreStore) barrier before each
MonitorExit unless the implementation of MonitorExit uses an
atomic instruction that supplies such a barrier.
* Either issue a StoreStore barrier after each store of a final
field in a constructor, or issue one before return from any
constructor for any class with a final field.
For the simplest examples, basic conformance to JSR-133 on x86-PO or
sparc-TSO using CAS for lock and unlock amounts only to placing a
StoreLoad membar after volatile stores.
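The "ALTERNATIVELY" option in step 1 can be sketched in Java terms as a hypothetical wrapper (a JVM would of course emit the CAS loop directly rather than go through such a class):

```java
import java.util.concurrent.atomic.AtomicReference;

// Emulating a volatile store as a loop-until-success CAS, per the
// alternative in step 1. On processors where the atomic instruction
// already carries full barrier properties, this avoids issuing a
// separate StoreLoad barrier after the store.
class CasBackedCell<T> {
    private final AtomicReference<T> cell = new AtomicReference<T>();

    void store(T v) { // volatile-store semantics, no trailing barrier needed
        T old;
        do { old = cell.get(); } while (!cell.compareAndSet(old, v));
    }

    T load() { return cell.get(); } // plain volatile load
}
```

Whether this wins depends entirely on the relative cost of the atomic instruction versus the StoreLoad barrier on the target processor.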
This conservative strategy is likely to perform acceptably for most
programs, but can be improved in at least the following ways:
* Removing redundant barriers. The above tables indicate that
barriers can be eliminated as follows:
LoadLoad [no loads] LoadLoad => [no loads] LoadLoad
LoadLoad [no loads] StoreLoad => [no loads] StoreLoad
StoreStore [no stores] StoreStore => [no stores] StoreStore
StoreStore [no stores] StoreLoad => [no stores] StoreLoad
StoreLoad [no vloads] StoreLoad => [no vloads] StoreLoad
StoreLoad [no loads] LoadLoad => StoreLoad [no loads]
StoreLoad [no stores] StoreStore => StoreLoad [no stores]
Further eliminations may be possible depending on how locks are
implemented.
Doing this in the presence of loops, calls, and branches is left
as an exercise for the reader. :-)
* Specifically optimizing StoreLoads. A StoreLoad is strictly
necessary only between stores and subsequent loads of the SAME
location(s). For example, assuming distinct (unaliased) locations
a, b, c:
store a; StoreLoad; load b; store c; StoreLoad =>
store a; load b; store c; StoreLoad
* Moving the point in the instruction stream that the barriers
are issued, to improve scheduling, so long as they still
occur somewhere in the interval they are required.
* Removing barriers that aren't needed because there is no
possibility that multiple threads could rely on them; for example
volatiles that are provably visible only from a single thread.
This usually requires a fair amount of analysis.
--
Doug Lea, Computer Science Department, SUNY Oswego, Oswego, NY 13126 USA
dl@cs.oswego.edu  315-312-2688  FAX: 315-312-5424  http://gee.cs.oswego.edu/
-------------------------------
JavaMemoryModel mailing list - http://www.cs.umd.edu/~pugh/java/memoryModel