It would be helpful to put together a "JSR-133 Cookbook for Compiler
Writers" as an unofficial accompaniment to the JSR-133 specs. The
idea is to provide concrete guidance that compiler writers can use
without having to figure most things out from the underlying models
and specs. The main goal is to help make sure compilers and JVMs
implement JSR-133 correctly. (Actually, my motivation for writing this
is that I've been dealing with interactions of volatiles, locks, and
atomic instructions in preliminary implementation of new JSR-166
features, and wished I had something like this around.)
Here's a draft. Comments, suggestions, and corrections would be very
welcome. After collecting feedback, I'll probably put this up as a
web page.
Contents
1. Reorderings
2. Memory Barriers
3. Multiprocessors
4. Barrier Instructions
5. Interactions with Synchronization
6. Barrier Recipes
1. REORDERINGS
For a compiler writer, the JMM mainly consists of rules disallowing
certain reorderings of instructions that access fields (where "fields"
also refers to array elements). The main JMM "Roach Motel" rules can
be viewed as a matrix:
Can Reorder        2nd insn
1st insn          NL    NS    VL or ME    VS or MX
NL                -     -     -           N
NS                -     -     -           N
VL or ME          N     N     N           N
VS or MX          -     -     N           N
where
NL = normal (non-volatile) getfield, getstatic, array load
NS = normal (non-volatile) putfield, putstatic, array store
VL = volatile getfield, getstatic
VS = volatile putfield, putstatic
ME = MonitorEnter, entry to sync method.
MX = Monitor Exit, exit from sync method
The cells for volatile loads are the same as those for MonitorEnter,
and the cells for volatile stores are the same as those for
MonitorExit, so they are collapsed together here.
Any number of other operations might be present between the indicated
1st and 2nd instructions. So, for example, the N in cell [NS, VS] says
that a non-volatile store cannot be reordered with ANY subsequent
volatile store.
Blank cells in the table mean that the reordering is allowed if the
accesses aren't otherwise dependent with respect to basic Java
semantics. For example, even though the table doesn't say so, you can't
reorder a load with a subsequent store to the same location. But you
can if the locations are guaranteed to be distinct, and may wish to do
so in the course of various compiler transformations and
optimizations. However, in all cases reorderings must maintain
minimal Java safety properties even when accesses are incorrectly
synchronized by programmers: All observed field values must be either
the default zero/null "pre-initialization" values, or those written by
some thread. This usually entails zeroing all heap memory holding
objects before it is used in constructors and never reordering other
loads with the zeroing stores.
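As a Java-level illustration of the Roach Motel rules, consider the classic flag-based handoff (class and field names here are just for illustration):

```java
// Per cell [NS, VS], the normal store to data cannot be reordered
// below the volatile store to ready. Per cell [VL, NL], the volatile
// load of ready cannot be reordered below the subsequent normal load
// of data. So a reader that sees ready == true must also see data == 42.
class Handoff {
    int data;               // normal (non-volatile) field
    volatile boolean ready; // volatile field

    void writer() {
        data = 42;     // normal store (NS)
        ready = true;  // volatile store (VS); the NS above may not pass it
    }

    int reader() {
        while (!ready) { } // spin on volatile load (VL)
        return data;       // normal load (NL); may not move above the VL
    }
}
```

The blank cells permit a compiler to reorder, say, two normal stores to distinct locations in writer-like code; only the volatile accesses pin things down.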
Loads and Stores of final fields act as "normal" accesses with respect to
locks and volatiles above, but impose two additional reordering rules.
1. A store of a final field (inside a constructor) cannot be reordered
with a subsequent store (outside that constructor) of the reference
to the object holding that field into a variable accessible to
other threads.
That is, you cannot reorder
x.finalField = v; ... ; sharedRef = x;
This comes into play for example when inlining constructors, where you
cannot move stores of finals within constructors down below a store
outside of the constructor that might make the object visible to other
threads. (As seen below, this may also require issuing a barrier).
2. The initial load of a final field cannot be reordered with the
initial load of the reference to the object containing the
final field.
This comes into play in
x = sharedRef; ... ; i = x.finalField;
A compiler would never reorder these since they are dependent (also,
they don't require a barrier). The rule is listed here anyway
to illustrate that reliable use of final fields by Java programmers
requires that the load of sharedRef itself be synchronized, volatile,
or final, or derived from such a load, thus ultimately ordering the
initial stores in constructors with subsequent uses outside
constructors.
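At the source level, the two final-field rules correspond to the standard publication idiom (the class and field names here are illustrative):

```java
// Rule 1: the store to x inside the constructor may not be reordered
// past the later store that publishes the object through sharedRef
// (conceptually: x store; StoreStore; sharedRef store).
// Rule 2: a reader's load of sharedRef may not be reordered with its
// subsequent load of sharedRef.x.
class Point {
    final int x;
    Point(int x) { this.x = x; } // store of a final field
}

class Publication {
    static Point sharedRef; // deliberately non-volatile

    static void publish() {
        sharedRef = new Point(17); // publication store
    }
}
```

Note that the guarantee covers only the final field x itself; non-final fields published this way get no such ordering.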
2. MEMORY BARRIERS
Compilers and processors must both obey reordering rules. No
particular effort is required to ensure that uniprocessors maintain
proper ordering, since they all guarantee "as-if-sequential"
consistency. But on multiprocessors, conformance requires emitting
barrier instructions. Even if a compiler optimizes away a field
access (for example because a loaded value is not used), barriers must
still be generated as if the access were still present. (Although see
below about independently optimizing away barriers.)
Memory barriers are only indirectly related to higher-level notions
described in JMM specs such as Acquire and Release. And memory
barriers are not themselves "synchronization barriers". And memory
barriers are unrelated to the kinds of "write barriers" used in some
garbage collectors.
Memory barrier instructions directly control only the interaction of a
CPU with its cache, with its write-buffer that holds writes waiting to
be flushed to memory, and/or its buffer of waiting loads or
speculatively executed instructions. (However, these effects may lead
to further interaction among caches, main memory and processors.)
A property of memory barriers that takes some getting used to is that
barrier instructions are usually designed to be placed BETWEEN memory
accesses. Despite the names given for barrier instructions on some
processors, the right/best barrier to use depends on the kinds of
accesses they separate. Here's a common categorization of barrier
types that maps pretty well to specific instructions (sometimes
no-ops) on existing processors:
LoadLoad
"Load1; LoadLoad; Load2" ensures that Load1's data are loaded before
data accessed by Load2 and all subsequent load instructions are
loaded. In general, LoadLoad barriers are needed on processors that
perform speculative loads and/or out-of-order processing.
StoreStore
"Store1; StoreStore; Store2" ensures that Store1's data are flushed
to memory before the data associated with Store2 and all subsequent
store instructions are flushed. In general, StoreStore barriers are
needed on processors that do not otherwise guarantee strict ordering of
flushes from write buffers and/or caches to memory.
StoreLoad
"Store1; StoreLoad; Load2" ensures that Store1's data are flushed to
main memory before data accessed by Load2 and all subsequent load
instructions are loaded. In particular, StoreLoad barriers protect
against a subsequent load incorrectly using Store1's data value
rather than that from a more recent store performed by a different
processor. Because of this, a StoreLoad is strictly necessary only
for separating stores from subsequent loads of the SAME
location(s). StoreLoad barriers are needed on all recent
multiprocessors, and are usually the most expensive kind. Part of
the reason they are expensive is that they must disable mechanisms
that ordinarily bypass cache to satisfy loads from write-buffers.
This may be implemented by letting the write buffer fully drain, or
by other equally time-consuming operations.
LoadStore
"Load1; LoadStore; Store2" ensures that Load1's data are loaded
before all data associated with Store2 and subsequent store
instructions are flushed. No processors appear to need LoadStore
barriers, presumably because they must deal with Load/Store
dependencies for cache control anyway.
Since LoadStore is apparently never needed, and so is always a no-op,
the following table summarizes how barriers can be used to implement
the JSR-133 ordering rules:
          NL         NS         VL/ME       VS/MX
NL        -          -          -           (no-op)
NS        -          -          -           StoreStore
VL/ME     LoadLoad   (no-op)    LoadLoad    (no-op)
VS/MX     -          -          StoreLoad   StoreStore
Plus the special final-field rule requiring a StoreStore barrier in
x.finalField = v; StoreStore; sharedRef = x;
On all processors, it turns out that instructions that perform
StoreLoad also obtain the effects of both StoreStore and LoadLoad. So
StoreLoad can serve as a sort of universal (but usually expensive)
barrier. Note though that the opposite doesn't usually hold. It is NOT
in general the case that issuing both a StoreStore and LoadLoad gives
the equivalent of a StoreLoad.
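The need for StoreLoad shows up in Dekker-style code, where each thread stores one volatile and then loads a different one. This sketch just shows the pattern; the interleaving that requires the barrier only arises with two concurrent threads:

```java
// Per cell [VS, VL], a StoreLoad barrier must separate each volatile
// store from the following volatile load of a different location.
// Without it, both threads' stores could be sitting in write buffers
// while their loads read the old values, so both could observe false.
class Dekker {
    volatile boolean a, b;

    boolean thread1() {
        a = true;   // volatile store
        // StoreLoad barrier conceptually issued here
        return b;   // volatile load of a DIFFERENT location
    }

    boolean thread2() {
        b = true;
        // StoreLoad barrier conceptually issued here
        return a;
    }
}
```

Run sequentially the results are trivially determined; run concurrently, JSR-133 guarantees that at least one of the two calls returns true, and that guarantee is precisely what the StoreLoad barrier buys.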
3. MULTIPROCESSORS
Here's a listing of processors that are commonly used in MPs, along
with URLs where I found most of the information (some require some
clicking around from the listed site). This isn't an exhaustive list,
but it includes those used in all current and near-future
multiprocessor Java implementations I know of. The list and the
properties of processors described below are not definitive. In some
cases I'm just reporting what I read, and could have misread. Please
help make it definitive.
sparc-TSO
Ultrasparc 1, 2, 3 (sparcv9) in TSO (Total Store Order) mode
(Ultra3 only supports TSO. RMO in Ultra1/2 is never used so can be ignored.)
UltraSPARC(R) III Cu User's Manual
http://www.sun.com/processors/manuals/index.html
x86-PO
486, Pentium, P2, P3, Athlon and others that do NOT support
"SSE2" instructions.
Intel calls consistency properties for these "Processor Ordering" (PO)
The IA-32 Intel(R) Architecture Software
Developers Manual, Volume 3: System Programming Guide
http://developer.intel.com/design/pentium4/manuals/245472.htm
x86-SPO
P4, P4 with hyperthreading, Xeon, Opteron, and others that
DO support "SSE2" instructions.
Intel calls consistency properties for these "Speculative
Processor Ordering" (SPO)
See above, plus:
AMD x86-64 Architecture Programmer's Manual Volume 2: System Programming
http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_875_7044,00.html
ia64
Itanium
Intel(R) Itanium(R) Architecture Software Developer's Manual, Volume 2: System Architecture, Revision 2.1
http://developer.intel.com/design/itanium/downloads/245318.htm
alpha
21264x and I think all others
Alpha Architecture Handbook
http://www.support.compaq.com/alpha-tools/documentation/current/chip-docs.html
ppc
6xx, 7xx, 7xxx
MPC603e RISC Microprocessor Users Manual
http://www.motorola.com/PowerPC/
MPC7410/MPC7400 RISC Microprocessor Users Manual
http://www.motorola.com/PowerPC/
PowerPC Microprocessor Family Software Reference Manual
http://www-3.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600719DF2
ppc64
64bit POWER4 systems
I can't locate an online reference manual. For discussion of barriers see:
http://www-1.ibm.com/servers/esdd/articles/power4_mem.html
http://www-1.ibm.com/servers/esdd/articles/powerpc.html
eppc
"Book-E" enhanced powerpc / PowerPC-440 / Motorola-e500 "G5"
Book E- Enhanced PowerPC Architecture
http://www-3.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600682CC7
EREF: A Reference for Motorola Book E and the e500 Core
http://e-www.motorola.com/webapp/sps/site/overview.jsp?nodeId=03M943030450467M0ys3k3KQ
4. BARRIER INSTRUCTIONS
Here's how JMM reordering rules translate to barrier instructions on
different processors:
Processor LoadLoad StoreStore StoreLoad
sparc-TSO no-op no-op membar(StoreLoad)
x86-PO no-op no-op lock-prefixed-op/cpuid
x86-SPO lfence no-op mfence
ia64 ld.acq st.rel mf
alpha mb wmb mb
ppc isync eieio/sync sync
ppc64 isync eieio/lwsync sync
eppc isync mbar/msync msync
Notes:
* Some of the listed instructions have stronger barrier properties
than actually needed in the indicated cells, but seem to be the
cheapest way to get desired effects.
* The listed instructions are those designed for use with normal
program memory, but not necessarily other special forms/modes of
caching and memory used for IO and system tasks. For example, on
x86-SPO, StoreStore barriers ("sfence") are needed with
WriteCombining (WC) caching mode. But this mode is designed for
use in system-level bulk transfers etc. OSes use Writeback mode
for programs and data, which doesn't require StoreStore barriers.
* Some forms of some barriers on some processors have side effects
on registers or status flags that may have the effect here of
increasing their effective cost.
* On x86 (both PO and SPO), any lock-prefixed instruction can be
used as a StoreLoad barrier. The cpuid instruction also works but
is slower. On x86-SPO, mfence is usually better than either unless
a lock-prefixed instruction like CAS is needed anyway.
* On ia64, LoadLoad and StoreStore barriers are folded into special
forms of load and store instructions -- there aren't separate
LoadLoad or StoreStore instructions. ld.acq acts as (load;
LoadLoadBarrier) and st.rel acts as (StoreStoreBarrier; store).
Note that neither of these provide a StoreLoad barrier -- you
need a separate mf instruction for that.
* All ppcs have the same basic (weak) consistency properties. Newer
ones seem to differ mainly in providing somewhat cheaper barrier
instructions (as well as further options for special caching modes
that aren't used for program data). I don't know when (if ever)
each of the options listed for StoreStore in ppcs is best, but I
think it is the first one listed in each cell.
5. INTERACTIONS WITH SYNCHRONIZATION
The kinds of barriers needed on different processors further interact
with the implementation of MonitorEnter and MonitorExit. Locking and
unlocking usually entails the use of atomic operations compareAndSwap
(CAS) or LoadLinked/StoreConditional (LL/SC) that have the semantics
of performing a volatile load followed by a volatile store. On all
processors, atomic operations protect against read-after-write
problems for the locations being conditionally updated. (Otherwise
standard loop-until-success constructions wouldn't work in the desired
way.) However, they differ in whether they provide more general
barrier properties as well. On some processors these instructions also
intrinsically perform barriers that would otherwise be needed for
MonitorEnter/Exit; on others some or all of these barriers must be
specifically issued.
Volatiles and Monitors have to be separated to disentangle these
effects, giving:
        NL         NS        VL         VS          ME          MX
NL      -          -         -          (no-op)     -           (no-op)
NS      -          -         -          StoreStore  -           StoreExit
VL      LoadLoad   (no-op)   LoadLoad   (no-op)     LoadEnter   (no-op)
VS      -          -         StoreLoad  StoreStore  StoreEnter  StoreExit
ME      EnterLoad  (no-op)   EnterLoad  (no-op)     EnterEnter  (no-op)
MX      -          -         ExitLoad   ExitStore   ExitEnter   ExitExit
Plus the special final-field rule requiring a StoreStore barrier in
x.finalField = v; StoreStore; sharedRef = x;
In this table, "Enter" is the same as "Load" and "Exit" is the same as
"Store", unless overridden by the use and nature of atomic
instructions. In particular:
EnterLoad is the same as LoadLoad unless an atomic instruction is
used in MonitorEnter and itself provides a barrier with at least
the properties of LoadLoad, in which case it is a no-op.
StoreExit is the same as StoreStore unless an atomic instruction
is used in MonitorExit and itself provides a barrier with at least
the properties of StoreStore, in which case it is a no-op.
The types other than EnterLoad and StoreExit are unlikely to play a role
in compilation (see below) but are listed for completeness.
Java-level access to atomic conditional update operations will be
available in JDK1.5 via JSR-166 (see
http://gee.cs.oswego.edu/dl/concurrency-interest/) so compilers will
need to issue associated code, using a variant of the above table that
collapses ME and MX, and requires that they be implemented using
atomic instructions.
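For example, using the java.util.concurrent.atomic classes arriving with JSR-166, a loop-until-success conditional update looks like the following sketch (the helper class is illustrative, not part of the API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Loop-until-success conditional update via CAS. Each get() acts as a
// volatile load, and each compareAndSet() acts as a volatile load
// followed by a volatile store, so compilers must issue the barriers
// called for by the collapsed-ME/MX variant of the table above.
class CasLoop {
    static int incrementAndGet(AtomicInteger v) {
        for (;;) {
            int cur = v.get();                  // volatile load
            if (v.compareAndSet(cur, cur + 1))  // atomic conditional store
                return cur + 1;                 // success: done
            // failure: another thread won the race; retry
        }
    }
}
```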
Here's how atomic operations are supported on multiprocessors:
Processor Flavor Instructions Barrier?
sparc-TSO CAS casa full
x86-PO CAS cmpxchg full
x86-SPO CAS cmpxchg full
ia64 CAS cmpxchg target + acq or rel
alpha LL/SC ldx_l/stx_c target only
ppc LL/SC ldarx/stwcx target only
ppc64 LL/SC ldarx/stwcx target only
eppc LL/SC ldarx/stwcx target only
Notes:
* On sparc and x86, CAS has implicit preceding and trailing full
StoreLoad barriers. The sparcv9 architecture manual says CAS need
not have post-StoreLoad barrier property, but the chip manuals
indicate that it does on ultrasparcs.
* On ppc and alpha, LL/SC have implicit barriers only with respect
to the locations being loaded/stored, but don't have more general
barrier properties.
* The ia64 cmpxchg instruction also has implicit barriers with
respect to the locations being loaded/stored, but additionally
takes an optional .acq (post-LoadLoad) or .rel (pre-StoreStore)
modifier. For example, cmpxchg.acq can be used for MonitorEnter,
and cmpxchg.rel for MonitorExit. There is not a form providing
StoreLoad.
* Some processors also have additional atomic instructions, for
example fetch-and-add on x86 and ia64, that typically have the
same memory properties as CAS.
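Fetch-and-add as surfaced through JSR-166 can be sketched as below (on x86 this can map to a lock-prefixed xadd, with the same memory properties as CAS; the demo class is purely illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// getAndAdd atomically returns the old value and stores old + delta,
// as a single read-modify-write with CAS-like memory semantics.
class FetchAndAdd {
    static int[] demo() {
        AtomicInteger counter = new AtomicInteger(5);
        int prev = counter.getAndAdd(3);        // atomic read-modify-write
        return new int[] { prev, counter.get() };
    }
}
```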
JSR-133 also addresses a few other issues that may entail barriers in
more specialized cases involving synchronization:
* Thread.start() and Thread exit require barriers that are
normally generated by the synchronization entailed in
implementations of these constructs.
* Static final initialization may require barriers that are
normally needed anyway to obey Java class loading and
initialization rules.
* Ensuring default zero/null initial field values normally
entails barriers, synchronization, and/or low-level
cache control within garbage collectors.
* Calls to and returns from JNI routines may require barriers,
although this seems to be a quality of implementation issue.
* Most processors have other synchronizing instructions designed
primarily for use with IO and OS actions. These don't impact JMM
issues directly, but may be involved in IO, class loading, and
dynamic code generation.
6. BARRIER RECIPES
Since barrier instructions apply between different kinds of accesses,
finding an "optimal" placement that minimizes the total number of
executed barriers is all but impossible. Compilers often cannot tell
if a given load or store will be preceded or followed by another that
requires a barrier; for example, when a volatile store is followed by
a return. The easiest conservative strategy is to assume, when
generating code for any given load or store, that the kind of access
requiring the "heaviest" kind of barrier will occur:
0. If on a uniprocessor:
* Don't generate barrier instructions, unless object memory is
somehow shared with asynchronously accessible IO memory (or,
conceivably, if context switching doesn't entail sufficient
synchronization).
1. On all processors (since all require StoreLoad barriers):
* Issue a StoreLoad barrier after each volatile store.
(Note that you could instead issue one before each volatile
load, but this would be much slower for typical programs
using volatiles in which reads greatly outnumber writes.)
* ALTERNATIVELY, implement volatile store as CAS or LL/SC
(looping until success) and omit the barrier. This may be more
efficient if these instructions are cheaper than StoreLoad
barriers.
2. If the processor requires LoadLoad barriers:
* Issue a LoadLoad barrier after each volatile load
(Here and in remaining steps, on ia64 you need to instead fold
barriers into corresponding load or store instructions.)
* Issue an EnterLoad (i.e., LoadLoad) barrier after each MonitorEnter
unless the implementation of MonitorEnter uses an atomic
instruction that supplies such a barrier.
3. If the processor requires StoreStore barriers:
* Issue a StoreStore barrier before each volatile store
* Issue a StoreExit (i.e. StoreStore) barrier before each
MonitorExit unless the implementation of MonitorExit uses an
atomic instruction that supplies such a barrier.
* Either issue a StoreStore barrier after each store of a final
field in a constructor, or issue one before return from any
constructor for any class with a final field.
For the simplest examples, basic conformance to JSR-133 on x86-PO or
sparc-TSO using CAS for lock and unlock amounts only to placing a
StoreLoad membar after volatile stores.
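The "ALTERNATIVELY" option in step 1 can be sketched in Java terms as a hypothetical wrapper (a JVM would of course emit the CAS loop directly rather than go through such a class):

```java
import java.util.concurrent.atomic.AtomicReference;

// Emulating a volatile store as a loop-until-success CAS, per the
// alternative in step 1. On processors where the atomic instruction
// already carries full barrier properties, this avoids issuing a
// separate StoreLoad barrier after the store.
class CasBackedCell<T> {
    private final AtomicReference<T> cell = new AtomicReference<T>();

    void store(T v) { // volatile-store semantics, no trailing barrier needed
        T old;
        do { old = cell.get(); } while (!cell.compareAndSet(old, v));
    }

    T load() { return cell.get(); } // plain volatile load
}
```

Whether this wins depends entirely on the relative cost of the atomic instruction versus the StoreLoad barrier on the target processor.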
This conservative strategy is likely to perform acceptably for most
programs, but can be improved in at least the following ways:
* Removing redundant barriers. The above tables indicate that
barriers can be eliminated as follows:
LoadLoad [no loads] LoadLoad => [no loads] LoadLoad
LoadLoad [no loads] StoreLoad => [no loads] StoreLoad
StoreStore [no stores] StoreStore => [no stores] StoreStore
StoreStore [no stores] StoreLoad => [no stores] StoreLoad
StoreLoad [no vloads] StoreLoad => [no vloads] StoreLoad
StoreLoad [no loads] LoadLoad => StoreLoad [no loads]
StoreLoad [no stores] StoreStore => StoreLoad [no stores]
Further eliminations may be possible depending on how locks are
implemented.
Doing this in the presence of loops, calls, and branches is left
as an exercise for the reader. :-)
* Specifically optimizing StoreLoads. A StoreLoad is strictly
necessary only between stores and subsequent loads of the SAME
location(s). For example, assuming distinct (unaliased) locations
a, b, c:
store a; StoreLoad; load b; store c; StoreLoad =>
store a; load b; store c; StoreLoad
* Moving the point in the instruction stream that the barriers
are issued, to improve scheduling, so long as they still
occur somewhere in the interval they are required.
* Removing barriers that aren't needed because there is no
possibility that multiple threads could rely on them; for example
volatiles that are provably visible only from a single thread.
This usually requires a fair amount of analysis.
--
Doug Lea, Computer Science Department, SUNY Oswego, Oswego, NY 13126 USA
dl@cs.oswego.edu  315-312-2688  FAX: 315-312-5424  http://gee.cs.oswego.edu/
-------------------------------
JavaMemoryModel mailing list - http://www.cs.umd.edu/~pugh/java/memoryModel