> I think we have consensus that TSO is too weak for volatile variables.
>
> Hans Boehm and Rick Hudson have both expressed some concern that SC
> might be to strong, and want to consider options that allow writes
> from different processors to be seen in different orders on different
> processors. Or something like that. See message below.
>
> Hans and Rick: Time for you guys to put your argument together and
> start making your case.
>
Sorry; I've been trying to fight a large pile of email. I think I'm losing.
In general, I disagree mildly with some of the previous posters, in that I
think the spec has to target moderately sophisticated programmers. I don't
believe we can change the fact that multithreaded programming is nontrivial.
If people write multithreaded code naively, it will be broken. If they fix
it by randomly adding extra synchronization, I suspect it will be slow and
broken.
There should be safe usage rules that are as simple as possible, at the
expense of some performance. But those should be derived from our
specification; they won't be the specification. They will probably require
that volatiles be used at most in very stylized ways. e.g. to publish an
object, and that all other shared accesses be synchronized. For this the
details of the spec hopefully won't matter. If they prevent us from writing
a simple collection of safe usage rules that's a problem. If they make the
output of programs that violate the safe usage rules more puzzling, perhaps
that will get someone to read the rules.
The spec details will affect people writing performance- or security-
sensitive code. I think performance is a major issue.
Getting back to the question here, I think there are two distinct, though
related issues, and I want to make sure we're talking about the same thing:
(Of course, I may already be confused.)
1) Can a volatile write be reordered with a following volatile read as in
TSO. I think this is basically the same question as whether monitor exit
and a following volatile read can be reordered. I currently think this is
observable (in spite of earlier messages to the contrary), but it's probably
not critical either way. Allowing the reordering is cheaper on Itanium, at
least for volatiles.
2) If two threads write two different volatile locations, and if there are
several observer threads, do all the observer threads see the writes in the
same order? To what extent is there a global total order among all the
volatile accesses?
Question 1:
I think you can observe if a volatile read is moved before an unlock
operation. Let y be volatile. Initially x and y are zero.
Thread 1:
x = 1;
unlock;
z = y; //volatile read
Thread 2:
y = 1;
<barrier, e.g. assignment of another volatile to itself>
x = 2;
Can this end with x = 1 and z = 0?
If the reordering of the read is not allowed, then all actions in each
thread are ordered. x = 1 precedes the read of y. Since x = 1 at the end,
it must have followed the x = 2 assignment. Thus the read of y must have
followed the assignment to y.
If we allow the reordering, then we can rewrite thread 1 as x = 1; z = y;
unlock . Sinze z = y is only an acquire barrier, this can be reordered as z
= y; x = 1; unlock . Thus x = 1 and z = 0 becomes a possible outcome.
Did I miss something?
This seems to be eseentially equivalent to reordering a volatile write
followed by a volatile read.
That having been said, I'm still leaning towards allowing the reordering.
Disallowing the ordering seems to require an extra memory barrier on at
least X86 and Itanium. Was there a natural example for which it actually
matters? It sounds like the specification would be simpler if we allowed
it.
Question 2:
I agree that requiring a total order would simplify the specification, and
be more intutive. But I'm really uncomfortable about doing so, since I
don't have much of an idea what it would really take to implement a total
order.
The IA64 spec states that "The IA-64 memory ordering model requires that
release and semaphore operations ... become visible to all observers in the
coherence domain in a single total order with the exception that each
processor may observe (via loads or acquire loads) its own update early." I
think the exception at the end still requires a full memory barrier after
each volatile store, just to avoid the danger of a subsequent local read
from the store buffer.
Based on the IA64 document, IA32 does not appear to require a total order at
all: "Further, IA-32 allows each processor in the coherence domain to
interleave the reference stream from other processors in the coherence
domain in a different order."
I have no idea what the situation is on other architectures. Nor is it
obvious to me that it is always fixable, e.g. by adding barriers. It's
certainly possible to conceive of memory barrier instructions that only
force ordering between memory references from a single processor, and say
nothing about a global ordering.
My intuition is also that this is hard to implement in a NUMA machine. I'd
love to hear confirmations from hardware architects that it's not. (I've
heard a rumor that at least one such machine was supposed to be sequentially
consistent in it's early design stages, but didn't quite turn out to impose
a total ordering in the end. This worries me more.)
I'd support a total order if someone could convince me that it is always
efficiently implementable. I'd consider it more seriously if I were
convinced that it was always implementable at all. As it stands, I'm not
convinced this is a feasible approach. But we clearly need to understand
the issue better.
General observation:
Intel's documentation for IA-64 states that C and C++ volatile variables
whould be implement with ld.acq and st.rel. I believe this implementation
is mostly consistent with the semantics of C/C++ volatiles on IA32, given
the IA32 memory model. It sounds like it's also very similar what you get
with volatiles on SPARC/TSO or MIPS? In the absence of strong arguments to
the contrary, I'd be inclined aim for as much consistency with this as
possible. Most of us will need to deal with the (admittedly very
ill-defined) Intel/C/C++ model anyway, and it's hard enough to teach people
one such memory model.
(I believe the Alpha C/C++ or PowerPC C/C++ model is actually quite
different. The compiler issues regular loads and stores in the right order,
but the hardware has weaker consistency constraints?)
Hans
-------------------------------
JavaMemoryModel mailing list - http://www.cs.umd.edu/~pugh/java/memoryModel
This archive was generated by hypermail 2b29 : Thu Oct 13 2005 - 07:00:33 EDT