2265. 29.3p9 appears to rule out some acceptable executions

Section: 32.5.4 [atomics.order] Status: Open Submitter: Brian Demsky Opened: 2013-06-17 Last modified: 2016-01-28

Priority: 4

Discussion:

I believe that the following variation on IRIW should admit executions in which c1 = d1 = 5 and c2 = d2 = 0. If this is allowed, then what is the sequence of program evaluations required by 32.5.4 [atomics.order] p9 that justifies the store to z? It seems that 32.5.4 [atomics.order] p9 should not allow this execution: one of the stores to x or y has to appear earlier in the sequence, each of the fetch_adds uses the value read by the previous load in its thread (and thus must appear later in the sequence than that load), and 32.5.4 [atomics.order] p9 states that each load must read from the last prior assignment in the sequence.

atomic_int x;
atomic_int y;
atomic_int z;
int c1, c2, d1, d2;

static void a(void* obj)
{
  atomic_store_explicit(&x, 5, memory_order_relaxed); 
}

static void b(void* obj)
{
  atomic_store_explicit(&y, 5, memory_order_relaxed); 
}

static void c(void* obj)
{
  c1 = atomic_load_explicit(&x, memory_order_relaxed);
  // this could also be an atomic load if the address depends on c1:
  c2 = atomic_fetch_add_explicit(&y, c1, memory_order_relaxed);  
}

static void d(void* obj)
{
  d1 = atomic_load_explicit(&y, memory_order_relaxed);
  d2 = atomic_fetch_add_explicit(&x, d1, memory_order_relaxed); 
}

int user_main(int argc, char** argv)
{
  thrd_t t1, t2, t3, t4;

  atomic_init(&x, 0);
  atomic_init(&y, 0);

  printf("Main thread: creating 4 threads\n");
  thrd_create(&t1, (thrd_start_t)&a, NULL);
  thrd_create(&t2, (thrd_start_t)&b, NULL);
  thrd_create(&t3, (thrd_start_t)&c, NULL);
  thrd_create(&t4, (thrd_start_t)&d, NULL);

  thrd_join(t1);
  thrd_join(t2);
  thrd_join(t3);
  thrd_join(t4);
  printf("c1=%d c2=%d\n",c1,c2);
  printf("d1=%d d2=%d\n",d1,d2);

  // Can this store write 1000 (i.e., c1=d1=5, c2=d2=0)?
  atomic_store(&z, (c1+d1)*100+c2+d2);

  printf("Main thread is finished\n");

  return 0;
}
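
The argument above can be checked mechanically. The following is a minimal sketch, not part of the original submission: it enumerates every total order of the six atomic evaluations, replays each order under the strict reading that every read returns the last prior assignment in the sequence, and counts the orders that produce c1 = d1 = 5 and c2 = d2 = 0. The labels SX, SY, LC, FC, LD, FD and all function names are hypothetical, introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels: SX/SY are the stores in a() and b(), LC/FC the load
   and fetch_add in c(), LD/FD the load and fetch_add in d(). */
enum { SX, SY, LC, FC, LD, FD, N_EV };

/* Replay one total order under the strict reading: every read returns the
   value of the last prior assignment in the sequence. */
static bool yields_target(const int order[N_EV])
{
    int x = 0, y = 0;                     /* values set by atomic_init */
    int c1 = -1, c2 = -1, d1 = -1, d2 = -1;
    bool lc_done = false, ld_done = false;

    for (int i = 0; i < N_EV; ++i) {
        switch (order[i]) {
        case SX: x = 5; break;
        case SY: y = 5; break;
        case LC: c1 = x; lc_done = true; break;
        case LD: d1 = y; ld_done = true; break;
        case FC:
            if (!lc_done) return false;   /* operand c1 not loaded yet */
            c2 = y; y += c1; break;
        case FD:
            if (!ld_done) return false;   /* operand d1 not loaded yet */
            d2 = x; x += d1; break;
        }
    }
    return c1 == 5 && d1 == 5 && c2 == 0 && d2 == 0;
}

/* Generate all permutations by swapping; count those that hit the target. */
static int permute(int order[N_EV], int k)
{
    assert(k >= 0 && k <= N_EV);
    if (k == N_EV) return yields_target(order) ? 1 : 0;

    int hits = 0;
    for (int i = k; i < N_EV; ++i) {
        int t = order[k]; order[k] = order[i]; order[i] = t;
        hits += permute(order, k + 1);
        t = order[k]; order[k] = order[i]; order[i] = t;   /* undo swap */
    }
    return hits;
}

int count_strict_witnesses(void)
{
    int order[N_EV] = { SX, SY, LC, FC, LD, FD };
    return permute(order, 0);
}
```

Under these assumptions `count_strict_witnesses()` returns 0: no total order read this strictly justifies the outcome, which matches the reasoning given in the discussion.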

It seems that the easiest fix is to allow a load in 32.5.4 [atomics.order] p9 to read from any prior store in the evaluation order.
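
To illustrate what that relaxation would permit, here is a minimal sketch (not from the original submission) that checks one hand-picked candidate sequence for the example. The event labels, the tables, and the function names are hypothetical. The sketch only checks the sequence condition discussed here and deliberately ignores every other constraint in the standard, such as modification order.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels; the enum order IS the candidate sequence:
   both initializations, both stores, both loads, then both fetch_adds. */
enum { INIT_X, INIT_Y, SX, SY, LC, LD, FC, FD, N_EV };
enum { LOC_X, LOC_Y, NO_LOC };

static const int written_loc[N_EV] = {
    [INIT_X] = LOC_X, [INIT_Y] = LOC_Y, [SX] = LOC_X, [SY] = LOC_Y,
    [LC] = NO_LOC, [LD] = NO_LOC, [FC] = LOC_Y, [FD] = LOC_X
};
static const int read_loc[N_EV] = {
    [INIT_X] = NO_LOC, [INIT_Y] = NO_LOC, [SX] = NO_LOC, [SY] = NO_LOC,
    [LC] = LOC_X, [LD] = LOC_Y, [FC] = LOC_Y, [FD] = LOC_X
};
/* Chosen reads-from relation: the loads see the stores of 5, while the
   fetch_adds see the initial zeros. -1 marks evaluations that read nothing. */
static const int reads_from[N_EV] = {
    [INIT_X] = -1, [INIT_Y] = -1, [SX] = -1, [SY] = -1,
    [LC] = SX, [LD] = SY, [FC] = INIT_Y, [FD] = INIT_X
};
static const int stored_value[N_EV] = {
    [INIT_X] = 0, [INIT_Y] = 0, [SX] = 5, [SY] = 5,
    [LC] = 0, [LD] = 0, [FC] = 5, [FD] = 5   /* fetch_add results: 0 + 5 */
};

/* Relaxed rule: each read takes its value from SOME earlier store to the
   same location in the sequence. */
bool reads_some_prior_store(void)
{
    for (int e = 0; e < N_EV; ++e) {
        if (read_loc[e] == NO_LOC) continue;
        int w = reads_from[e];
        assert(w >= 0 && w < N_EV);
        if (w >= e || written_loc[w] != read_loc[e]) return false;
    }
    return true;
}

/* Strict rule: additionally, no other store to that location may sit
   between the chosen store and the read. */
bool reads_last_prior_store(void)
{
    if (!reads_some_prior_store()) return false;
    for (int e = 0; e < N_EV; ++e) {
        if (read_loc[e] == NO_LOC) continue;
        for (int k = reads_from[e] + 1; k < e; ++k)
            if (written_loc[k] == read_loc[e]) return false;
    }
    return true;
}

/* Value stored to z for the chosen reads-from relation. */
int witness_z(void)
{
    int c1 = stored_value[reads_from[LC]], c2 = stored_value[reads_from[FC]];
    int d1 = stored_value[reads_from[LD]], d2 = stored_value[reads_from[FD]];
    return (c1 + d1) * 100 + c2 + d2;
}
```

For this candidate, `reads_some_prior_store()` holds, `reads_last_prior_store()` does not (the fetch_add on y skips over the store of 5 to y), and `witness_z()` evaluates to 1000.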

That said, I would personally advocate the following: It seems to me that C/C++ atomics are in a somewhat different situation than Java because:

  1. People are expected to use relaxed C++ atomics in potentially racy situations, so it isn't clear that semantics as complicated as the JMM's causality would be sane.

  2. People who use C/C++ atomics are likely to be experts and use them in a very controlled fashion. I would be really surprised if compilers would find any real wins by optimizing the use of atomics.

Why not do something like:

There is a satisfaction DAG of all program evaluations. Each evaluation observes the values of variables as computed by some prior assignment in the DAG.

There is an edge x->y between two evaluations x and y if:

  1. the evaluation y observes a value computed by the evaluation x or

  2. the evaluation y is an atomic store, the evaluation x is an atomic load, and there is a conditional branch c that may depend (intrathread dependence) on x and x-sb->c and c-sb->y.

This seems to allow reordering of relaxed atomics that processors do without extra fence instructions, allows most reorderings by the compiler, and gets rid of satisfaction cycles.
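
The two edge rules can be sketched as a small graph check. This is a minimal illustration under stated assumptions, not a proposed implementation: the `sat_graph` type, the helper names, and both sample executions are hypothetical. The second sample (two threads that each load one variable and conditionally store the loaded value to the other) is a textbook-style satisfaction cycle added here for contrast; it does not appear in the issue.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_EV 16

typedef struct {
    int  n;                          /* number of evaluations */
    bool edge[MAX_EV][MAX_EV];       /* edge[a][b] means a -> b */
} sat_graph;

static void graph_init(sat_graph *g, int n)
{
    assert(n > 0 && n <= MAX_EV);
    memset(g, 0, sizeof *g);
    g->n = n;
}

static void add_edge(sat_graph *g, int from, int to)
{
    assert(from >= 0 && from < g->n && to >= 0 && to < g->n);
    g->edge[from][to] = true;
}

/* Kahn-style check: repeatedly remove nodes with no incoming edges.
   The graph is a DAG exactly when every node can be removed. */
static bool is_acyclic(const sat_graph *g)
{
    int  indeg[MAX_EV] = { 0 };
    bool removed[MAX_EV] = { false };
    int  removed_count = 0;

    for (int a = 0; a < g->n; ++a)
        for (int b = 0; b < g->n; ++b)
            if (g->edge[a][b]) ++indeg[b];

    for (bool progress = true; progress; ) {
        progress = false;
        for (int v = 0; v < g->n; ++v) {
            if (removed[v] || indeg[v] != 0) continue;
            removed[v] = true; ++removed_count; progress = true;
            for (int w = 0; w < g->n; ++w)
                if (g->edge[v][w]) --indeg[w];
        }
    }
    return removed_count == g->n;
}

/* The execution from this issue: only rule 1 (observes-a-value) edges. */
bool issue_example_is_acyclic(void)
{
    enum { INIT_X, INIT_Y, SX, SY, LC, FC, LD, FD, N };
    sat_graph g;
    graph_init(&g, N);
    add_edge(&g, SX, LC);      add_edge(&g, SY, LD);      /* loads see 5 */
    add_edge(&g, LC, FC);      add_edge(&g, LD, FD);      /* operands c1, d1 */
    add_edge(&g, INIT_Y, FC);  add_edge(&g, INIT_X, FD);  /* fetch_adds see 0 */
    return is_acyclic(&g);
}

/* Contrast case (assumed, not from the issue): each thread loads a value,
   branches on it, and stores to the variable the other thread loads. */
bool thin_air_example_is_acyclic(void)
{
    enum { L1, W1, L2, W2, N };
    sat_graph g;
    graph_init(&g, N);
    add_edge(&g, W2, L1);  add_edge(&g, W1, L2);   /* rule 1: reads-from */
    add_edge(&g, L1, W1);  add_edge(&g, L2, W2);   /* rule 2: load -> store */
    return is_acyclic(&g);
}
```

With these edge sets the execution from this issue forms a DAG, so it would be admitted, while the contrast case contains the cycle L1 -> W1 -> L2 -> W2 -> L1 and would be rejected.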

[2015-02 Cologne]

Handed over to SG1.

[2015-05 Lenexa, SG1 response]

This was partially addressed (weasel-worded) in C++14 (See N3786). The remainder is an open research problem. N3710 outlines a "solution" that doesn't have a consensus behind it because it costs performance. We have no better solution at the moment.

Proposed resolution: