Memory Fences & Barriers

Relaxed memory models allow the hardware to reorder operations for performance. When ordering matters (e.g., publishing data to other threads), the programmer inserts a fence (or barrier) to prevent specific reorderings.

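A fence essentially says: “memory operations before this point must become visible to other cores before memory operations after it.” Weaker fences restrict that guarantee to particular kinds of operation, such as stores only.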
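As a minimal sketch of that idea in portable C++, the example below publishes a plain variable using standalone fences around relaxed atomic accesses. The names `payload`, `ready`, `writer`, `reader`, and `run_fence_demo` are hypothetical and chosen only for illustration.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain (non-atomic) data being published
std::atomic<bool> ready{false};  // flag, accessed with relaxed operations only

// Writer: the release fence keeps the payload write from being reordered
// after the flag store that follows the fence.
void writer() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Reader: the acquire fence keeps the payload read from being reordered
// before the flag load that precedes the fence.
int reader() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Runs one writer and one reader; returns the value the reader observed.
int run_fence_demo() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread r([&seen] { seen = reader(); });
    std::thread w(writer);
    w.join();
    r.join();
    return seen;
}
```

Once the reader has seen `ready == true` and passed its acquire fence, it is guaranteed to read 42 from `payload`. Without the two fences, the relaxed flag accesses alone would give no such guarantee.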

x86-TSO already preserves most orderings. The fences available:

| Instruction | Effect |
| --- | --- |
| MFENCE | Full fence: orders all loads and stores |
| SFENCE | Store fence: orders stores only |
| LFENCE | Load fence: orders loads only (rarely needed on x86) |
| LOCK prefix | Implicit full fence (e.g., LOCK XADD) |

In practice, most x86 synchronization uses LOCK-prefixed atomic instructions rather than standalone fences.
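To illustrate, here is a small sketch of a shared counter built on an atomic read-modify-write. The function name `count_with_atomics` is hypothetical. On x86-64, mainstream compilers typically emit a LOCK-prefixed instruction for `fetch_add`, though the exact instruction depends on the compiler and on whether the result is used.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each fetch_add is one atomic read-modify-write. On x86-64 it is typically
// compiled to a LOCK-prefixed instruction (assumption: compiler-dependent),
// so no separate MFENCE is needed around it.
long count_with_atomics(int num_threads, int increments_per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_seq_cst);
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

Calling `count_with_atomics(4, 10000)` returns 40000 because no increment can be lost.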

ARM has a relaxed model, so fences are critical:

| Instruction | Scope |
| --- | --- |
| DMB ISH | Full barrier (inner shareable domain) |
| DMB ISHST | Store barrier |
| DMB ISHLD | Load barrier |
| LDAR | Load-acquire (barrier after the load) |
| STLR | Store-release (barrier before the store) |

LDAR/STLR are usually cheaper than DMB because each is a one-way barrier attached to a single load or store, whereas DMB is a standalone barrier that orders all surrounding accesses in both directions.

RISC-V uses a flexible fence instruction:

fence [predecessor], [successor]

where the predecessor and successor sets are subsets of {r, w, i, o} (memory read, memory write, device input, device output):

| Example | Meaning |
| --- | --- |
| fence rw, rw | Full fence |
| fence w, w | Store fence |
| fence r, r | Load fence |
| fence w, r | Drain store buffer before loads (like x86 MFENCE) |
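The store-to-load case in the last row is the classic "store buffering" litmus test. The sketch below expresses it in portable C++ with a sequentially consistent fence. All names (`x`, `y`, `r0`, `r1`, `thread0`, `thread1`, `run_store_buffering_once`) are hypothetical. How the fence is lowered is compiler-dependent: typically MFENCE or a LOCK-prefixed instruction on x86, and `fence rw, rw` on RISC-V.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0 = 0, r1 = 0;  // results, read only after both threads have joined

void thread0() {
    x.store(1, std::memory_order_relaxed);
    // Full fence: the store above must be ordered before the load below.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r0 = y.load(std::memory_order_relaxed);
}

void thread1() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = x.load(std::memory_order_relaxed);
}

// Runs the litmus test once. Returns true if both threads read 0, the
// outcome the two fences are meant to forbid.
bool run_store_buffering_once() {
    x.store(0);
    y.store(0);
    std::thread a(thread0);
    std::thread b(thread1);
    a.join();
    b.join();
    return r0 == 0 && r1 == 0;
}
```

With both fences in place, at least one thread must observe the other's store. Removing them allows `r0 == 0 && r1 == 0`, even on x86, because each store can sit in a store buffer while the following load executes.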

C++, Rust, and Java provide memory-order annotations that the compiler maps to the appropriate hardware instructions or fences:

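| Order | Meaning | x86 cost | ARM cost |
| --- | --- | --- | --- |
| relaxed | No ordering, just atomicity | Free | Free |
| acquire | Later reads/writes cannot be reordered before this load | Free (TSO) | LDAR |
| release | Earlier reads/writes cannot be reordered after this store | Free (TSO) | STLR |
| acq_rel | Both acquire and release | Free (TSO) | LDAR/STLR |
| seq_cst | Full sequential consistency | LOCK or MFENCE on stores | LDAR/STLR on AArch64; DMB ISH for a standalone fence |
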
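
One common place where `relaxed` and `acq_rel` appear together is reference counting. The sketch below is only an illustration under that assumption; `RefCounted`, `add_ref`, and `drop_ref` are hypothetical names, not part of any library.

```cpp
#include <atomic>

// Hypothetical intrusively reference-counted object.
struct RefCounted {
    std::atomic<int> refs{1};  // starts with one owner
    int value = 0;
};

// Taking another reference needs atomicity but no ordering: relaxed.
void add_ref(RefCounted* obj) {
    obj->refs.fetch_add(1, std::memory_order_relaxed);
}

// Dropping a reference uses acq_rel: the release half publishes this
// owner's writes to the object, and the acquire half lets the last owner
// see every other owner's writes before it deletes the object.
// Returns true if this call destroyed the object.
bool drop_ref(RefCounted* obj) {
    if (obj->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete obj;
        return true;
    }
    return false;
}
```

A short usage example: after `auto* obj = new RefCounted(); add_ref(obj);`, the first `drop_ref(obj)` returns false and the second returns true and frees the object.
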
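
A spinlock acquires on entry and releases on exit:

```cpp
std::atomic<int> lock{0};  // 0 = unlocked, 1 = locked

// Acquire the lock: spin until the previous value was 0
while (lock.exchange(1, std::memory_order_acquire) == 1) {}
// Critical section: all accesses here are ordered after the acquire
// ...
// Release the lock: all accesses above are ordered before this store
lock.store(0, std::memory_order_release);
```
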
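For a version that can be compiled and run, the same acquire/release pattern can be wrapped in a small class. This is a minimal sketch, not a production lock: `SpinLock` and `count_with_spinlock` are hypothetical names, and the `yield()` in the spin loop is an added courtesy to the scheduler rather than part of the pattern.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic<int> state_{0};  // 0 = unlocked, 1 = locked
public:
    void lock() {
        // Acquire: accesses in the critical section cannot move above this.
        while (state_.exchange(1, std::memory_order_acquire) == 1) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release: accesses in the critical section cannot move below this.
        state_.store(0, std::memory_order_release);
    }
};

// Increments a plain (non-atomic) counter from several threads, relying
// only on the spinlock for mutual exclusion and visibility.
long count_with_spinlock(int num_threads, int increments_per_thread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&lock, &counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                std::lock_guard<SpinLock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Because `SpinLock` provides `lock()` and `unlock()`, it works with `std::lock_guard`. `count_with_spinlock(4, 10000)` returns 40000 when the lock is correct.
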
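
Message passing between a producer and a consumer:

```cpp
int data;                  // plain, non-atomic payload
std::atomic<int> flag{0};  // 0 = not published, 1 = published

// Producer (Core 0)
data = compute();                          // write data
flag.store(1, std::memory_order_release);  // publish

// Consumer (Core 1)
while (flag.load(std::memory_order_acquire) == 0) {}  // wait
use(data);  // guaranteed to see producer's write to data
```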

The release-acquire pair creates a happens-before relationship: once the acquire load reads the value written by the release store, every write made before the release is visible after the acquire.

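```cpp
#include <atomic>
#include <mutex>

std::atomic<Singleton*> instance{nullptr};
std::mutex mtx;

Singleton* get_instance() {
    // Fast path: acquire pairs with the release store below
    auto* p = instance.load(std::memory_order_acquire);
    if (!p) {
        std::lock_guard lock(mtx);
        // Second check under the mutex; the mutex already provides ordering
        p = instance.load(std::memory_order_relaxed);
        if (!p) {
            p = new Singleton();
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```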

Without proper memory orders, another thread could see the pointer before the constructor’s writes are visible — a classic bug on ARM.
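To try the pattern out, the sketch below repeats `get_instance()` with a stand-in `Singleton` type and calls it from several threads. The `Singleton` definition, its `config` field, the `construction_count` counter, and `all_threads_see_one_instance` are assumptions added for the demonstration. A passing run on one machine does not prove the memory orders are right; it only shows the intended behaviour.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::atomic<int> construction_count{0};  // how many times the constructor ran

// Stand-in type; the constructor's write to `config` is what must be
// visible to every thread that obtains the pointer.
struct Singleton {
    int config;
    Singleton() : config(7) {
        construction_count.fetch_add(1, std::memory_order_relaxed);
    }
};

std::atomic<Singleton*> instance{nullptr};
std::mutex mtx;

Singleton* get_instance() {
    auto* p = instance.load(std::memory_order_acquire);
    if (!p) {
        std::lock_guard<std::mutex> lock(mtx);
        p = instance.load(std::memory_order_relaxed);
        if (!p) {
            p = new Singleton();
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}

// Returns true if every thread got the same, fully constructed object and
// the constructor ran exactly once.
bool all_threads_see_one_instance(int num_threads) {
    std::vector<Singleton*> seen(num_threads, nullptr);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&seen, t] { seen[t] = get_instance(); });
    }
    for (auto& w : workers) w.join();
    for (Singleton* p : seen) {
        if (p != seen[0] || p->config != 7) return false;
    }
    return construction_count.load() == 1;
}
```

The singleton is intentionally never deleted here, which matches the usual lifetime of such an object.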

Looking back at our progression:

  1. Cache fundamentals gave us the memory hierarchy and locality concepts
  2. Coherence protocols (MSI/MESI) ensure a single address is consistent across caches
  3. Consistency models define what orderings are allowed across multiple addresses
  4. Fences and barriers let programmers enforce the orderings they need within the guarantees of their target consistency model

Understanding this full stack — from cache lines to memory orders — is essential for writing correct and performant concurrent code.