Directory Protocols

Each cache line in memory has an associated directory entry that tracks:

  • State: whether the line is uncached, shared, or exclusively owned
  • Sharer list: which caches currently hold a copy

For a system with $N$ cores, a common representation is a bit vector:

$$\text{State (2 bits)} + \text{Sharer vector (}N\text{ bits)}$$

For 64 cores, that’s 66 bits per cache line. With 64-byte cache lines and 256 GB of memory, that’s:

$$\frac{256 \times 10^9}{64} \times \frac{66}{8} \approx 33 \text{ GB of directory storage}$$

This is a significant overhead (about 13% of the memory capacity), so real systems use compression techniques.
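The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `directory_bytes` and its parameters are illustrative and not taken from any real system.

```python
def directory_bytes(num_cores, memory_bytes, line_bytes=64, state_bits=2):
    """Size of a full-bit-vector directory: one entry per memory line."""
    num_lines = memory_bytes // line_bytes
    bits_per_entry = state_bits + num_cores  # 2 state bits + N sharer bits
    return num_lines * bits_per_entry / 8


MEMORY = 256 * 10**9  # 256 GB, using decimal gigabytes as in the text
size = directory_bytes(num_cores=64, memory_bytes=MEMORY)
overhead = size / MEMORY

print(f"{size / 1e9:.1f} GB")  # 33.0 GB
print(f"{overhead:.1%}")       # 12.9%
```

Because the sharer vector grows linearly with the core count, doubling `num_cores` to 128 roughly doubles the directory size for the same amount of memory.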

A read miss from Core $i$ to address $A$ is handled as follows:

  1. Core $i$ sends Read(A) to the home node (the directory controller for address $A$)
  2. Directory looks up the state of line $A$:
    • Uncached: fetch from memory, send to Core $i$, mark as Shared, add $i$ to sharer list
    • Shared: send data to Core $i$, add $i$ to sharer list
    • Exclusive/Modified by Core $j$: send an intervention to Core $j$, which forwards the data to Core $i$ (and optionally writes it back to memory); the directory then updates its entry

A write from Core $i$ to a shared line $A$ is handled as follows:

  1. Core $i$ sends Write(A) to the home node
  2. Directory sends invalidation messages to all other sharers in the bit vector
  3. Each sharer invalidates its copy and sends an ack
  4. Once all acks are received, Core $i$ gets exclusive access

The critical path is the invalidation round-trip: the writing core must wait for all sharers to acknowledge before proceeding.
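The read and write handling described above can be sketched as a small state machine. This is a simplified model under several assumptions: the names `State`, `DirectoryEntry`, and `Directory` are hypothetical, a Python set stands in for the sharer bit vector, requests reach the directory only on cache misses, and a previous exclusive owner is assumed to keep a shared copy after an intervention. Messages are appended to a log rather than sent over an interconnect.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    UNCACHED = 0
    SHARED = 1
    EXCLUSIVE = 2


@dataclass
class DirectoryEntry:
    state: State = State.UNCACHED
    sharers: set = field(default_factory=set)  # stands in for the bit vector


class Directory:
    def __init__(self):
        self.entries = {}  # line address -> DirectoryEntry
        self.log = []      # messages the directory would send

    def _entry(self, addr):
        return self.entries.setdefault(addr, DirectoryEntry())

    def read(self, core, addr):
        e = self._entry(addr)
        if e.state is State.UNCACHED:
            self.log.append(("mem_fetch", addr))
        elif e.state is State.EXCLUSIVE:
            (owner,) = e.sharers  # exactly one owner in this state
            self.log.append(("intervention", owner, addr))
        # In every case the requester ends up with the data.
        self.log.append(("data", core, addr))
        e.state = State.SHARED
        e.sharers.add(core)  # assumption: a former owner stays a sharer

    def write(self, core, addr):
        e = self._entry(addr)
        targets = sorted(e.sharers - {core})
        for sharer in targets:
            self.log.append(("invalidate", sharer, addr))
        # The writer may proceed only after one ack per invalidation.
        e.state = State.EXCLUSIVE
        e.sharers = {core}
        return len(targets)


d = Directory()
for core in (0, 1, 2):
    d.read(core, 0x40)
acks = d.write(1, 0x40)
print(acks)  # 2
```

In this run, cores 0 and 2 each receive an invalidation, so the write from core 1 waits for two acks. The return value of `write` makes the critical-path cost visible: it grows with the number of sharers.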

| Technique | How it works | Tradeoff |
|---|---|---|
| Limited pointers | Track only $K$ sharers; broadcast invalidation if more | Saves storage; rare broadcasts |
| Coarse bit vector | One bit per cluster of cores, not per core | Fewer bits; some unnecessary invalidations |
| Sparse directory | Only store entries for cached lines (use a hash table) | Saves memory; lookup overhead |
| Hierarchical directory | Directory-of-directories for NUMA systems | Scales to large systems; more hops |
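As an illustration of one of these tradeoffs, the sketch below models a coarse bit vector. The function names and the cluster size of 4 are assumptions chosen for the example, not parameters of any particular machine.

```python
def coarse_vector(sharers, cluster_size):
    """Set one bit per cluster that contains at least one sharer."""
    bits = 0
    for core in sharers:
        bits |= 1 << (core // cluster_size)
    return bits


def invalidation_targets(bits, cluster_size, num_cores):
    """Every core in a marked cluster must be invalidated."""
    return [c for c in range(num_cores)
            if (bits >> (c // cluster_size)) & 1]


NUM_CORES = 64
CLUSTER_SIZE = 4      # assumed cluster size
sharers = {5, 17}     # the cores that actually hold the line

bits = coarse_vector(sharers, CLUSTER_SIZE)
targets = invalidation_targets(bits, CLUSTER_SIZE, NUM_CORES)

print(NUM_CORES // CLUSTER_SIZE)    # 16
print(targets)                      # [4, 5, 6, 7, 16, 17, 18, 19]
print(len(targets) - len(sharers))  # 6
```

The sharer field shrinks from 64 bits to 16, but the directory can no longer tell which core inside a cluster holds the line. Here two real sharers cause eight invalidations, six of them unnecessary.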

In Non-Uniform Memory Access (NUMA) systems, each socket “owns” a portion of physical memory. The home node for address $A$ is the socket whose local memory contains $A$.

Socket 0 (Home for addr 0x0000-0x3FFF)
├── Cores 0-15
└── Local DRAM + Directory
Socket 1 (Home for addr 0x4000-0x7FFF)
├── Cores 16-31
└── Local DRAM + Directory

Accessing local memory is fast (~50 ns). Accessing remote memory requires going through the interconnect to the remote socket’s directory (~100-150 ns). This 2-3x latency difference is why NUMA-aware allocation matters.
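To make the cost of remote accesses concrete, here is a rough model using the address ranges from the diagram and the latencies quoted above. The function names are hypothetical, and the remote latency of 125 ns is simply the midpoint of the 100-150 ns range, chosen as an assumption.

```python
REGION_BYTES = 0x4000  # each socket is home to a 0x4000-byte range
LOCAL_NS = 50          # local access latency from the text
REMOTE_NS = 125        # assumed midpoint of the 100-150 ns range


def home_socket(addr):
    """Socket whose local memory (and directory) holds this address."""
    return addr // REGION_BYTES


def access_latency_ns(requesting_socket, addr):
    """Latency seen by a core on requesting_socket for this address."""
    if home_socket(addr) == requesting_socket:
        return LOCAL_NS
    return REMOTE_NS


def average_latency_ns(local_fraction):
    """Expected latency when local_fraction of accesses hit local memory."""
    return local_fraction * LOCAL_NS + (1 - local_fraction) * REMOTE_NS


print(home_socket(0x1000))           # 0
print(home_socket(0x5000))           # 1
print(access_latency_ns(0, 0x5000))  # 125
print(average_latency_ns(0.5))       # 87.5
```

Under these assumptions, a thread whose data is split evenly across both sockets sees an average of 87.5 ns per access, compared with 50 ns when all of its data is local. NUMA-aware allocation aims to push `local_fraction` toward 1.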