Directory Protocols
The Directory
Each cache line in memory has an associated directory entry that tracks:
- State: whether the line is uncached, shared, or exclusively owned
- Sharer list: which caches currently hold a copy
Directory Entry Format
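For a system with $N$ cores, a common representation is a bit vector: a 2-bit state field followed by one presence bit per core.

    [ state: 2 bits | sharers: N bits ]

For 64 cores, that’s 66 bits per cache line. With 64-byte cache lines and 256 GB of memory, that’s:

$$\frac{256\ \text{GB}}{64\ \text{B}} \times 66\ \text{bits} = 2^{32} \times 66\ \text{bits} = 33\ \text{GB} \quad (\approx 12.9\%\ \text{of memory})$$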
This is a significant overhead — real systems use compression techniques.
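The arithmetic above can be checked with a short sketch. The function names (`pack_entry`, `sharers_of`, `directory_bytes`) and the choice to place the state field in the bits above the sharer vector are assumptions made for illustration, not a layout taken from any particular machine.

```python
from enum import IntEnum


class State(IntEnum):
    UNCACHED = 0
    SHARED = 1
    EXCLUSIVE = 2


def pack_entry(state, sharers, num_cores):
    """Pack a 2-bit state above an N-bit presence vector (assumed layout)."""
    vector = 0
    for core in sharers:
        vector |= 1 << core          # one presence bit per core
    return (int(state) << num_cores) | vector


def sharers_of(entry, num_cores):
    """Decode the presence bits back into a list of core IDs."""
    return [core for core in range(num_cores) if (entry >> core) & 1]


def directory_bytes(num_cores, line_bytes, memory_bytes, state_bits=2):
    """Storage for a full bit-vector directory with one entry per memory line."""
    entry_bits = state_bits + num_cores
    num_lines = memory_bytes // line_bytes
    return entry_bits * num_lines // 8


GIB = 2**30
overhead = directory_bytes(num_cores=64, line_bytes=64, memory_bytes=256 * GIB)
print(overhead / GIB)            # 33.0
print(overhead / (256 * GIB))    # 0.12890625
```

Because the entry grows by one bit per core while the line size stays fixed, the relative overhead grows linearly with the core count.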
Directory Operations
Read Miss
- Core $i$ sends `Read(A)` to the home node (the directory controller for address $A$)
- The directory looks up the state of line $A$:
  - Uncached: fetch from memory, send to Core $i$, mark as Shared, add $i$ to the sharer list
  - Shared: send data to Core $i$, add $i$ to the sharer list
  - Exclusive/Modified by Core $j$: send an intervention to Core $j$; Core $j$ sends the data to Core $i$ (and optionally to memory); update the directory
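The three read-miss cases can be sketched as a small handler at the home node. This is a minimal model, not a real protocol implementation: the message names, the `Entry` class, and the decision to downgrade the previous owner to a sharer (rather than invalidate it) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    UNCACHED = "U"
    SHARED = "S"
    EXCLUSIVE = "E"


@dataclass
class Entry:
    state: State = State.UNCACHED
    sharers: set = field(default_factory=set)


def handle_read_miss(directory, addr, requester):
    """Update the directory for a read miss and return the messages sent."""
    entry = directory.setdefault(addr, Entry())
    if entry.state is State.EXCLUSIVE:
        (owner,) = entry.sharers                 # exactly one owner
        # The owner forwards the line to the requester (assumed: owner keeps
        # a shared copy afterwards).
        msgs = [("intervention", owner), ("data_from_owner", requester)]
    elif entry.state is State.SHARED:
        msgs = [("data", requester)]
    else:
        msgs = [("memory_fetch", addr), ("data", requester)]
    entry.state = State.SHARED
    entry.sharers.add(requester)
    return msgs


directory = {}
print(handle_read_miss(directory, 0x40, requester=1))
print(handle_read_miss(directory, 0x40, requester=2))
```

The Exclusive case is the expensive one: it adds an extra hop to the owning core before the requester receives data.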
Write Miss
- Core $i$ sends `Write(A)` to the home node
- The directory sends invalidation messages to all sharers in the bit vector
- Each sharer invalidates its copy and sends an ack
- Once all acks are received, Core $i$ gets exclusive access
The critical path is the invalidation round-trip: the writing core must wait for all sharers to acknowledge before proceeding.
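The ack-collection step can be sketched as a small bookkeeping object. The `PendingWrite` class and the `critical_path_ns` helper are hypothetical names introduced here; the latency model (the writer waits for the slowest ack) simply restates the critical-path observation above.

```python
class PendingWrite:
    """Tracks outstanding invalidation acks for one write miss (illustrative)."""

    def __init__(self, requester, sharers):
        self.requester = requester
        # The writer never invalidates itself.
        self.waiting_on = set(sharers) - {requester}

    def invalidations(self):
        """Cores that must receive an invalidation message."""
        return sorted(self.waiting_on)

    def on_ack(self, core):
        """Record an ack; return True once exclusive access can be granted."""
        self.waiting_on.discard(core)
        return not self.waiting_on


def critical_path_ns(ack_round_trips_ns):
    """The write completes only after the slowest sharer has acknowledged."""
    return max(ack_round_trips_ns, default=0)


write = PendingWrite(requester=0, sharers={1, 2, 5})
print(write.invalidations())            # [1, 2, 5]
print(write.on_ack(1), write.on_ack(5), write.on_ack(2))
print(critical_path_ns([40, 120, 60]))  # 120
```

A write to a line with no other sharers has nothing to wait on, which is why widely shared lines are the costly case for writes.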
Reducing Directory Overhead
| Technique | How it works | Tradeoff |
|---|---|---|
| Limited pointers | Track at most $k$ sharers per line; broadcast invalidation if there are more | Saves storage; occasional broadcasts |
| Coarse bit vector | One bit per cluster of cores, not per core | Fewer bits; some unnecessary invalidations |
| Sparse directory | Only store entries for cached lines (organized like a cache or hash table) | Saves memory; lookup overhead, and evicting a directory entry forces its cached copies to be invalidated |
| Hierarchical directory | Directory-of-directories for NUMA systems | Scales to large systems; more hops |
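Two of the techniques in the table are easy to model: the coarse bit vector and limited pointers. The sketch below is illustrative only; `coarse_targets` and `LimitedPointers` are hypothetical names, and real designs differ in how they handle pointer overflow.

```python
def coarse_targets(sharers, cores_per_cluster, num_cores):
    """Coarse bit vector: one bit per cluster, so every core in a marked
    cluster receives the invalidation, whether or not it holds the line."""
    clusters = {core // cores_per_cluster for core in sharers}
    return [c for c in range(num_cores) if c // cores_per_cluster in clusters]


class LimitedPointers:
    """Track up to k sharers exactly; fall back to broadcast on overflow."""

    def __init__(self, k):
        self.k = k
        self.pointers = []
        self.overflow = False

    def add_sharer(self, core):
        if self.overflow or core in self.pointers:
            return
        if len(self.pointers) < self.k:
            self.pointers.append(core)
        else:
            # No free pointer: give up exact tracking for this line.
            self.overflow = True
            self.pointers.clear()

    def invalidation_targets(self, num_cores):
        return list(range(num_cores)) if self.overflow else list(self.pointers)


# Two real sharers, but eight cores get invalidated with 4-core clusters.
print(coarse_targets({1, 9}, cores_per_cluster=4, num_cores=16))

entry = LimitedPointers(k=2)
for core in (3, 7):
    entry.add_sharer(core)
print(entry.invalidation_targets(num_cores=16))   # [3, 7]
entry.add_sharer(12)                               # third sharer overflows
print(len(entry.invalidation_targets(num_cores=16)))  # 16
```

Both schemes trade precision for storage: they never miss a sharer, but they may send invalidations to cores that do not hold the line.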
NUMA and Home Nodes
In Non-Uniform Memory Access (NUMA) systems, each socket “owns” a portion of physical memory. The home node for address $A$ is the socket whose local memory contains $A$.

    Socket 0 (Home for addr 0x0000-0x3FFF)
    ├── Cores 0-15
    └── Local DRAM + Directory

    Socket 1 (Home for addr 0x4000-0x7FFF)
    ├── Cores 16-31
    └── Local DRAM + Directory

Accessing local memory is fast (~50 ns). Accessing remote memory requires going through the interconnect to the remote socket’s directory (~100-150 ns). This 2-3x latency difference is why NUMA-aware allocation matters.
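The home-node mapping for the two-socket layout above can be sketched as a range lookup. The address ranges and core numbering come from the diagram; the function names are hypothetical, and the remote latency of 125 ns is an assumed midpoint of the 100-150 ns range, used only to make the comparison concrete.

```python
# (first address, last address, home socket), taken from the diagram above.
HOME_RANGES = [
    (0x0000, 0x3FFF, 0),
    (0x4000, 0x7FFF, 1),
]
CORES_PER_SOCKET = 16
LOCAL_NS = 50      # approximate local access latency from the text
REMOTE_NS = 125    # assumed midpoint of the 100-150 ns remote range


def home_node(addr):
    """Return the socket whose local memory (and directory) holds addr."""
    for first, last, socket in HOME_RANGES:
        if first <= addr <= last:
            return socket
    raise ValueError(f"address {addr:#x} is not mapped to any socket")


def socket_of_core(core):
    return core // CORES_PER_SOCKET


def access_latency_ns(core, addr):
    """Rough latency estimate: local if the core sits on the home socket."""
    return LOCAL_NS if socket_of_core(core) == home_node(addr) else REMOTE_NS


print(access_latency_ns(core=3, addr=0x0100))    # 50  (local)
print(access_latency_ns(core=3, addr=0x5000))    # 125 (remote)
print(access_latency_ns(core=20, addr=0x5000))   # 50  (local)
```

NUMA-aware allocation amounts to placing data so that the first case is the common one for the threads that use it.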