Skip to content

Cache Operations

When the CPU reads address AA:

  1. Compute the index and tag from AA
  2. Look up the set at that index
  3. Compare the tag against all ways in the set
  4. Hit: return the data from the matching way
  5. Miss: fetch the line from the next cache level, place it in the set, return data

On a miss, if the set is full, one existing line must be evicted to make room.

Every write updates both the cache and the next level immediately.

  • Pro: Memory is always up-to-date; simpler coherence
  • Con: Every write generates memory traffic; slow

Writes only update the cache. The line is marked dirty. The dirty line is written to memory only when it’s evicted.

  • Pro: Reduces memory traffic dramatically (multiple writes to the same line are absorbed)
  • Con: Memory may contain stale data; evictions are more expensive

On a write miss (the address isn’t in the cache):

  • Write-allocate: fetch the line into the cache, then write to it. Pairs naturally with write-back.
  • No-write-allocate: write directly to the next level without caching. Pairs with write-through.

When a set is full and a new line must be loaded, which existing line do we evict?

PolicyDescriptionUsed in
LRUEvict the Least Recently Used lineConceptually ideal, expensive for high associativity
Pseudo-LRUTree-based approximation of LRUMost real L1/L2 caches
RandomEvict a random lineSome L3 caches; surprisingly competitive

True LRU requires tracking the access order of all WW ways — for 16-way, that’s log2(16!)44\log_2(16!) \approx 44 bits per set. Pseudo-LRU uses a binary tree with W1W - 1 bits, which is much cheaper.

Miss TypeCauseMitigation
CompulsoryFirst access to a block (cold start)Prefetching
CapacityWorking set exceeds cache sizeLarger cache, better algorithms
ConflictMultiple blocks compete for the same setHigher associativity

Understanding these categories helps diagnose performance problems: if increasing cache size doesn’t help, you likely have conflict misses (try higher associativity) or the working set fundamentally doesn’t fit.

Modern CPUs detect access patterns and prefetch cache lines before they’re needed:

  • Stride prefetcher: detects sequential or strided access (e.g., array traversal)
  • Stream prefetcher: tracks multiple concurrent access streams
  • Software prefetch: explicit __builtin_prefetch() hints from the programmer

Prefetching converts compulsory misses into hits by loading data before the CPU asks for it.