# Caching Is Easy. Invalidation Is Where Systems Falter.
Adding a cache is a one-afternoon task. You pick a layer, set a TTL, watch your database load drop. It feels like engineering.
Invalidation is the part nobody writes down. It is also the part that wakes you up at 3am. Users seeing stale data after an update. A rollback that leaves garbage in cache for 24 hours. A race condition where two services update the same key in the wrong order and nobody notices for a week.
Phil Karlton’s famous line — “there are only two hard things in computer science: cache invalidation and naming things” — is funny because it is true. The difficulty is not technical. It is that invalidation requires you to reason about the relationship between your write path and your read path across time, across services, and across failures.
Here is how to do it without guessing.
## 1. Pick the Right Cache Layer First
Before thinking about invalidation, you need to know what you are actually caching and why. Each layer has different invalidation semantics:
| Layer | Scope | Invalidation control |
|---|---|---|
| Browser cache | Per user, per device | Cache-Control headers, versioned URLs |
| CDN (Cloudflare, Fastly) | Global edge nodes | Purge API, surrogate keys, stale-while-revalidate |
| Redis / Memcached | Shared across services | Delete by key, versioned keys, pub/sub events |
| In-process (in-memory) | Single process instance | Full control, but no visibility across instances |
The mistake is picking a layer based on what is easy to add rather than what you are optimising for. If you are optimising for p95 latency, in-process cache with local state is fastest but hardest to invalidate across a fleet. If you are optimising for database load, Redis is the right answer. Know the tradeoff before you commit.
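For the first two rows of the table, invalidation usually means versioned URLs plus `Cache-Control` headers rather than an explicit purge. A minimal sketch of that approach, where the `asset_url` helper and the 12-character hash length are illustrative assumptions, not a real framework API:

```python
import hashlib

def asset_url(path: str, content: bytes) -> str:
    """Versioned URL: the hash changes whenever the content changes,
    so old URLs simply stop being referenced. No purge needed."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"/static/{path}?v={digest}"

def cache_headers(versioned: bool) -> dict[str, str]:
    if versioned:
        # Safe to cache "forever": a new deploy produces a new URL.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    # Unversioned HTML: revalidate each time, let the CDN serve stale briefly.
    return {"Cache-Control": "public, max-age=0, stale-while-revalidate=60"}
```

The design choice here is the one the table hints at: browser and CDN caches are easiest to "invalidate" by never reusing a URL at all.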
## 2. TTLs Are a Safety Net, Not a Strategy
Setting a TTL is not an invalidation strategy. It is a fallback for when your actual invalidation fails.
A 24-hour TTL on a user profile means stale data can live for up to 24 hours after an update. If that is acceptable — say, for a recommendation model that retrains daily — then a long TTL is fine. If users expect to see their profile picture update immediately after they change it, 24 hours is not a TTL, it is a bug with an expiry.
Short TTLs (60 seconds, 5 minutes) are often used as a substitute for event-driven invalidation. This works at small scale and fails loudly at large scale — it is the mechanism behind the thundering herd problem covered in the previous article.
The right mental model: TTL is a ceiling on how stale your data can get if nothing else fires. Your primary invalidation should be event-driven. TTL is the safety net underneath.
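In code, the distinction is that the event handler is the primary invalidation path, and the TTL merely bounds staleness if that handler never fires. A minimal sketch, assuming a cache client with `set`/`delete` and a hypothetical `on_user_updated` hook:

```python
from datetime import timedelta

# The ceiling on staleness if the event below is lost or delayed --
# chosen from the product requirement, not from habit.
MAX_ACCEPTABLE_STALENESS = timedelta(minutes=30)

def cache_profile(cache, user_id: int, profile: dict) -> None:
    # Safety net: worst case, this entry lives 30 minutes past an update.
    cache.set(f"user:{user_id}", profile, ttl=MAX_ACCEPTABLE_STALENESS)

def on_user_updated(cache, user_id: int) -> None:
    # Primary, event-driven invalidation: fires on every write.
    cache.delete(f"user:{user_id}")
```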
## 3. Versioned Keys Beat Delete in Distributed Systems
The naive invalidation approach is: on write, delete the cache key. Then the next read will miss and repopulate.
```
write user:123  →  cache.delete("user:123")
```
This breaks down in distributed systems for a subtle reason: you cannot guarantee the order of operations across services and cache nodes.
Consider this sequence:
- Service A reads `user:123` from cache and gets v6 of the profile
- Service B updates the user and deletes `user:123` from cache
- Service A, still processing, writes its stale v6 back into cache
- The cache now has stale data, and it will stay there until the next TTL expiry
This is a classic write-after-invalidation race. The delete happened, but then a concurrent reader wrote stale data back.
Versioned keys eliminate this race. Instead of `user:123`, the key is `user:123:v7`. When the user is updated, the version is incremented. Old keys become unreachable, not deleted — they age out via TTL.
```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class User:
    id: int
    version: int
    name: str
    email: str

def cache_key(user_id: int, version: int) -> str:
    return f"user:{user_id}:v{version}"

# On write: bump the version in the DB, then write the new key
def update_user(db, cache, user: User, ttl: timedelta) -> None:
    user.version += 1
    db.save(user)
    key = cache_key(user.id, user.version)
    cache.set(key, user, ttl)

# On read: fetch the current version from the DB, then read the versioned key
def get_user(db, cache, user_id: int) -> User | None:
    version = db.get_version(user_id)
    key = cache_key(user_id, version)
    cached = cache.get(key)
    if cached is not None:
        return cached
    user = db.get_user(user_id)
    if user is not None:  # avoid caching a miss under the versioned key
        cache.set(key, user, ttl=timedelta(minutes=10))
    return user
```
Rollbacks become trivially safe: decrement the version, and the old keys are often still sitting in cache, with no garbage to clean up. There are also no delete races, because you never delete anything — you simply stop using old keys.
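Under that scheme, a rollback is a one-line pointer move. A sketch, where `db.set_version` is a hypothetical helper that updates the authoritative version:

```python
def cache_key(user_id: int, version: int) -> str:
    return f"user:{user_id}:v{version}"

def rollback(db, user_id: int, to_version: int) -> str:
    # Point the authoritative version back at the old value. No cache
    # delete: the old key is likely still populated, and the abandoned
    # newer key is simply never read again -- it ages out via TTL.
    db.set_version(user_id, to_version)  # hypothetical DB helper
    return cache_key(user_id, to_version)
```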
## 4. Invalidate by Event, Not by Guessing
The write path should emit events. The cache invalidation layer consumes those events.
This separates two concerns that are often tangled: the service that writes data does not need to know what caches exist downstream. It just emits `UserUpdated(id: 123, version: 7)`. Every cache layer that cares about users subscribes and handles it.
```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class UserUpdatedEvent:
    user_id: int
    version: int

# Write path — knows nothing about cache topology
class UserService:
    def __init__(self, db, events):
        self.db = db
        self.events = events

    def update_user(self, user: User) -> None:
        self.db.save(user)
        self.events.publish("user.updated", UserUpdatedEvent(
            user_id=user.id,
            version=user.version,
        ))

# Cache invalidation consumer — owns its own logic
class UserCacheConsumer:
    def __init__(self, cache):
        self.cache = cache

    def handle(self, event: UserUpdatedEvent) -> None:
        # Bump the version index so subsequent reads use the new key
        version_key = f"user:{event.user_id}:version"
        self.cache.set(version_key, event.version, ttl=timedelta(hours=1))
```
This pattern scales well. When you add a new downstream service that caches user data, it subscribes to the same event stream without touching the write path. When you remove a service, likewise. The write path stays stable.
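One piece the consumer implies but does not show is the read path: it can resolve the current version from the cached version index and fall back to the database only when that index is cold. A sketch, with the `cache` and `db` interfaces as assumptions:

```python
from datetime import timedelta

def get_user_via_index(db, cache, user_id: int):
    # Resolve the current version from the index key the consumer bumps;
    # fall back to the database when the index entry has expired.
    version = cache.get(f"user:{user_id}:version")
    if version is None:
        version = db.get_version(user_id)
        cache.set(f"user:{user_id}:version", version, ttl=timedelta(hours=1))
    key = f"user:{user_id}:v{version}"
    user = cache.get(key)
    if user is None:
        user = db.get_user(user_id)
        cache.set(key, user, ttl=timedelta(minutes=10))
    return user
```

This keeps the hot path off the database entirely: a read that hits both the version index and the versioned key never touches the DB at all.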
## 5. Stale-While-Revalidate for Correctness on Hot Paths
When a cache miss blocks the user — they wait while you recompute — you have a latency problem that compounds under load. The alternative is to always serve something immediately, even if it is slightly stale, and refresh in the background.
This is stale-while-revalidate. Keep two TTLs per entry:
- Soft TTL: serve fresh until this point
- Hard TTL: serve stale (but trigger background refresh) until this point
```python
import threading
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any

@dataclass
class CacheEntry:
    value: Any
    soft_ttl: datetime
    hard_ttl: datetime

class StaleWhileRevalidateCache:
    def __init__(self):
        self._store: dict[str, CacheEntry] = {}

    def set(self, key: str, value: Any, soft: timedelta, hard: timedelta) -> None:
        now = datetime.utcnow()
        self._store[key] = CacheEntry(
            value=value,
            soft_ttl=now + soft,
            hard_ttl=now + hard,
        )

    def get(self, key: str) -> tuple[Any, bool]:
        entry = self._store.get(key)
        if not entry:
            return None, False
        now = datetime.utcnow()
        if now < entry.soft_ttl:
            return entry.value, True  # fresh, serve immediately
        if now < entry.hard_ttl:
            threading.Thread(  # stale — serve now, refresh async
                target=self._refresh, args=(key,), daemon=True
            ).start()
            return entry.value, True
        return None, False  # expired, must recompute

    def _refresh(self, key: str) -> None:
        # fetch from source and repopulate cache
        ...
```
Users see no latency increase. The database sees a steady stream of refreshes rather than a burst at TTL boundaries. This is the same mechanism CDNs use at the HTTP layer — bring it into your application cache layer.
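The branching in `get()` reduces to a three-way freshness classification that is worth reasoning about on its own. A standalone sketch, where the function name and sample timestamps are illustrative, not part of the cache class:

```python
from datetime import datetime, timedelta

def freshness(now: datetime, soft_ttl: datetime, hard_ttl: datetime) -> str:
    # The three states of a stale-while-revalidate entry:
    if now < soft_ttl:
        return "fresh"    # serve immediately, no refresh
    if now < hard_ttl:
        return "stale"    # serve immediately, trigger background refresh
    return "expired"      # caller must recompute synchronously

# Example window: 30s of freshness inside a 5-minute hard lifetime.
t0 = datetime(2024, 1, 1, 12, 0, 0)
soft = t0 + timedelta(seconds=30)
hard = t0 + timedelta(minutes=5)
```

Tuning is then just picking the two windows: the soft TTL sets how often refreshes happen, the hard TTL sets the worst staleness a user can ever see.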
## 6. Negative Caching and Stampede Guards
Two smaller but important practices.
Negative caching: if a lookup returns no result (a 404, a missing record), cache that absence explicitly for a short TTL. Without it, every request for a nonexistent key hits the database, and this is exactly the pattern a cache-penetration attack exploits: flood you with random keys that can never be cache hits.
```python
NOT_FOUND = object()  # sentinel value

user = db.get_user(user_id)
if user is None:
    cache.set(key, NOT_FOUND, ttl=timedelta(seconds=30))  # cache the miss
    raise UserNotFoundError(user_id)
```
Stampede guards: when a key expires and many requests miss simultaneously, only one should recompute. Use singleflight within a process and a distributed lock (Redis SET NX) across a fleet. Add jitter to TTLs so keys do not expire in a synchronized wave. The thundering herd article covers this in detail — these patterns apply here too.
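Both guards are small. A sketch of TTL jitter plus a distributed rebuild lock, assuming a redis-py-style client where `set(..., nx=True, ex=...)` maps to Redis `SET key value NX EX`:

```python
import random
from datetime import timedelta

def jittered_ttl(base: timedelta, spread: float = 0.1) -> timedelta:
    # Spread expiries over +/- 10% of the base TTL so keys written
    # together do not expire in a synchronized wave.
    factor = 1 + random.uniform(-spread, spread)
    return timedelta(seconds=base.total_seconds() * factor)

def try_acquire_rebuild_lock(redis, key: str, ttl_seconds: int = 30) -> bool:
    # SET with NX (only if absent) and EX (auto-expiring lock): exactly
    # one caller per fleet wins the right to recompute this key. The
    # lock expires on its own, so a crashed winner cannot wedge the key.
    return bool(redis.set(f"lock:{key}", "1", nx=True, ex=ttl_seconds))
```

Losers of the lock race should serve the stale value (or briefly wait and retry the cache), not fall through to the database.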
## Write Down Your Invalidation Strategy Before You Ship
This is the practical takeaway. Before any caching feature goes to production, the design should answer:
- What event triggers invalidation?
- Who is responsible for invalidating (the writer, an event consumer, a scheduled job)?
- What is the maximum staleness users will see?
- What happens on rollback?
- What happens if invalidation fails?
If the answer to any of these is “we will delete the key and hope” or “we will use a short TTL,” push back. Those answers are not strategies — they are deferred problems.
Monotonic versions per aggregate, event-driven invalidation, and stale-while-revalidate together make a system where cache correctness is a property of the design, not something you debug in production.