# Caching Is Easy. Invalidation Is Where Systems Falter.
Adding a cache is a one-afternoon task. You pick a layer, set a TTL, watch your database load drop. It feels like engineering.
Invalidation is the part nobody writes down. It is also the part that wakes you up at 3am. Users seeing stale data after an update. A rollback that leaves garbage in cache for 24 hours. A race condition where two services update the same key in the wrong order and nobody notices for a week.
Phil Karlton’s famous line — “there are only two hard things in computer science: cache invalidation and naming things” — is funny because it is true. The difficulty is not technical. It is that invalidation requires you to reason about the relationship between your write path and your read path across time, across services, and across failures.
Here is how to do it without guessing.
## 1. Pick the Right Cache Layer First
Before thinking about invalidation, you need to know what you are actually caching and why. Each layer has different invalidation semantics:
| Layer | Scope | Invalidation control |
|---|---|---|
| Browser cache | Per user, per device | Cache-Control headers, versioned URLs |
| CDN (Cloudflare, Fastly) | Global edge nodes | Purge API, surrogate keys, stale-while-revalidate |
| Redis / Memcached | Shared across services | Delete by key, versioned keys, pub/sub events |
| In-process (in-memory) | Single process instance | Full control, but no visibility across instances |
The mistake is picking a layer based on what is easy to add rather than what you are optimising for. If you are optimising for p95 latency, in-process cache with local state is fastest but hardest to invalidate across a fleet. If you are optimising for database load, Redis is the right answer. Know the tradeoff before you commit.
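For the first two rows of the table, invalidation usually means versioned URLs plus `Cache-Control` headers rather than an explicit purge. A minimal sketch of that approach, where the `asset_url` helper and the 12-character hash length are illustrative assumptions, not a real framework API:

```python
import hashlib

def asset_url(path: str, content: bytes) -> str:
    """Versioned URL: the hash changes whenever the content changes,
    so old URLs simply stop being referenced. No purge needed."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"/static/{path}?v={digest}"

def cache_headers(versioned: bool) -> dict[str, str]:
    if versioned:
        # Safe to cache "forever": a new deploy produces a new URL.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    # Unversioned HTML: revalidate each time, let the CDN serve stale briefly.
    return {"Cache-Control": "public, max-age=0, stale-while-revalidate=60"}
```

The design choice here is the one the table hints at: browser and CDN caches are easiest to "invalidate" by never reusing a URL at all.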
## 2. TTLs Are a Safety Net, Not a Strategy
Setting a TTL is not an invalidation strategy. It is a fallback for when your actual invalidation fails.
A 24-hour TTL on a user profile means stale data can live for up to 24 hours after an update. If that is acceptable — say, for a recommendation model that retrains daily — then a long TTL is fine. If users expect to see their profile picture update immediately after they change it, 24 hours is not a TTL, it is a bug with an expiry.
Short TTLs (60 seconds, 5 minutes) are often used as a substitute for event-driven invalidation. This works at small scale and fails loudly at large scale — it is the mechanism behind the thundering herd problem covered in the previous article.
The right mental model: TTL is a ceiling on how stale your data can get if nothing else fires. Your primary invalidation should be event-driven. TTL is the safety net underneath.
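In code, the distinction is that the event handler is the primary invalidation path, and the TTL merely bounds staleness if that handler never fires. A minimal sketch, assuming a cache client with `set`/`delete` and a hypothetical `on_user_updated` hook:

```python
from datetime import timedelta

# The ceiling on staleness if the event below is lost or delayed --
# chosen from the product requirement, not from habit.
MAX_ACCEPTABLE_STALENESS = timedelta(minutes=30)

def cache_profile(cache, user_id: int, profile: dict) -> None:
    # Safety net: worst case, this entry lives 30 minutes past an update.
    cache.set(f"user:{user_id}", profile, ttl=MAX_ACCEPTABLE_STALENESS)

def on_user_updated(cache, user_id: int) -> None:
    # Primary, event-driven invalidation: fires on every write.
    cache.delete(f"user:{user_id}")
```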
## 3. Versioned Keys Beat Delete in Distributed Systems
The naive invalidation approach is: on write, delete the cache key. Then the next read will miss and repopulate.
```
write user:123  →  cache.delete("user:123")
```
This breaks down in distributed systems for a subtle reason: you cannot guarantee the order of operations across services and cache nodes.
Consider this sequence:
- Service A reads `user:123` from cache and gets v6 of the profile
- Service B updates the user and deletes `user:123` from cache
- Service A, still processing, writes its stale v6 back into cache
- The cache now has stale data, and it will stay there until the next TTL expiry
This is a classic write-after-invalidation race. The delete happened, but then a concurrent reader wrote stale data back.
Versioned keys eliminate this race. Instead of `user:123`, the key is `user:123:v7`. When the user is updated, the version is incremented. Old keys become unreachable, not deleted — they age out via TTL.
```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class User:
    id: int
    version: int
    name: str
    email: str

def cache_key(user_id: int, version: int) -> str:
    return f"user:{user_id}:v{version}"

# On write: bump the version in the DB, then write the new key
def update_user(db, cache, user: User, ttl: timedelta) -> None:
    user.version += 1
    db.save(user)
    key = cache_key(user.id, user.version)
    cache.set(key, user, ttl)

# On read: fetch the current version from the DB, then read the versioned key
def get_user(db, cache, user_id: int) -> User | None:
    version = db.get_version(user_id)
    key = cache_key(user_id, version)
    cached = cache.get(key)
    if cached is not None:
        return cached
    user = db.get_user(user_id)
    if user is not None:  # avoid caching a miss under the versioned key
        cache.set(key, user, ttl=timedelta(minutes=10))
    return user
```
Rollbacks become trivially safe: decrement the version, and the old keys are often still sitting in cache, with no garbage to clean up. There are also no delete races, because you never delete anything — you simply stop using old keys.
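Under that scheme, a rollback is a one-line pointer move. A sketch, where `db.set_version` is a hypothetical helper that updates the authoritative version:

```python
def cache_key(user_id: int, version: int) -> str:
    return f"user:{user_id}:v{version}"

def rollback(db, user_id: int, to_version: int) -> str:
    # Point the authoritative version back at the old value. No cache
    # delete: the old key is likely still populated, and the abandoned
    # newer key is simply never read again -- it ages out via TTL.
    db.set_version(user_id, to_version)  # hypothetical DB helper
    return cache_key(user_id, to_version)
```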
## 4. Invalidate by Event, Not by Guessing
The write path should emit events. The cache invalidation layer consumes those events.
This separates two concerns that are often tangled: the service that writes data does not need to know what caches exist downstream. It just emits `UserUpdated(id: 123, version: 7)`. Every cache layer that cares about users subscribes and handles it.
```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class UserUpdatedEvent:
    user_id: int
    version: int

# Write path — knows nothing about cache topology
class UserService:
    def __init__(self, db, events):
        self.db = db
        self.events = events

    def update_user(self, user: User) -> None:
        self.db.save(user)
        self.events.publish("user.updated", UserUpdatedEvent(
            user_id=user.id,
            version=user.version,
        ))

# Cache invalidation consumer — owns its own logic
class UserCacheConsumer:
    def __init__(self, cache):
        self.cache = cache

    def handle(self, event: UserUpdatedEvent) -> None:
        # Bump the version index so subsequent reads use the new key
        version_key = f"user:{event.user_id}:version"
        self.cache.set(version_key, event.version, ttl=timedelta(hours=1))
```
This pattern scales well. When you add a new downstream service that caches user data, it subscribes to the same event stream without touching the write path. When you remove a service, likewise. The write path stays stable.
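One piece the consumer implies but does not show is the read path: it can resolve the current version from the cached version index and fall back to the database only when that index is cold. A sketch, with the `cache` and `db` interfaces as assumptions:

```python
from datetime import timedelta

def get_user_via_index(db, cache, user_id: int):
    # Resolve the current version from the index key the consumer bumps;
    # fall back to the database when the index entry has expired.
    version = cache.get(f"user:{user_id}:version")
    if version is None:
        version = db.get_version(user_id)
        cache.set(f"user:{user_id}:version", version, ttl=timedelta(hours=1))
    key = f"user:{user_id}:v{version}"
    user = cache.get(key)
    if user is None:
        user = db.get_user(user_id)
        cache.set(key, user, ttl=timedelta(minutes=10))
    return user
```

This keeps the hot path off the database entirely: a read that hits both the version index and the versioned key never touches the DB at all.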
## 5. Stale-While-Revalidate for Correctness on Hot Paths
When a cache miss blocks the user — they wait while you recompute — you have a latency problem that compounds under load. The alternative is to always serve something immediately, even if it is slightly stale, and refresh in the background.
This is stale-while-revalidate. Keep two TTLs per entry:
- Soft TTL: serve fresh until this point
- Hard TTL: serve stale (but trigger background refresh) until this point
```python
import threading
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any

@dataclass
class CacheEntry:
    value: Any
    soft_ttl: datetime
    hard_ttl: datetime

class StaleWhileRevalidateCache:
    def __init__(self):
        self._store: dict[str, CacheEntry] = {}

    def set(self, key: str, value: Any, soft: timedelta, hard: timedelta) -> None:
        now = datetime.utcnow()
        self._store[key] = CacheEntry(
            value=value,
            soft_ttl=now + soft,
            hard_ttl=now + hard,
        )

    def get(self, key: str) -> tuple[Any, bool]:
        entry = self._store.get(key)
        if not entry:
            return None, False
        now = datetime.utcnow()
        if now < entry.soft_ttl:
            return entry.value, True  # fresh, serve immediately
        if now < entry.hard_ttl:
            threading.Thread(  # stale — serve now, refresh async
                target=self._refresh, args=(key,), daemon=True
            ).start()
            return entry.value, True
        return None, False  # expired, must recompute

    def _refresh(self, key: str) -> None:
        # fetch from source and repopulate cache
        ...
```
Users see no latency increase. The database sees a steady stream of refreshes rather than a burst at TTL boundaries. This is the same mechanism CDNs use at the HTTP layer — bring it into your application cache layer.
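The branching in `get()` reduces to a three-way freshness classification that is worth reasoning about on its own. A standalone sketch, where the function name and sample timestamps are illustrative, not part of the cache class:

```python
from datetime import datetime, timedelta

def freshness(now: datetime, soft_ttl: datetime, hard_ttl: datetime) -> str:
    # The three states of a stale-while-revalidate entry:
    if now < soft_ttl:
        return "fresh"    # serve immediately, no refresh
    if now < hard_ttl:
        return "stale"    # serve immediately, trigger background refresh
    return "expired"      # caller must recompute synchronously

# Example window: 30s of freshness inside a 5-minute hard lifetime.
t0 = datetime(2024, 1, 1, 12, 0, 0)
soft = t0 + timedelta(seconds=30)
hard = t0 + timedelta(minutes=5)
```

Tuning is then just picking the two windows: the soft TTL sets how often refreshes happen, the hard TTL sets the worst staleness a user can ever see.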
## 6. Negative Caching and Stampede Guards
Two smaller but important practices.
Negative caching: if a lookup returns no result (a 404, a missing record), cache that absence explicitly for a short TTL. Without it, every request for a nonexistent key hits the database, and this is exactly the pattern a cache-penetration attack exploits: flood you with random keys that can never be cache hits.
```python
NOT_FOUND = object()  # sentinel value

user = db.get_user(user_id)
if user is None:
    cache.set(key, NOT_FOUND, ttl=timedelta(seconds=30))  # cache the miss
    raise UserNotFoundError(user_id)
```
Stampede guards: when a key expires and many requests miss simultaneously, only one should recompute. Use singleflight within a process and a distributed lock (Redis SET NX) across a fleet. Add jitter to TTLs so keys do not expire in a synchronized wave. The thundering herd article covers this in detail — these patterns apply here too.
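Both guards are small. A sketch of TTL jitter plus a distributed rebuild lock, assuming a redis-py-style client where `set(..., nx=True, ex=...)` maps to Redis `SET key value NX EX`:

```python
import random
from datetime import timedelta

def jittered_ttl(base: timedelta, spread: float = 0.1) -> timedelta:
    # Spread expiries over +/- 10% of the base TTL so keys written
    # together do not expire in a synchronized wave.
    factor = 1 + random.uniform(-spread, spread)
    return timedelta(seconds=base.total_seconds() * factor)

def try_acquire_rebuild_lock(redis, key: str, ttl_seconds: int = 30) -> bool:
    # SET with NX (only if absent) and EX (auto-expiring lock): exactly
    # one caller per fleet wins the right to recompute this key. The
    # lock expires on its own, so a crashed winner cannot wedge the key.
    return bool(redis.set(f"lock:{key}", "1", nx=True, ex=ttl_seconds))
```

Losers of the lock race should serve the stale value (or briefly wait and retry the cache), not fall through to the database.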
## Write Down Your Invalidation Strategy Before You Ship
This is the practical takeaway. Before any caching feature goes to production, the design should answer:
- What event triggers invalidation?
- Who is responsible for invalidating (the writer, an event consumer, a scheduled job)?
- What is the maximum staleness users will see?
- What happens on rollback?
- What happens if invalidation fails?
If the answer to any of these is “we will delete the key and hope” or “we will use a short TTL,” push back. Those answers are not strategies — they are deferred problems.
Monotonic versions per aggregate, event-driven invalidation, and stale-while-revalidate together make a system where cache correctness is a property of the design, not something you debug in production.