The Thundering Herd Problem#

Your site’s homepage is heavily cached. The cache key expires every 60 seconds. At 12:00 PM, the cache expires. At 12:00:01 PM, 10,000 concurrent user requests come in. None of them find anything in cache. All 10,000 hit your database simultaneously to regenerate the exact same content.

Your database crashes. The site goes down.

This is a cache stampede, also called a thundering herd. The irony is that the cache was there to protect the database. The moment it failed to do its job, it caused something worse than if it had never existed.


Why It Happens#

The failure mode is simple: you have no coordination between concurrent readers on a cache miss.

When a key expires, every waiting request independently concludes “nothing in cache, I need to compute this.” They all race to the database. The database receives N times the expected load in a single instant — not spread over time, but all at once, for the exact same query.

At low traffic this is invisible. At scale it is catastrophic, and the damage is proportional to how popular the cached content is. Your homepage, your trending feed, your most-viewed product — these are the keys that will hurt you.

The mental model that matters: a cache miss path is a critical section. If you do not protect it, you have no mutual exclusion, and every TTL boundary becomes a potential self-inflicted denial of service.
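To see the race in miniature, here is a self-contained sketch. The barrier is contrived, but it models the worst case exactly: every request observes the miss before any of them finishes rebuilding, so every one of them queries the database.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var cache sync.Map // empty: the key just expired
	var dbCalls int64
	const n = 10000

	var checked, done sync.WaitGroup
	checked.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			_, hit := cache.Load("homepage")
			checked.Done()
			checked.Wait() // every request sees the miss before anyone rebuilds
			if !hit {
				atomic.AddInt64(&dbCalls, 1) // stand-in for the identical DB query
				cache.Store("homepage", []byte("homepage"))
			}
		}()
	}
	done.Wait()
	fmt.Println("database queries for one key:", dbCalls) // 10000, not 1
}
```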

flowchart LR
    subgraph users["10,000 concurrent requests"]
        R1(["req 1"])
        R2(["req 2"])
        RN(["req N..."])
    end
    CACHE[("Cache\nKEY EXPIRED")]
    DB[("Database\nCRASHED")]
    R1 -->|"cache miss"| CACHE
    R2 -->|"cache miss"| CACHE
    RN -->|"cache miss"| CACHE
    CACHE -->|"all 10,000 fall through"| DB

Fix 1: Single Flight / Request Coalescing#

The cleanest solution is to allow only one request to rebuild the value. Everyone else either waits for that result or gets a stale copy.

In Go, the golang.org/x/sync/singleflight package exists for exactly this:

var group singleflight.Group

func getHomepage(ctx context.Context) ([]byte, error) {
    result, err, _ := group.Do("homepage", func() (interface{}, error) {
        return fetchFromDB(ctx)
    })
    if err != nil {
        return nil, err // don't type-assert a nil result on failure
    }
    return result.([]byte), nil
}

The Do call deduplicates: if 10,000 goroutines call it concurrently, only one actually runs the function. The other 9,999 block and share the result when it arrives.
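To see why only one call runs, here is a hand-rolled miniature of the same mechanism: a mutex-guarded map of in-flight calls. The `waiting` counter is demo-only instrumentation so the example is deterministic; the real package to use is golang.org/x/sync/singleflight.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// call is one in-flight computation whose result all waiters share.
type call struct {
	wg  sync.WaitGroup
	val []byte
	err error
}

// Group deduplicates concurrent Do calls for the same key, mirroring
// the shape of golang.org/x/sync/singleflight.
type Group struct {
	mu      sync.Mutex
	m       map[string]*call
	waiting int64 // demo-only: goroutines that joined an in-flight call
}

func (g *Group) Do(key string, fn func() ([]byte, error)) ([]byte, error) {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		g.mu.Unlock()
		atomic.AddInt64(&g.waiting, 1)
		c.wg.Wait() // share the winner's result instead of recomputing
		return c.val, c.err
	}
	c := new(call)
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val, c.err = fn() // only the first caller runs the function
	c.wg.Done()

	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()
	return c.val, c.err
}

func main() {
	var g Group
	var dbCalls int64
	gate := make(chan struct{})
	started := make(chan struct{})

	fetch := func() ([]byte, error) {
		close(started) // the rebuild is now in flight
		<-gate         // hold it open until all waiters have joined
		atomic.AddInt64(&dbCalls, 1)
		return []byte("<html>homepage</html>"), nil
	}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // the winner
		defer wg.Done()
		g.Do("homepage", fetch)
	}()
	<-started

	for i := 0; i < 999; i++ { // 999 more concurrent cache misses
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Do("homepage", fetch)
		}()
	}
	for atomic.LoadInt64(&g.waiting) < 999 {
		runtime.Gosched() // wait until every loser has joined the call
	}
	close(gate)
	wg.Wait()
	fmt.Println("database calls:", dbCalls) // 1, not 1000
}
```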

In a distributed setup across multiple servers, singleflight alone is not enough — each server will deduplicate within itself but not across the fleet. For that you need a distributed lock in Redis or a coordination layer, with losers either waiting or serving stale.

flowchart TD
    MISS["Cache Miss"]
    LOCK["Distributed Lock\nSET lock:key NX PX 5000"]
    MISS --> LOCK
    LOCK -->|"wins"| WINNER["1 worker\nfetches from DB"]
    LOCK -->|"loses"| LOSERS["9,999 others\nwait / serve stale"]
    WINNER --> DB[("Database\n1 query")]
    DB --> CACHE[("Cache\npopulated")]
    CACHE --> RESULT["All requests\nget the result"]
    LOSERS --> RESULT

Fix 2: Stale-While-Revalidate#

The idea is two TTLs on every cache entry: a soft TTL and a hard TTL.

  • Soft TTL: when this expires, the value is “stale but still usable.” Serve it immediately, but trigger a background job to refresh it.
  • Hard TTL: when this expires, the value is gone. Only used as a fallback floor.
Store: { value, soft_ttl: now+55s, hard_ttl: now+120s }

On read:
  if now < soft_ttl            → serve fresh
  if soft_ttl ≤ now < hard_ttl → serve stale, kick off background refresh
  if now ≥ hard_ttl            → must block and recompute (last resort)

Users never wait because there is always something to serve. The database never sees a stampede because refreshes happen in the background, one at a time, before the hard TTL hits. This is how CDNs like Cloudflare implement stale-while-revalidate at the HTTP layer — you can apply the same pattern at the application cache level.
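Here is a minimal in-process sketch of the two-TTL read path. The names (`SWRCache`, `Get`, `fetch`) are illustrative, not any real library's API, and a production version would add per-key refresh deduplication.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// entry carries two expiry times: soft (stale boundary) and hard (gone).
type entry struct {
	value      []byte
	softExpiry time.Time
	hardExpiry time.Time
}

// SWRCache is a minimal stale-while-revalidate cache sketch.
type SWRCache struct {
	mu         sync.Mutex
	entries    map[string]*entry
	refreshing int32 // 1 while a background refresh is in flight
	softTTL    time.Duration
	hardTTL    time.Duration
	fetch      func(key string) ([]byte, error)
}

func NewSWRCache(soft, hard time.Duration, fetch func(string) ([]byte, error)) *SWRCache {
	return &SWRCache{entries: map[string]*entry{}, softTTL: soft, hardTTL: hard, fetch: fetch}
}

func (c *SWRCache) set(key string, v []byte) {
	now := time.Now()
	c.mu.Lock()
	c.entries[key] = &entry{value: v, softExpiry: now.Add(c.softTTL), hardExpiry: now.Add(c.hardTTL)}
	c.mu.Unlock()
}

func (c *SWRCache) Get(key string) ([]byte, error) {
	now := time.Now()
	c.mu.Lock()
	e, ok := c.entries[key]
	c.mu.Unlock()

	switch {
	case ok && now.Before(e.softExpiry):
		return e.value, nil // fresh: serve immediately
	case ok && now.Before(e.hardExpiry):
		// Stale but usable: serve it now, refresh in the background.
		if atomic.CompareAndSwapInt32(&c.refreshing, 0, 1) {
			go func() {
				defer atomic.StoreInt32(&c.refreshing, 0)
				if v, err := c.fetch(key); err == nil {
					c.set(key, v)
				}
			}()
		}
		return e.value, nil
	default:
		// Hard-expired or never cached: last resort, block and recompute.
		v, err := c.fetch(key)
		if err != nil {
			return nil, err
		}
		c.set(key, v)
		return v, nil
	}
}

func main() {
	var dbCalls int64
	cache := NewSWRCache(55*time.Second, 120*time.Second, func(key string) ([]byte, error) {
		atomic.AddInt64(&dbCalls, 1)
		return []byte("homepage"), nil
	})

	v, _ := cache.Get("homepage") // miss: blocks and computes
	fmt.Println(string(v), "db calls:", atomic.LoadInt64(&dbCalls))
	v, _ = cache.Get("homepage") // fresh: served from cache, no DB call
	fmt.Println(string(v), "db calls:", atomic.LoadInt64(&dbCalls))
}
```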

flowchart LR
    W["Cache Write\nt = 0"]
    W --> FRESH["FRESH\n0s → 55s\nServe immediately"]
    FRESH --> STALE["STALE\n55s → 120s\nServe instantly\n+ background refresh"]
    STALE --> GONE["HARD EXPIRE\n> 120s\nMust recompute\n(last resort)"]
    STALE -.->|"async"| BG["Background\nworker refreshes\ncache quietly"]

Fix 3: Distributed Lock with Jitter#

If you cannot serve stale, the next best option is a lock with a short lease.

In Redis:

SET lock:homepage <worker-id> NX PX 5000

NX means “only set if not exists.” PX 5000 is a 5-second expiry so the lock cannot be held forever if the worker crashes.

The worker that wins the lock recomputes and populates the cache. Losers poll cache with a small sleep until the value appears, then serve it.
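The winner/loser shape can be sketched in-process. Here `memLock` fakes the SET NX PX semantics; in production both `TryLock` and the cache would be Redis calls, and the names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// memLock mimics Redis `SET key val NX PX ttl` in-process:
// NX = only set if not already held, PX = auto-expire the lease.
type memLock struct {
	mu     sync.Mutex
	expiry map[string]time.Time
}

func (l *memLock) TryLock(key string, ttl time.Duration) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if exp, held := l.expiry[key]; held && time.Now().Before(exp) {
		return false // NX: someone else holds it
	}
	l.expiry[key] = time.Now().Add(ttl) // PX: released even if we crash
	return true
}

var (
	lock  = &memLock{expiry: map[string]time.Time{}}
	cache sync.Map
)

func getHomepage() ([]byte, error) {
	if v, ok := cache.Load("homepage"); ok {
		return v.([]byte), nil
	}
	if lock.TryLock("lock:homepage", 5*time.Second) {
		v := []byte("rebuilt homepage") // stand-in for the DB query
		cache.Store("homepage", v)
		return v, nil
	}
	// Loser: poll the cache until the winner populates it.
	for i := 0; i < 50; i++ {
		time.Sleep(100 * time.Millisecond)
		if v, ok := cache.Load("homepage"); ok {
			return v.([]byte), nil
		}
	}
	return nil, errors.New("timed out waiting for cache rebuild")
}

func main() {
	v, err := getHomepage()
	fmt.Println(string(v), err)
}
```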

The equally important part is jitter on expiry times. If you set every homepage cache key to expire at exactly 60 seconds, all of them expire together. Randomise the TTL by ±10% so keys are staggered:

// ±10% jitter around a 60s base: TTLs land in [54s, 66s)
base := 60 * time.Second
jitter := time.Duration(rand.Int63n(int64(base/5))) - base/10
cache.Set("homepage", value, base+jitter)

Jitter is cheap and eliminates a whole class of correlated failures. Use it everywhere you set expiry times.


Fix 4: Proactive Refresh#

The most reliable protection is to never let a hot key expire under load in the first place.

If you know a key is hot — homepage, trending content, featured product — refresh it on a schedule before users become your cron job:

Every 55 seconds: refresh homepage cache
On publish event: immediately invalidate and refresh affected keys

This decouples cache population from user requests entirely. No miss, no stampede, no problem. The downside is that you need to know which keys are hot in advance, which requires either instrumentation or heuristics. For predictable hot keys like a homepage this is a no-brainer. For long-tail content it is less practical.
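The schedule-driven refresh can be sketched with a `time.Ticker`: refresh at 55 seconds to stay comfortably inside a 60-second TTL. The names and intervals are illustrative.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	mu    sync.RWMutex
	cache = map[string][]byte{}
)

func refreshHomepage() {
	v := []byte(time.Now().Format(time.RFC3339)) // stand-in for the DB query
	mu.Lock()
	cache["homepage"] = v
	mu.Unlock()
}

func main() {
	refreshHomepage() // warm the cache before any traffic arrives

	// Refresh every 55s so the 60s TTL never expires under load.
	ticker := time.NewTicker(55 * time.Second)
	defer ticker.Stop()
	go func() {
		for range ticker.C {
			refreshHomepage()
		}
	}()

	mu.RLock()
	_, warm := cache["homepage"]
	mu.RUnlock()
	fmt.Println("cache warm:", warm)
}
```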


Fix 5: Circuit Breakers on the Rebuild Path#

Even with all the above, assume that sometimes a rebuild will fail. The database has a bad moment. The downstream service is slow. The lock winner times out.

Without protection, that failure cascades: the lock expires, new winner tries, also fails, repeat. You get a retry storm on top of the original stampede.

Protect the rebuild path the same way you protect any external dependency:

  • Concurrency limit: only N goroutines allowed to hit the database for cache rebuilds at once. Others degrade immediately.
  • Timeout: rebuild attempts have a strict deadline. A slow rebuild is worse than a fast failure because it holds the lock and blocks everyone waiting.
  • Degraded mode: if rebuild fails, serve the last known stale value with a flag rather than returning an error. An outdated homepage is far better than a 500.
if rebuildErr != nil {
    stale, exists := cache.GetStale("homepage")
    if exists {
        return stale, nil  // serve stale silently
    }
    return nil, rebuildErr
}
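The concurrency limit and timeout can be sketched with a buffered channel as a semaphore plus `context.WithTimeout`; the slot count and deadline here are illustrative, and `fetch` stands in for the real DB call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// rebuildSem caps concurrent DB rebuilds; the size is illustrative.
var rebuildSem = make(chan struct{}, 4)

func rebuild(ctx context.Context, fetch func(context.Context) ([]byte, error)) ([]byte, error) {
	select {
	case rebuildSem <- struct{}{}: // acquired a rebuild slot
		defer func() { <-rebuildSem }()
	default:
		// Over the limit: degrade immediately instead of queueing.
		return nil, errors.New("rebuild capacity exhausted")
	}
	// Strict deadline: a slow rebuild holds the lock and blocks waiters.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return fetch(ctx)
}

func main() {
	v, err := rebuild(context.Background(), func(ctx context.Context) ([]byte, error) {
		return []byte("homepage"), nil // stand-in for the DB query
	})
	fmt.Println(string(v), err)
}
```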

Putting It Together#

None of these fixes are mutually exclusive. A production-grade cache layer uses all of them in layers:

  1. Proactive refresh for known hot keys — most stampedes never happen
  2. Stale-while-revalidate as the default read strategy — users never wait
  3. Single flight / distributed lock as the fallback when stale is unavailable
  4. Jitter on every TTL — spreads load across time
  5. Circuit breakers on the rebuild path — contains failures when they do occur

The cache miss path is a critical section. Treat it like one from the start, before traffic proves you wrong.