JWKS in Production

Multi-Tenancy, Security, and Operations

Posted on January 16, 2026

This article is part of a series on understanding the hows and whys of JSON Web Signatures (JWS).

There's accompanying code: it's referred to and linked throughout the content. But if you'd rather just read raw code, head over here.

This is a comprehensive reference guide covering aspects of production JWKS systems. Some may find it long and tedious, so feel free to jump to the sections most relevant to you.

We've previously gone over how to implement JWKS and achieve zero-downtime key rotation. You have the fundamentals: publishing keys at /.well-known/jwks.json, the four-phase rotation process, basic TTL-based caching. That's enough to get JWKS running.

But production systems—especially in financial services—face challenges that don't appear in tutorials:

  • Multi-tenancy: You're not verifying tokens from one partner. You're verifying from 200 issuer banks, each with their own JWKS endpoint, rotation schedule, and operational quirks.
  • Partner outages: When a partner's JWKS endpoint goes down at 3 AM, do you reject all their legitimate transactions? Or use stale cached keys and risk accepting compromised signatures?
  • Security incidents: What happens when an attacker steals a partner's private key and simultaneously launches a DoS attack against their JWKS endpoint?
  • Operational maturity: How do you monitor 200 partners' key health? Who gets paged when caches go stale? What's the incident response runbook?

This post covers production-grade JWKS architecture: multi-tenant caching, graceful degradation, security safeguards for grace periods, and operational patterns that let you sleep at night.

We'll be building on the fundamentals introduced in the previous post, so if you haven't read it yet, do so now.

Multi-Tenancy: Managing Keys for Many Partners

If you're building a payment scheme API, you're not just verifying signatures from one partner; you're verifying signatures from dozens or hundreds. Each partner has their own JWKS endpoint, their own key rotation schedule, their own operational quirks. This is where JWKS architecture gets interesting.

The Challenge

Your scheme processes authorization requests from 200 issuer banks. Each issuer:

  • Publishes their own JWKS at their own domain
  • Rotates keys on their own schedule
  • May have 2-5 active keys at any given time
  • Has different SLAs for JWKS endpoint availability

Naive approach:

  
  
    
# DON'T - Fetching JWKS on every request
# This is slow, unreliable, and will kill your latency
def verify_partner_signature(jws, partner_id) do
  partner_domain = get_partner_domain(partner_id)
  {:ok, %{body: jwks}} = Req.get("https://#{partner_domain}/.well-known/jwks.json")

  # Extract kid and find matching key
  {:ok, kid} = extract_kid(jws)
  key = find_key_in_jwks(jwks, kid)

  # Verify signature
  JOSE.JWS.verify_strict(key, ["ES256"], jws)
end

    
  

Problems:

  • Makes a network call on every verification (adds 50-200ms latency)
  • If a partner JWKS endpoint goes down, all their requests will fail
  • No caching means there's a DoS vulnerability (attacker sends many requests, you make many JWKS fetches)
  • Partner endpoints (which aren't under your control) can impact your own availability

Architecture: Per-Partner JWKS Cache

A production-grade JWKS architecture for multi-tenant systems must satisfy three core requirements:

  • Isolation: One partner's key issues cannot impact others. Each partner gets their own cache namespace with independent TTL, grace period, and error handling.
  • Performance: Key lookups must complete in microseconds, not milliseconds. Cache hits avoid network calls entirely, reducing latency 100-2000x compared to fetching JWKS on every request.
  • Graceful degradation: Temporary partner outages should not block legitimate transactions. Stale keys remain usable during endpoint unavailability, with automatic recovery when service resumes.

Cache Structure

The cache uses ETS (Erlang Term Storage) for lock-free concurrent reads, with composite keys combining partner identity and key ID:

  
  
    
# ETS table structure: :jwks_cache
# Entry format: {{partner_id, kid}, jwk, cached_at, ttl}

:ets.insert(:jwks_cache, {
  {"partner_abc", "2025-01-key"},  # Composite key
  jwk,                              # JOSE.JWK structure
  1704470400,                       # Unix timestamp (cached_at)
  900                               # TTL in seconds (15 minutes)
})

    
  

Each partner maintains independent metadata tracking fetch history, success rates, and DoS protection state. This enables per-partner tuning and isolated error handling.

Cache States and Decision Logic

Cache lookups follow a three-state model with different response strategies:

  
  
    
case :ets.lookup(:jwks_cache, {partner_id, kid}) do
  [{_, jwk, cached_at, ttl}] ->
    age = now() - cached_at

    cond do
      # FRESH: Within TTL, return immediately (~100μs)
      age < ttl ->
        {:ok, jwk}

      # STALE: Expired but within 24h grace period
      # Return cached key + trigger background refresh
      age < 86_400 ->
        spawn(fn -> refresh_jwks(partner_id) end)
        {:ok, jwk}  # Don't block request!

      # TOO STALE: Must refresh, block on HTTP call
      true ->
        fetch_and_cache_jwks(partner_id)
        retry_lookup(partner_id, kid)
    end

  [] ->
    # Unknown kid - apply DoS protection before fetching
    handle_cache_miss_with_protection(partner_id, kid)
end

    
  

The stale-while-revalidate pattern (second branch) is critical for availability. When a partner's JWKS endpoint experiences an outage, their cached keys remain usable for up to 24 hours, preventing legitimate transaction rejections. The background refresh ensures automatic recovery when the endpoint returns to service.

Note that when we spawn(fn -> refresh_jwks(partner_id) end), we deliberately use spawn/1 rather than spawn_link/1: we don't want failures in JWKS fetching to bring down the parent process.

Important: The fetch_and_cache_jwks/1 call includes critical safeguards: 5-second HTTP timeout (receive_timeout: 5000), DoS protection layers (circuit breaker, rate limiting, debouncing via handle_cache_miss_with_protection/3), and error handling. The "TOO STALE" path (cache > 24h old) should be rare in healthy systems - it primarily handles cold starts or extended partner outages. See the implementation for complete timeout and retry configuration.

The Unknown Kid Challenge

The cache miss path (final branch) requires special handling. When a JWS references a kid not in cache, we must fetch fresh JWKS to check if the key legitimately exists. However, this creates a DoS vulnerability: an attacker can send JWS signatures with randomly-generated kid values, forcing repeated JWKS fetches. Each malicious request triggers an HTTP call to the partner's endpoint, amplifying the attack 100-1000x.

This vulnerability requires defense-in-depth protections, covered in the next section.

Implementation Reference

See the demo app for complete code including:

  • Cache state handling with age-based decisions here
  • DoS protection for unknown kid attacks here
  • Graduated alerting for stale grace period here
  • Emergency cache purge for security incidents here

The implementation demonstrates production patterns including telemetry integration, circuit breakers, rate limiting, and operational safeguards for financial services.

DoS Protection: Safeguarding Against Unknown Kid Attacks

The Vulnerability

Consider what happens when signature verification receives a JWS with an unknown kid value:

  
  
    
def verify_signature(jws, partner_id) do
  {:ok, kid} = extract_kid(jws)  # Attacker controls this!
  {:ok, jwk} = get_key(partner_id, kid)  # Cache miss = HTTP fetch
  JOSE.JWS.verify_strict(jwk, ["ES256"], jws)
end

    
  

The attack: An attacker sends 1,000 requests/second, each with a different randomly-generated kid value like "attack-00001", "attack-00002", etc. Your cache has never seen these keys, so each request triggers a fresh JWKS fetch to the partner's endpoint. Result: 1,000 outbound HTTP requests/second, overwhelming either your HTTP client connection pool or the partner's JWKS endpoint.

Attack characteristics:

  • Amplification: every cheap malicious request forces an expensive outbound HTTP fetch (50-200ms each)
  • Resource exhaustion: Connection pool depletion, partner endpoint overload
  • Plausible deniability: Requests look legitimate (valid JWS structure, real partner ID)
  • Multi-target: Attacker can rotate through different partner IDs to evade per-partner limits

Three-Layer Defense Strategy

Protection requires multiple safeguards working together. Each layer addresses a different attack dimension:

Layer 1: Debouncing (60-second window)
Once we fetch JWKS for a partner, don't fetch again for 60 seconds regardless of how many unknown kids are requested. This eliminates the amplification factor.

  
  
    
defp check_recent_fetch(partner_id, now) do
  metadata_key = {partner_id, :metadata}

  case :ets.lookup(:jwks_cache, metadata_key) do
    [{^metadata_key, metadata}] ->
      age = now - metadata.last_fetch_at

      if age < 60 do  # Debounce window
        {:debounced, age}  # Too recent, reject fetch
      else
        :ok  # Enough time passed, allow fetch
      end

    [] ->
      :ok  # No prior fetch, allow
  end
end

    
  

Effect: Attacker can trigger at most 1 JWKS fetch per 60 seconds per partner, regardless of request volume. 1,000 requests with different kids → 1 HTTP fetch.

Tuning: Increase window to 120s for partners with slow JWKS endpoints. Decrease to 30s for partners with frequent legitimate key rotations.

Layer 2: Rate Limiting (10 attempts/minute)
Track how many times unknown kids are requested per partner per minute. Block requests exceeding threshold.

  
  
    
defp check_unknown_kid_rate_limit(partner_id, now) do
  rate_limit_key = {partner_id, :rate_limit}
  window = 60  # 1 minute

  case :ets.lookup(:jwks_cache, rate_limit_key) do
    [{^rate_limit_key, count, window_start}]
        when now - window_start < window ->
      if count >= 10 do  # Rate limit threshold
        :telemetry.execute(
          [:jwks_cache, :rate_limit_exceeded],
          %{attempts: count},
          %{partner_id: partner_id}
        )
        {:error, :rate_limited}
      else
        # Increment counter within same window
        :ets.update_counter(:jwks_cache, rate_limit_key, {2, 1})
        :ok
      end

    _ ->
      # Start new window
      :ets.insert(:jwks_cache, {rate_limit_key, 1, now})
      :ok
  end
end

    
  

Effect: Even if debouncing allows through some unknown kid requests, we won't process more than 10 per minute per partner. This limits attack impact during the debounce window.

Tuning: For high-volume partners actively rotating keys, increase to 20-30/min. For low-volume partners, decrease to 5/min.

Layer 3: Circuit Breaker (5 consecutive failures)
Track consecutive unknown kid rejections. After threshold, open circuit and reject all requests for that partner until manual review.

  
  
    
defp check_circuit_breaker(partner_id) do
  metadata_key = {partner_id, :metadata}

  case :ets.lookup(:jwks_cache, metadata_key) do
    [{^metadata_key, metadata}] ->
      if metadata.consecutive_unknown_kids >= 5 do  # Threshold
        :telemetry.execute(
          [:jwks_cache, :circuit_breaker_open],
          %{consecutive_unknown_kids: metadata.consecutive_unknown_kids},
          %{partner_id: partner_id}
        )

        Logger.error("""
        Circuit breaker OPEN for #{partner_id}
        Too many consecutive unknown kids (#{metadata.consecutive_unknown_kids})
        Manual review required
        """)

        {:error, :circuit_breaker_open}
      else
        :ok
      end

    [] ->
      :ok  # No metadata yet
  end
end

    
  

Effect: Sustained attacks trigger circuit breaker, halting all requests until ops investigates. Prevents prolonged resource consumption. Counter resets on any successful cache hit.

Tuning: For partners with frequent key rotation issues, increase to 10-15 consecutive failures. For critical partners, decrease to 3 to trigger investigation faster.
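
For orientation, here is a hedged sketch of the counter helpers the snippets above rely on. increment_unknown_kid_counter/1 is referenced by the debounced path; the reset helper's name and the metadata defaults are assumptions rather than the demo app's exact code.

defp increment_unknown_kid_counter(partner_id) do
  metadata_key = {partner_id, :metadata}

  # Load existing metadata or start from a zeroed record (shape assumed)
  metadata =
    case :ets.lookup(:jwks_cache, metadata_key) do
      [{^metadata_key, metadata}] -> metadata
      [] -> %{last_fetch_at: 0, last_fetch_success: false, consecutive_unknown_kids: 0}
    end

  updated = %{metadata | consecutive_unknown_kids: metadata.consecutive_unknown_kids + 1}
  :ets.insert(:jwks_cache, {metadata_key, updated})

  :telemetry.execute(
    [:jwks_cache, :unknown_kid_incremented],
    %{consecutive_count: updated.consecutive_unknown_kids},
    %{partner_id: partner_id}
  )
end

# Called on every successful cache hit, so legitimate traffic closes the circuit
defp reset_unknown_kid_counter(partner_id) do
  metadata_key = {partner_id, :metadata}

  case :ets.lookup(:jwks_cache, metadata_key) do
    [{^metadata_key, metadata}] ->
      :ets.insert(:jwks_cache, {metadata_key, %{metadata | consecutive_unknown_kids: 0}})

    [] ->
      :ok
  end
end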

How the Layers Work Together

  
  
    
defp handle_cache_miss_with_protection(partner_id, kid, state) do
  now = System.system_time(:second)

  with :ok <- check_circuit_breaker(partner_id),
       :ok <- check_unknown_kid_rate_limit(partner_id, now),
       :ok <- check_recent_fetch(partner_id, now) do
    # All protections passed - safe to fetch JWKS
    Logger.info("All protection checks passed for #{partner_id}/#{kid}, fetching JWKS")
    handle_cache_miss(partner_id, kid, state, now)
  else
    {:error, :circuit_breaker_open} = error ->
      Logger.error("Circuit breaker OPEN for #{partner_id}, rejecting kid: #{kid}")
      {:reply, error, state}

    {:error, :rate_limited} = error ->
      Logger.error("Rate limit EXCEEDED for #{partner_id}, rejecting kid: #{kid}")
      {:reply, error, state}

    {:debounced, age} ->
      # We just fetched JWKS, kid truly doesn't exist
      Logger.warning("Unknown kid #{kid}, JWKS fetched #{age}s ago")
      increment_unknown_kid_counter(partner_id)
      {:reply, {:error, :kid_not_found_in_jwks}, state}
  end
end

    
  

Complete implementation here with full logging and telemetry integration.

Attack scenario walkthrough:

  1. t=0s: Attacker sends 100 requests with unknown kids
  2. The first request passes all checks and triggers a JWKS fetch; the kid isn't in the fetched JWKS, so the request is rejected
  3. Subsequent requests pass the circuit breaker and rate limit checks, but debouncing blocks any further JWKS fetch. Each is rejected as kid-not-found and the unknown-kid counter increments
  4. After 5 consecutive unknown kids, the circuit breaker opens
  5. All remaining requests are rejected immediately, until manual review
  6. Even if the breaker threshold were set higher, the rate limit (10 unknown-kid attempts/minute) would still cap processing within the debounce window
Result: 100 attack requests → 1 JWKS fetch, then complete blocking. Attack neutralized.

Implementation Details

The three-layer defense strategy relies on per-partner state tracking and comprehensive observability. Let's examine how metadata is stored and how telemetry provides visibility into DoS attacks.

Each partner's protection state is tracked in ETS metadata:

  
  
    
# Stored in ETS as: {{partner_id, :metadata}, metadata}
metadata = %{
  # Last time we fetched JWKS (unix seconds)
  last_fetch_at: 1704470400,

  # Whether last fetch succeeded
  last_fetch_success: true,

  # Circuit breaker counter (resets on cache hit)
  consecutive_unknown_kids: 0
}

# Separate rate limit counter:
# {{partner_id, :rate_limit}, count, window_start}

    
  

Metadata updates occur at:

  • JWKS fetch: Update last_fetch_at and last_fetch_success (code)
  • Cache hit: Reset consecutive_unknown_kids to 0 (code)
  • Unknown kid: Increment consecutive_unknown_kids (code)
  • Unknown kid request: Increment rate limit counter (code)

The DoS protection system emits telemetry events for observability:

  
  
    
# Unknown kid rejected after debouncing
:telemetry.execute(
  [:jwks_cache, :unknown_kid_rejected],
  %{age_since_fetch: 30},  # Seconds since last fetch
  %{partner_id: "partner_abc", kid: "attack-00001"}
)

# Rate limit exceeded
:telemetry.execute(
  [:jwks_cache, :rate_limit_exceeded],
  %{attempts: 15},  # Total attempts in window
  %{partner_id: "partner_abc"}
)

# Circuit breaker opened
:telemetry.execute(
  [:jwks_cache, :circuit_breaker_open],
  %{consecutive_unknown_kids: 5},
  %{partner_id: "partner_abc"}
)

# Counter incremented (helps track trends)
:telemetry.execute(
  [:jwks_cache, :unknown_kid_incremented],
  %{consecutive_count: 3},
  %{partner_id: "partner_abc"}
)

    
  

Recommended alerts:

  • WARNING: Rate limit exceeded 3+ times in 5 minutes (possible attack)
  • ERROR: Circuit breaker opened (requires manual review)
  • INFO: Unknown kid counter > 3 (monitor for patterns)

Integrate telemetry with Datadog, New Relic, or your observability platform to visualize attack patterns and tune thresholds.

Tuning Recommendations

Adjust protection thresholds based on partner characteristics:

| Partner Type | Debounce | Rate Limit | Circuit Breaker | Rationale |
| --- | --- | --- | --- | --- |
| High-volume, stable | 60s | 10/min | 5 consecutive | Standard baseline |
| High-volume, frequent rotation | 30s | 20/min | 10 consecutive | Allow legitimate key churn |
| Low-volume | 120s | 5/min | 3 consecutive | Stricter limits, fewer false positives |
| Slow JWKS endpoint | 180s | 10/min | 5 consecutive | Reduce load on partner infrastructure |
| Development/staging | 10s | 50/min | 20 consecutive | Lenient for testing |

Configuration approach: Store thresholds in partner configuration (database) rather than hardcoding. This enables per-partner tuning and A/B testing of threshold adjustments.
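
As a sketch of that approach: the protection checks read their thresholds from the partner record, falling back to the baseline row of the table above. The field names and the get_partner_config/1 return shape are assumptions, not the demo app's schema.

defp protection_thresholds(partner_id) do
  # Partner row loaded from the partner_configs table (or an ETS-backed copy)
  {:ok, config} = get_partner_config(partner_id)

  %{
    debounce_window: config.debounce_window || 60,
    rate_limit: config.rate_limit || 10,
    circuit_breaker_threshold: config.circuit_breaker_threshold || 5
  }
end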

Implementation Checklist

Before deploying DoS protection to production:

  • Configure debounce window per environment (60s production, 10s staging)
  • Set rate limit thresholds based on partner risk tier (reference table above)
  • Connect telemetry events to monitoring system (Datadog, New Relic, Prometheus)
  • Create runbook for circuit breaker alerts (who investigates, escalation path)
  • Test DoS protection with load testing tool (send 1000 req/s with random kids)
  • Document tuning decisions in ADR (Architecture Decision Record)
  • Set up dashboards showing: rate limit hits, circuit breaker events, unknown kid trends
  • Configure alerts with appropriate severity levels (warning vs error)
  • Test circuit breaker reset procedure (manual counter reset)
  • Review protection effectiveness quarterly, adjust thresholds based on data

Key Takeaways

Unknown kid attacks exploit the cache miss path, forcing repeated JWKS fetches. A single protection mechanism isn't sufficient:

  • Debouncing eliminates amplification (1000 requests → 1 fetch)
  • Rate limiting contains burst attacks within the debounce window
  • Circuit breakers halt sustained attacks and trigger human investigation

The three layers combine to provide defense-in-depth: most attacks blocked by debouncing, sophisticated attacks caught by rate limiting, coordinated attacks halted by circuit breaker.

Remember: These protections are distinct from the grace period security safeguards (covered below). DoS protection addresses attack amplification via cache misses, while grace period security addresses compound attacks exploiting stale key usage during partner endpoint outages. Both are essential for production financial services.

Key Lookup Strategy

Database vs. Cache vs. ETS:

For 200+ partners, you need fast lookups:

  
  
    
# Strategy: Tiered lookup
# 1. ETS (microseconds) - Hot partners
# 2. Database (milliseconds) - Configuration
# 3. HTTP (50-200ms) - JWKS fetch

def verify_multi_tenant(jws, partner_id) do
  with {:ok, config} <- get_partner_config(partner_id), # DB/ETS
       true <- config.active,
       {:ok, kid} <- extract_kid(jws),
       {:ok, key} <- PartnerJWKSCache.get_key(partner_id, kid),
       {:ok, claims} <-
            verify_signature(jws, key, config.allowed_algorithms),
       :ok <- validate_claims(claims, config.clock_skew_tolerance) do
    {:ok, claims}
  else
    {:error, reason} -> {:error, reason}
    false -> {:error, :partner_inactive}
  end
end

    
  

Performance characteristics:

  • Partner config lookup: ~100μs (ETS) or ~1-2ms (PostgreSQL with connection pool)
  • JWKS key lookup: ~100μs (cache hit) or ~50-200ms (cache miss, HTTP fetch)
  • Signature verification: ~1-5ms (ES256)
  • Total: ~1-7ms for cache hit, ~50-210ms for cache miss

See code for details on cache hit optimization and TTL strategy (fresh cache ~100μs, stale-while-revalidate pattern, DoS protection on cache miss).

Monitoring Multi-Tenant Key Operations

Metrics you need:

  
  
    
# Per-partner metrics
:telemetry.execute([:partner_jwks, :fetch], %{duration: duration}, %{
  partner_id: partner_id,
  result: result,        # :success or :failure
  cache_hit: cache_hit?  # true or false
})

:telemetry.execute([:partner_signature, :verify], %{duration: duration}, %{
  partner_id: partner_id,
  kid: kid,
  result: result,        # :success or :failure
  failure_reason: reason
})

# Aggregate metrics (periodic snapshot across all partners)
:telemetry.execute([:jwks, :cache], %{
  total_partners: 200,
  cached_partners: 185,
  stale_partners: 10,
  failed_partners: 5
}, %{})

    
  

Dashboard queries:

  • Which partners have the highest verification failure rates?
  • Which partners' JWKS endpoints are frequently unavailable?
  • Which kid values are most commonly used per partner?
  • Are any partners using disallowed algorithms?

Partner Onboarding Checklist

Before adding a new partner to production:

  • Partner provides JWKS URL: https://auth.issuer-abc.example/.well-known/jwks.json
  • Verify JWKS is accessible and returns valid JSON
  • Identify which algorithms they use (ES256, RS256, etc.)
  • Partner provides sample signed request for testing
  • Verify we can validate their sample signature
  • Add partner configuration to database (schema sketch after this checklist):
    • partner_id
    • domain
    • allowed_algorithms
    • clock_skew_tolerance (default: 300s)
  • Deploy configuration to staging
  • Run contract tests with partner's test environment
  • Partner validates we can verify requests they sign
  • Set up monitoring alerts for this partner
  • Document partner-specific operational notes (e.g., "rotates keys monthly")
  • Add partner to production with jws_enforcement_mode: :warning
  • Monitor for 7 days in warning mode
  • Switch to jws_enforcement_mode: :enforced
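
For reference, a hedged sketch of what that configuration row can look like as an Ecto schema. Field names follow the checklist above plus the DoS-tuning fields discussed earlier; the module itself is an assumption, not the demo app's actual schema.

defmodule JWSDemo.PartnerConfig do
  use Ecto.Schema

  schema "partner_configs" do
    field :partner_id, :string
    field :domain, :string
    field :jwks_url, :string
    field :allowed_algorithms, {:array, :string}, default: ["ES256"]
    field :clock_skew_tolerance, :integer, default: 300
    field :jws_enforcement_mode, Ecto.Enum, values: [:warning, :enforced], default: :warning
    field :active, :boolean, default: true

    # Optional per-partner DoS tuning (see the threshold table in the DoS section)
    field :debounce_window, :integer, default: 60
    field :rate_limit, :integer, default: 10
    field :circuit_breaker_threshold, :integer, default: 5

    timestamps()
  end
end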

Scaling Considerations

For 500+ partners:

JWKS fetch parallelization:

  
  
    
# Warm cache for all partners on startup
def warm_cache() do
  Repo.all(PartnerConfig)
  |> Task.async_stream(&fetch_and_cache_jwks/1, max_concurrency: 50, timeout: 10_000)
  |> Stream.run()
end

    
  

Full cache warming implementation here with error handling, success/failure tracking, and telemetry integration.

Circuit breakers per partner:

  
  
    
# Don't let one partner's attacks affect others
# Circuit opens after 5 consecutive unknown kids
def check_circuit_breaker(partner_id) do
  metadata = get_partner_metadata(partner_id)

  if metadata.consecutive_unknown_kids >= 5 do
    {:error, :circuit_breaker_open}
  else
    :ok
  end
end

    
  

Circuit breaker implementation here (threshold-based, per-partner state, automatic telemetry).

Database indexes:

  
  
    
CREATE INDEX idx_partner_configs_partner_id ON partner_configs(partner_id);
CREATE INDEX idx_partner_configs_active ON partner_configs(active) WHERE active = true;

    
  

Essential for fast partner config lookups when warming cache or handling requests for 500+ partners.

ETS table optimization:

  
  
    
# Single ETS table with read_concurrency for multi-core systems
:ets.new(:jwks_cache, [:set, :public, :named_table, read_concurrency: true])

    
  

See the code for the ETS table initialization, which uses read_concurrency: true for efficient concurrent reads across multiple BEAM schedulers.

What Can Go Wrong

Partner rotates all keys simultaneously

Partner removes old key from JWKS before your cache expires. Requests with old kid fail.

Mitigation:

  • Educate partners on zero-downtime rotation (overlap period)
  • Use stale cache as fallback (covered in next section)
  • Alert when unknown kid spike occurs

Partner JWKS has 100+ keys

Some large institutions never remove old keys. JWKS becomes 500KB.

Mitigation:

  • Limit JWKS fetch size (reject responses >1MB; see the fetch sketch after this list)
  • Cache only keys seen in last 90 days
  • Work with partner to clean up old keys
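
A sketch of the size check (the 1MB threshold matches the bullet above; error atoms and the helper name are assumptions): fetch the raw body with decoding disabled, reject oversized responses, then parse.

def fetch_jwks_with_size_limit(jwks_url) do
  with {:ok, %Req.Response{status: 200, body: raw}} <-
         Req.get(jwks_url, decode_body: false, receive_timeout: 5_000),
       true <- byte_size(raw) <= 1_000_000,
       {:ok, jwks} <- Jason.decode(raw) do
    {:ok, jwks}
  else
    false -> {:error, :jwks_too_large}
    {:ok, %Req.Response{status: status}} -> {:error, {:unexpected_status, status}}
    {:error, reason} -> {:error, reason}
  end
end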

Two partners share a JWKS endpoint

Partner A and Partner B are subsidiaries of the same company and publish JWKS at the same URL.

Solution:

  
  
    
%PartnerConfig{
  partner_id: "partner_a",
  jwks_url: "https://shared.example/.well-known/jwks.json",
  allowed_kids: ["partner-a-2025-01", "partner-a-2025-02"]  # Filter by kid
}

%PartnerConfig{
  partner_id: "partner_b",
  jwks_url: "https://shared.example/.well-known/jwks.json",
  allowed_kids: ["partner-b-2025-01", "partner-b-2025-02"]
}

    
  

Multi-tenancy adds operational complexity, but it's the reality of payment scheme APIs. The alternative (i.e., managing keys for 200 partners manually) is far worse.

Security Safeguards: Protecting the Grace Period

Why We Need a Grace Period: The Cost of Blocking Legitimate Transactions

Before discussing security risks, let's understand why the 24-hour stale grace period exists and why simply removing it isn't an option for financial services.

The Business Problem: Technical Hiccups Cost Money

In financial services, blocking legitimate transactions has real financial consequences:

  • Lost revenue: Partner can't process €10M in daily payment volume during your 2-hour JWKS cache outage
  • SLA penalties: Contracts often include 99.9% uptime requirements with financial penalties for breaches
  • Regulatory impact: Payment Service Providers must maintain operational continuity under PSD2
  • Customer churn: Merchants switch payment providers after repeated authorization failures
  • Reputational damage: "Payment gateway DOWN" trending on social media damages both parties

Common Failure Scenarios

JWKS endpoints can become temporarily unavailable for many non-malicious reasons:

  • Network partition: ISP routing issue between your datacenter and partner's JWKS endpoint (30 minutes to 2 hours)
  • DNS propagation: Partner moves JWKS endpoint, DNS takes time to propagate globally (15-60 minutes)
  • Certificate renewal: Partner's TLS certificate expires, they're scrambling to renew (1-4 hours)
  • Cloud provider outage: Partner's CDN has regional outage affecting JWKS delivery (2-8 hours)
  • DDoS mitigation: Partner's Web Application Firewall blocks legitimate traffic during attack response (variable duration)
  • Deployment gone wrong: Partner's infrastructure change breaks JWKS endpoint, rollback needed (30 minutes to 4 hours)
  • Rate limiting: Your cache refresh hits partner's rate limits during high traffic (until rate limit window resets)

The Strict Approach: Fail Immediately on Cache Miss

Without a grace period, here's what happens when partner's JWKS endpoint is unreachable:

  
  
    
def get_key(partner_id, kid) do
  with {:ok, jwk, cached_at} <- fetch_from_cache(partner_id, kid),
       true <- fresh?(cached_at) do
    # Fresh cache hit
    {:ok, jwk}
  else
    # Expired or not cached - must fetch fresh JWKS
    _ ->
      case fetch_fresh_jwks(partner_id) do
        {:ok, jwks} -> cache_and_return(jwks, kid)
        # FAIL IMMEDIATELY - no grace period fallback
        {:error, _} -> {:error, :jwks_unavailable}
      end
  end
end

    
  

Impact of immediate failure:

  • All requests rejected with 401 Unauthorized until JWKS endpoint recovers
  • Cached keys that are still cryptographically valid are discarded
  • Legitimate transactions blocked due to infrastructure issues unrelated to security
  • Partner's business stops until their ops team fixes the endpoint

Example: The 2-Hour Outage

Let's walk through a real scenario:

  1. 10:00 AM: Partner's infrastructure team deploys a configuration change
  2. 10:05 AM: JWKS endpoint returns 503 Service Unavailable (misconfigured load balancer)
  3. 10:15 AM: Your cache expires (15-minute TTL), attempts refresh, gets 503
  4. Without grace period:
    • 10:15-12:00: All partner requests rejected (1h 45m of downtime)
    • €300K in payment authorizations blocked
    • Customers calling merchants: "Why isn't my payment working?"
    • Merchants calling partner: "Your payment gateway is down!"
    • Partner calling you: "Why are you rejecting our valid requests?"
  5. With 24-hour grace period:
    • 10:15-12:00: Stale keys still work, all transactions succeed
    • Zero customer impact
    • Ops team has time to investigate (not emergency page at 3 AM)
    • Partner fixes endpoint by 12:00, next cache refresh succeeds

The Grace Period Solution: Stale-While-Revalidate

The grace period implements the stale-while-revalidate caching pattern:

  
  
    
def get_key(partner_id, kid) do
  case fetch_from_cache(partner_id, kid) do
    {:ok, jwk, cached_at} ->
      age = now() - cached_at

      cond do
        # FRESH: Return immediately
        age < 15 * 60 ->
          {:ok, jwk}

        # STALE BUT USABLE: Return cached + trigger background refresh
        age < 24 * 60 * 60 ->
          spawn(fn -> refresh_cache(partner_id) end)  # Don't block!
          {:ok, jwk}  # Still works during partner outage

        # TOO STALE: Must refresh
        true ->
          case fetch_fresh_jwks(partner_id) do
            {:ok, jwks} -> cache_and_return(jwks, kid)
            {:error, _} -> {:error, :jwks_unavailable}
          end
      end
  end
end

    
  

Benefits:

  • Zero downtime for temporary partner outages (less than 24 hours)
  • Legitimate transactions continue even when JWKS endpoint is unreachable
  • Automatic recovery when endpoint comes back online
  • Non-blocking refresh attempts in background
  • Graceful degradation rather than hard failure

Why 24 hours specifically?

  • Covers most infrastructure outages (network issues, deployments, certificate problems)
  • Gives ops teams time to respond during business hours (no 3 AM emergency pages for transient issues)
  • Accounts for global timezones (partner's ops team might be asleep during your peak traffic)
  • Allows for vendor support ticket resolution times
  • Balances availability (long grace period) with security (not infinite)

The Security Risk: Compound Attack Scenarios

Now that we understand why the grace period is essential, we must address the security risk it creates.

The grace period assumes that stale keys are still trustworthy. This is true for infrastructure outages, but false during security incidents.

The Compound Attack Vector

A sophisticated attacker can exploit the grace period by executing a compound attack:

  1. Compromise Partner A's private key
    • Phishing attack on partner's ops team
    • Malware on partner's key management system
    • Insider threat (disgruntled employee)
    • Supply chain attack on partner's infrastructure
  2. Simultaneously launch DoS attack on Partner A's JWKS endpoint
    • Volumetric attack (saturate bandwidth)
    • Application-layer attack (exhaust server resources)
    • DNS attack (make endpoint unreachable)
  3. Your system behavior:
    • JWKS refresh attempts fail (endpoint unreachable)
    • Grace period activates (use stale cached keys)
    • Cached keys include the now-compromised key's public counterpart
  4. Attacker signs malicious payloads
    • Payment authorizations for attacker-controlled accounts
    • Fraudulent transaction instructions
    • Account modifications or data exfiltration
  5. You accept them as valid
    • Signature verifies correctly (attacker has the private key!)
    • Grace period allows stale keys (can't fetch fresh JWKS to see revocation)
    • Attack window: Up to 24 hours

Why This Attack is Realistic

This isn't theoretical - it combines two common attack patterns:

  • Key compromise: Happens regularly (phishing, malware, insider threats)
  • DoS attacks: Common and easy to execute (botnets, amplification attacks)

A sophisticated attacker (nation-state, organized crime) can execute both.

Attack Window Analysis

| Scenario | Attack Window | Availability Impact | Security Risk |
| --- | --- | --- | --- |
| No grace period | ~15 minutes (TTL) | HIGH (blocks legitimate traffic) | LOW (short window) |
| 24-hour grace period | Up to 24 hours | LOW (maintains availability) | HIGH (long window) |
| Grace period + safeguards | Minutes to hours (human response time) | LOW (maintains availability) | LOW (human verification) |

For financial services handling irreversible transactions, a 24-hour attack window is unacceptable without safeguards.

Defense in Depth: Human-in-the-Loop Verification

The solution is not to eliminate the grace period (that sacrifices availability), but to add monitoring, alerting, and manual override capabilities.

Core principle: Use the grace period for technical issues (automatic), but enable human intervention for security incidents (manual override).

Safeguard 1: Graduated Alerting

Alert ops teams when cache enters stale grace period, with severity escalation based on cache age:

| Cache Age | Severity | Response | Rationale |
| --- | --- | --- | --- |
| < 1 hour | WARNING | Ops notified via Slack/email | Likely transient issue, monitor |
| 1-4 hours | ERROR | Escalate to senior ops | Extended outage, needs investigation |
| 4-12 hours | CRITICAL | Page on-call engineer | Unusual duration, possible incident |
| ≥ 12 hours | EMERGENCY | Executive notification | Approaching grace period limit |

Implementation:

  
  
    
# In jwks_cache.ex
defp alert_stale_cache(partner_id, kid, age_seconds, cached_at) do
  severity = cond do
    age_seconds < 3600 -> :warning        # < 1 hour
    age_seconds < 14_400 -> :error        # < 4 hours
    age_seconds < 43_200 -> :critical     # < 12 hours
    true -> :emergency                    # >= 12 hours
  end

  # Emit telemetry for monitoring systems
  :telemetry.execute(
    [:jwks_cache, :stale_grace_period],
    %{age_seconds: age_seconds},
    %{
      partner_id: partner_id,
      kid: kid,
      severity: severity,
      cached_at: cached_at
    }
  )

  # Log actionable alert
  if severity in [:critical, :emergency] do
    Logger.log(severity, """
      JWKS CACHE STALE GRACE PERIOD ACTIVE
    Partner: #{partner_id}
    Key ID: #{kid}
    Cache age: #{format_duration(age_seconds)}
    Severity: #{severity}
    Last cached: #{format_timestamp(cached_at)}

    ACTION REQUIRED:
    1. Contact partner security team (out-of-band, use phone)
    2. Verify JWKS endpoint status (technical vs security incident)
    3. If key compromise confirmed: JWKSCache.emergency_purge("#{partner_id}", "your_name", "reason")
    4. Document in incident ticket
    """)
  end
end

    
  

Telemetry integration:

  
  
    
# In telemetry.ex - Connect to Datadog, New Relic, etc.
:telemetry.attach(
  "jwks-stale-alert",
  [:jwks_cache, :stale_grace_period],
  &handle_stale_cache_alert/4,
  nil
)

defp handle_stale_cache_alert(_event, measurements, metadata, _config) do
  if metadata.severity in [:critical, :emergency] do
    # Send to PagerDuty
    PagerDuty.trigger_incident(%{
      title: "JWKS cache stale for #{metadata.partner_id}",
      severity: metadata.severity,
      details: %{
        age_seconds: measurements.age_seconds,
        partner_id: metadata.partner_id,
        kid: metadata.kid
      }
    })

    # Send to Datadog
    Datadog.event(%{
      title: "JWKS Stale Grace Period",
      text: "Partner #{metadata.partner_id} cache is #{measurements.age_seconds}s old",
      alert_type: "error",
      tags: ["partner:#{metadata.partner_id}", "severity:#{metadata.severity}"]
    })
  end
end

    
  

Safeguard 2: Out-of-Band Verification

When alert fires, ops must verify the situation with the partner out-of-band (outside potentially compromised channels).

Incident Response Runbook:

  1. Receive alert: "JWKS cache stale for partner_abc (5 hours)"
  2. Check partner status page: Any announced maintenance or incidents?
  3. Contact partner security team:
    • Method: Phone call (NOT email - email could be compromised)
    • Use: Emergency contact list (pre-arranged, verified contacts)
    • Verify identity: Challenge questions or callback to known number
  4. Ask these questions:
    • "Are you experiencing JWKS endpoint issues?"
    • "Are you aware of any ongoing maintenance or deployment?"
    • "Have you detected any security incidents in the last 24 hours?"
    • "Have you rotated or revoked any signing keys recently?"
    • "Should we continue processing requests or halt?"
  5. Decision tree:
    • Partner confirms technical issue (deployment, DNS, network, etc.)
      • Continue processing
      • Monitor situation
      • Document in ticket
      • Follow up in 2h
    • Partner confirms security incident (key compromise, breach, attack)
      • EMERGENCY PURGE
      • Halt processing
      • Executive alert
      • Incident response
    • Unable to reach partner (no answer, out of office, etc.)
      • Escalate to senior
      • Reduce limits 50%
      • Keep trying every 30 minutes

Safeguard 3: Emergency Cache Purge

The emergency_purge/3 function provides a manual override for security incidents where a partner confirms private key compromise or security breach. When invoked, it immediately removes all cached keys for that partner. The next request triggers a fresh JWKS fetch, forcing validation against current keys.

The function requires senior ops approval and creates a complete audit trail including structured logs (for Security Information and Event Management (SIEM) ingestion), telemetry events (for monitoring alerts), and permanent compliance records with timestamp, operator, and business justification. If the partner's JWKS endpoint is still unavailable post-purge, requests fail closed with 401 (i.e., security over availability). Multi-tenant isolation ensures other partners remain unaffected.
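
For orientation, a minimal sketch of the purge itself (the demo app's actual implementation adds the authorization check and richer telemetry; the match specs assume the ETS entry shapes shown earlier, and log_cache_purge/4 is shown in the next safeguard):

def emergency_purge(partner_id, operator, reason) do
  # Delete every entry belonging to this partner: key entries, metadata,
  # and rate-limit counters. Other partners' entries are untouched.
  purged_count =
    :ets.select_delete(:jwks_cache, [
      {{{partner_id, :_}, :_, :_, :_}, [], [true]},  # {{partner_id, kid}, jwk, cached_at, ttl}
      {{{partner_id, :_}, :_}, [], [true]},          # {{partner_id, :metadata}, metadata}
      {{{partner_id, :_}, :_, :_}, [], [true]}       # {{partner_id, :rate_limit}, count, window_start}
    ])

  log_cache_purge(partner_id, operator, reason, purged_count)

  :telemetry.execute(
    [:jwks_cache, :emergency_purge],
    %{purged_keys: purged_count},
    %{partner_id: partner_id, operator: operator, reason: reason}
  )

  {:ok, purged_count}
end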

Safeguard 4: Audit Trail

All emergency purges logged for regulatory compliance and post-incident review:

  
  
    
defp log_cache_purge(partner_id, operator, reason, purged_count) do
  # Structured log for SIEM ingestion
  Logger.warning(
    "JWKS cache emergency purge executed",
    event: "jwks_cache_purge",
    partner_id: partner_id,
    operator: operator,
    reason: reason,
    purged_keys: purged_count,
    timestamp: DateTime.utc_now() |> DateTime.to_iso8601()
  )

  # also write to dedicated audit table
  Audit.log_operational_event(%{
    event_type: "jwks_cache_purge",
    partner_id: partner_id,
    operator: operator,
    reason: reason,
    metadata: %{purged_keys: purged_count}
  })
end

    
  

Audit trail captures:

  • Timestamp (when purge executed)
  • Operator identity (who executed it)
  • Partner ID (which partner affected)
  • Reason/justification (why it was necessary)
  • Number of keys purged (scope of action)
  • Incident reference (traceability to incident management system)

Supports:

  • Regulatory compliance: SOX, PCI-DSS, PSD2 require audit of security actions
  • Post-incident review: Timeline reconstruction for root cause analysis
  • Accountability: Clear record of who made what decision when
  • Pattern detection: Identify partners with frequent incidents

Additional Mitigations

Risk-Based Grace Periods

Vary grace period duration based on partner transaction volume:

  
  
    
defp get_grace_period(partner_id) do
  case get_partner_risk_profile(partner_id) do
    :high_volume   -> 7_200   # 2 hours  (€10M+ daily volume)
    :medium_volume -> 28_800  # 8 hours  (€1M-10M daily)
    :low_volume    -> 86_400  # 24 hours (< €1M daily)
  end
end

    
  

Rationale:

  • High-volume partners: Shorter window (higher potential fraud amount, more monitoring)
  • Low-volume partners: Longer window (lower risk, less need for 24/7 ops response)
  • Trade-off: Balances security risk with operational burden

Reduced Limits During Grace Period

Automatically reduce transaction limits when using stale keys:

  
  
    
defp handle_stale_cache(partner_id, age_seconds, jwk) do
  # Reduce transaction limits during grace period
  if age_seconds > @ttl do
    RateLimiter.set_partner_limit(partner_id, :reduced)
    # reduce limits to 50% of normal
    TransactionLimits.reduce_max_amount(partner_id, 0.5)
  end

  {:ok, jwk}
end

    
  

Effect:

  • Reduces blast radius if compromise goes undetected
  • Automatic restoration when fresh keys loaded

Anomaly Detection

Monitor transaction patterns even when signatures are valid:

  • Unusual transaction amounts: Flag amounts >2 standard deviations from partner's average (see the sketch below)
  • Velocity anomalies: Partner suddenly processing 10x normal transaction volume
  • Geographic anomalies: Transactions from countries partner doesn't operate in
  • Time-of-day anomalies: Transactions at 3 AM when partner normally has zero activity
  • Beneficiary patterns: Same beneficiary account receiving many transactions

During grace period: Require manual approval for flagged transactions.
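
As a tiny illustration of the first check (the function name and the source of the per-partner statistics are assumptions): flag any amount more than two standard deviations from the partner's historical mean.

def flag_unusual_amount?(amount, %{mean: mean, std_dev: std_dev}) when std_dev > 0 do
  abs(amount - mean) > 2 * std_dev
end

def flag_unusual_amount?(_amount, _stats), do: false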

Is This Over-Engineered?

For financial services: Absolutely not.

Why this is appropriate:

  1. Cost of failure is enormous
    • Funds often unrecoverable once transferred
  2. Transactions are irreversible
    • Payment authorizations can't easily be rolled back
    • Money movement is permanent
    • Cross-border transfers especially hard to reverse
  3. Regulatory requirements mandate it
    • PSD2: Requires strong customer authentication and fraud monitoring
    • SOX: Requires controls over financial reporting systems
    • PCI-DSS: Requires monitoring of all access to systems
    • GDPR: Requires security incident response capabilities
  4. Infrastructure already exists
    • Financial institutions have 24/7 SOC (Security Operations Center)
    • Ops teams already on-call for production incidents
    • Partner contact lists already maintained for business continuity
    • Incident management systems (PagerDuty, ServiceNow) already deployed
    • SIEM systems (Splunk, Datadog) already ingesting logs
  5. Industry standard practice
    • Banks have similar protocols for partner certificate revocation
    • Payment networks (Visa, Mastercard) have similar incident response
    • SWIFT has extensive security controls for messaging

For non-financial services: May be excessive. Consider:

  • Transaction reversibility: Can you undo the action if fraud detected?
  • Fraud impact: What's the worst-case financial loss?
  • Regulatory requirements: Are you subject to financial services regulations?
  • Operational maturity: Do you have 24/7 ops teams?
  • Risk tolerance: What's your appetite for security vs availability trade-offs?

For a SaaS application with reversible actions and low fraud risk, a simple grace period without human-in-the-loop may suffice.

Implementation Checklist

P0 - Critical (implement before production):

  • Telemetry events for stale cache entry
  • Emergency purge function with authorization
  • Basic audit logging for purge operations
  • Incident response runbook documented

P1 - Important (implement within first month):

  • Alert routing to ops team (PagerDuty, Datadog, etc.)
  • Partner emergency contact list (verified phone numbers)
  • Test emergency purge in staging environment
  • Quarterly incident response drills

P2 - Enhancement (implement within first quarter):

  • Risk-based grace periods by partner volume
  • Reduced transaction limits during grace period
  • Anomaly detection integration
  • Automated partner outage detection (status page monitoring)

Ongoing:

  • Review and update partner contact list monthly
  • Test incident response quarterly (tabletop exercises)
  • Review purge audit logs monthly for patterns
  • Update runbook based on real incidents

Key Takeaways

  1. Grace periods are essential for financial services
    • Blocking transactions due to technical hiccups costs real money
    • Infrastructure outages are common and usually non-malicious
    • Availability requirements often have SLA penalties
  2. But grace periods create security risk
    • Compound attacks (key compromise + endpoint DoS) are realistic
    • 24-hour attack window is too long for irreversible transactions
    • Stale keys are trustworthy for technical issues, not security incidents
  3. Defense in depth is the answer
    • Automated grace period for technical issues (common case)
    • Human verification for potential security incidents (edge case)
    • Manual override capability when compromise confirmed
  4. Not over-engineering for financial services
    • Standard practice aligned with industry regulations
    • Infrastructure and processes already exist
    • Cost of implementing significantly lower than cost of one successful attack
  5. Test your incident response
    • Runbooks only work if practiced
    • Quarterly drills identify gaps
    • Update procedures based on real incidents

The stale-while-revalidate pattern is excellent engineering when combined with operational discipline. The pattern handles the common case (temporary outages gracefully), while safeguards handle the edge case (active security incidents with manual override).

This is the balance between availability and security that financial services require: automatic grace for technical issues, human judgment for security incidents.

Operational Best Practices

Running JWKS in production requires operational discipline:

1. Monitor Key Usage by kid

Track which keys are actively used (a counter sketch follows this list):

  • Count verifications per kid (exposes when old keys can be safely removed)
  • Alert on unknown kid values (potential attack or misconfiguration)
  • Expose metrics endpoint showing usage distribution across keys
  • Graph key usage over time to visualize rotation progress
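
One low-plumbing way to get those counts, sketched here as an assumption rather than the demo app's approach: hang an ETS counter off the [:partner_signature, :verify] telemetry event shown earlier.

# Count verifications per {partner_id, kid}
:ets.new(:kid_usage, [:set, :public, :named_table, write_concurrency: true])

:telemetry.attach(
  "kid-usage-counter",
  [:partner_signature, :verify],
  fn _event, _measurements, %{partner_id: partner_id, kid: kid}, _config ->
    :ets.update_counter(:kid_usage, {partner_id, kid}, {2, 1}, {{partner_id, kid}, 0})
  end,
  nil
)

# Reading the distribution for a metrics endpoint:
# :ets.tab2list(:kid_usage) |> Enum.sort_by(fn {_key, count} -> -count end)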

2. Alert on Verification Failures

Monitor verification failure rates:

  • Track success/failure ratio (alert if > 5% failure rate)
  • Log detailed failure reasons (invalid_signature, key_not_found, expired)
  • Spike in failures often indicates: rotation issues, clock drift, or attacks
  • Integrate with alerting systems (PagerDuty, Slack, etc.)

3. Automate Rotation Schedule

Don't wait for incidents - rotate proactively:

  • Schedule rotations every 90 days
  • Automate the rotation process:
    • Phase 1: Generate new key pair
    • Phase 2: Publish new key → wait 15 minutes (cache propagation)
    • Phase 3: Switch signing → wait for token expiry (typically 24 hours)
    • Phase 4: Remove old key
  • Log each phase completion for audit trail
  • Test rotation in staging first

4. Emergency Rotation Runbook

Have a documented procedure for emergency rotation:

  
  
    
EMERGENCY KEY ROTATION PROCEDURE

Scenario: Private key compromised, must rotate immediately

1. Generate new key pair
   $ openssl ecparam -name prime256v1 -genkey -noout -out new-private-key.pem
   $ openssl ec -in new-private-key.pem -pubout -out new-public-key.pem

2. Add new key to JWKS (Phase 2)
   - kid: emergency-2024-12-05-001
   - Deploy to JWKS endpoint
   - Verify: curl https://auth.yourcompany.com/.well-known/jwks.json

3. Wait 15 minutes (cache propagation)

4. Switch signing to new key (Phase 3)
   - Update signing service configuration
   - kid: emergency-2024-12-05-001
   - Deploy

5. Remove compromised key from JWKS (Phase 4)
   - DO NOT wait for token expiration in emergency
   - All tokens signed with compromised key will fail
   - Acceptable trade-off vs keeping compromised key active

6. Notify affected parties
   - Issuers may need to re-authenticate
   - Document incident for post-mortem

Expected Impact:
- Tokens signed with compromised key: Invalid (immediate)
- Users: Must re-authenticate
- Downtime: ~0 seconds (but users will see auth errors until re-auth)

    
  

Security Considerations

JWKS improves operational security, but you must protect it:

DO:

  • Serve over HTTPS only - Man-in-the-middle attacks on JWKS mean game over
  • Set appropriate caching headers - Cache-Control: public, max-age=600
  • Rate limit the endpoint - Protect against DoS (see previous post)
  • Monitor JWKS fetches - Unusual spikes might indicate attack
  • Version your kid values - Makes rotation tracking easier
  • Log key usage by kid - Identify when old keys can be removed
  • Implement grace period safeguards - Alerting and emergency purge capabilities (covered above)

DON'T:

  • Never include private keys in JWKS - Only public keys, ever
  • Never disable HTTPS verification - Even in development
  • Never trust kid from untrusted tokens - Validate against your JWKS
  • Never forget to remove old keys - Keep JWKS lean (fewer than 5 active keys)
  • Never embed JWKS URLs in JWTs - Attackers could point to their own JWKS
  • Never rely solely on grace period - Must have human-in-the-loop verification for security incidents

Replay Attack Prevention: The jti Claim

JWKS verification proves a signature is valid, but it doesn't prevent replay attacks: a malicious actor intercepts a valid signed authorization and replays it hours later to initiate a duplicate payment. Even though the signature is valid (it's a real signed instruction), replaying it is an attack.

Timestamps (exp) provide time-based protection, but you need per-request uniqueness to detect replays within the expiration window.

Solution: Use jti (JWT ID) claims

When signing a payment instruction, include a unique jti value (typically a UUID) alongside your business identifier (instruction_id) and timestamps (iat, exp). During verification, check if the jti has been seen before. If yes, reject as a replay attack. If no, store the jti in your audit log or a dedicated cache and proceed with verification.

Why instruction_id isn't enough: Partners might legitimately retry a failed authorization with the same instruction_id after a network timeout or temporary error. But they should generate a new jti for each attempt. This lets you distinguish between legitimate retries (different jti, same instruction_id) and replay attacks (duplicate jti).

Implementation considerations:

  • Store jti values in your audit log for permanent record-keeping, or use a dedicated replay prevention cache (Redis with TTL) for high-throughput systems
  • Clean up expired jti values based on the exp claim—no need to store forever
  • Include jti validation in your verification pipeline before accepting the request
  • Coordinate with partners on jti generation (they must generate fresh values for retries)

The demo application generates jti values when signing authorization requests and stores them in the audit log schema. However, it intentionally does not implement replay detection (checking for duplicate jti values) to keep the example focused on JWS signature verification. Production systems should add uniqueness validation before processing requests.
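
A hedged sketch of what that uniqueness validation could look like (module, table, and function names are assumptions, not part of the demo app). It keys an ETS table on jti and uses the token's exp to bound retention.

defmodule ReplayGuard do
  @table :seen_jti

  def init do
    :ets.new(@table, [:set, :public, :named_table, read_concurrency: true])
  end

  # Returns :ok the first time a jti is seen, {:error, :replay_detected} afterwards
  def check_and_store(jti, exp) do
    case :ets.insert_new(@table, {jti, exp}) do
      true -> :ok
      false -> {:error, :replay_detected}
    end
  end

  # Periodically drop entries whose exp has passed; nothing older can replay anyway
  def sweep(now \\ System.system_time(:second)) do
    :ets.select_delete(@table, [{{:_, :"$1"}, [{:<, :"$1", now}], [true]}])
  end
end

check_and_store/2 would run right after signature verification and before the request is processed, using the jti and exp claims from the verified payload.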

Testing, Troubleshooting & Disaster Recovery

Production systems require rigorous testing, clear debugging procedures, and rapid recovery capabilities. This section covers production-specific concerns for multi-tenant JWKS caching.

Testing Production JWKS Systems

Production-Specific Test Scenarios

For detailed unit testing, contract testing with partners, and chaos engineering strategies, see the Testing, Migration & Troubleshooting guide. That post covers comprehensive test scenarios including signature verification, algorithm attacks, clock skew, and audit trail testing.

This section focuses on production-specific concerns for multi-tenant JWKS caching with 200+ partners:

  • Load testing cache performance: 1000 req/s across 200 partners, measuring cache hit ratio and latency percentiles
  • Testing DoS protection triggers: Circuit breaker activation, rate limiting thresholds, debounce window behavior
  • Cache warming on deployment: Verify 200 partners load in < 30 seconds with graceful handling of endpoint failures
  • Multi-partner failover scenarios: Simulate 50+ partners' JWKS endpoints down, verify stale-while-revalidate behavior
  • Telemetry integration: Validate all telemetry events fire with correct metadata and measurements

Critical Test Cases for Multi-Tenant Caching

Test coverage in example app:

  • Cache warming tests here
  • Circuit breaker testing here
  • Rate limiting and debouncing tests here
  • Emergency purge tests here

Example: Testing circuit breaker opens after threshold:

  
  
    
# From: test/jws_demo/jws/jwks_cache_test.exs
test "circuit breaker opens after 5 consecutive unknown kids" do
  partner_id = "test_partner"

  # Trigger 5 consecutive unknown kid failures
  for i <- 1..5 do
    {:error, :kid_not_found_in_jwks} =
      JWKSCache.get_key(partner_id, "unknown-kid-#{i}")
  end

  # 6th attempt should hit circuit breaker
  assert {:error, :circuit_breaker_open} =
    JWKSCache.get_key(partner_id, "unknown-kid-6")

  # Verify telemetry event fired
  assert_received {:telemetry_event,
    [:jwks_cache, :circuit_breaker_open],
    %{consecutive_unknown_kids: 5},
    %{partner_id: "test_partner"}}
end

    
  

This test validates the circuit breaker threshold (@circuit_breaker_threshold 5 in the implementation) and confirms telemetry events fire for monitoring integration.

Load Testing Recommendations

Performance targets for production scale:

| Scenario | Target (p99) | Measurement |
| --- | --- | --- |
| 200 partners, cache hit | < 5ms | ETS lookup + verification |
| 200 partners, cache miss | < 100ms | HTTP fetch + parse + cache |
| Circuit breaker activation | < 10ms | Counter check + log |
| Cache warming (200 partners) | < 30s | Parallel fetch with max_concurrency: 50 |

Use load testing tools like k6 or Locust for multi-partner scenarios. Simulate realistic traffic patterns: 80% cache hits, 15% cache misses (new kids during rotation), 5% unknown kids (attack simulation).

Troubleshooting Production Issues

For complete debugging protocols, common error patterns, and step-by-step troubleshooting procedures (signature verification failures, timestamp issues, algorithm mismatches), see the Testing, Migration & Troubleshooting guide. That post provides a comprehensive debugging protocol with manual verification steps using OpenSSL.

This section covers production-specific debugging challenges unique to multi-tenant JWKS caching:

Production-Specific Debugging

When managing 200+ partners, debugging becomes more complex. Common production scenarios:

  • Partner isolation: Which of 200 partners is causing cache failures? Check partner-specific telemetry metrics.
  • Circuit breaker state: Why is the circuit breaker open for "partner_abc"? Inspect the consecutive_unknown_kids counter.
  • Cache hit ratio drop: Sudden decrease from 95% to 60% - what changed? Check for partner key rotations or JWKS endpoint changes.
  • Rotation confusion: Partner rotated keys but requests failing - old vs new kid mismatch. Verify JWKS fetch succeeded and new kid cached.
  • Stale cache not refreshing: Grace period active but background refresh failing silently. Check graduated alerting logs.

Using Telemetry to Diagnose Issues

Attach runtime telemetry handlers to debug cache issues in production without redeployment:

  
  
    
# Attach telemetry handler to debug cache issues in production
:telemetry.attach(
  "debug-cache-issues",
  [:jwks_cache, :get_key],
  fn _event, measurements, metadata, _config ->
    Logger.info("""
    Cache lookup:
      Partner: #{metadata.partner_id}
      Kid: #{metadata.kid}
      Result: #{metadata.result}
      Duration: #{measurements.duration}μs
      Cache age: #{metadata.cache_age}s
      Cache state: #{metadata.cache_state}
    """)
  end,
  nil
)

# Attach to DoS protection events
:telemetry.attach_many(
  "debug-dos-protection",
  [
    [:jwks_cache, :circuit_breaker_open],
    [:jwks_cache, :rate_limit_exceeded],
    [:jwks_cache, :unknown_kid_rejected]
  ],
  fn event, measurements, metadata, _config ->
    Logger.warning("DoS protection triggered: #{inspect(event)}",
      measurements: measurements,
      metadata: metadata
    )
  end,
  nil
)

    
  

These handlers provide real-time visibility into cache behavior and DoS protection activations without modifying code or increasing log verbosity globally.

Quick Diagnostic Checklist

When debugging partner-specific JWKS issues:

  1. Check telemetry dashboards for partner-specific metrics: cache hit ratio, fetch success rate, verification failures
  2. Inspect ETS cache state: :ets.tab2list(:jwks_cache) |> Enum.filter(&match?({{"partner_abc", _}, _, _, _}, &1))
  3. Verify partner JWKS endpoint: curl https://partner-domain/.well-known/jwks.json - is it accessible? Returns valid JSON?
  4. Check circuit breaker state: Look for {partner_id, :metadata} entry in ETS, inspect consecutive_unknown_kids field
  5. Review last fetch timestamp: Check metadata.last_fetch_at - when did we last successfully fetch JWKS?
  6. Examine rate limit counters: Check {partner_id, :rate_limit} entry for unknown kid attempts
  7. Check graduated alerting logs: Search logs for WARNING → ERROR → CRITICAL → EMERGENCY severity progression
  8. Validate cached kid values: List all kids for partner, compare with current JWKS endpoint response

Example: Diagnosing why partner_abc requests are failing:

  
  
    
# 1. Check what kids we have cached
iex> :ets.tab2list(:jwks_cache)
|> Enum.filter(fn
  {{"partner_abc", _kid}, _jwk, _cached_at, _ttl} -> true
  _ -> false
end)

[
  {{"partner_abc", "2024-12-key"}, %JOSE.JWK{}, 1704470400, 900},
  {{"partner_abc", "2025-01-key"}, %JOSE.JWK{}, 1704556800, 900}
]

# 2. Fetch current JWKS to compare
iex> {:ok, resp} = Req.get("https://partner-abc.example/.well-known/jwks.json")
iex> resp.body["keys"] |> Enum.map(& &1["kid"])
["2025-01-key", "2025-02-key"]  # New key! Old 2024-12-key removed

# 3. Check if circuit breaker open
iex> :ets.lookup(:jwks_cache, {"partner_abc", :metadata})
[{{"partner_abc", :metadata}, %{consecutive_unknown_kids: 5, ...}}]
# Circuit breaker threshold reached!

# 4. Diagnosis: Partner rotated keys, removed old key before our cache expired.
#    Requests with kid=2024-12-key triggered unknown kid, hit circuit breaker.
# Solution: Emergency purge or wait for circuit breaker timeout (60s)

    
  

This debugging session reveals the partner removed an old key before our cache TTL expired, triggering circuit breaker activation. Solution: wait 60 seconds for automatic reset, or perform emergency purge to force immediate refresh.

Disaster Recovery & Business Continuity

Production systems must recover quickly from failures. This section covers cache warming, ETS recovery, and emergency procedures specific to multi-tenant JWKS caching.

Cache Warming on Startup

When nodes restart (deployment, crash, scaling event), the ETS cache is empty. Cache warming prevents a thundering herd of JWKS fetches when the first requests arrive for 200 partners simultaneously.

Call the warm_cache/0 function from Application.start/2 or as a supervised task in your supervision tree. The function uses Task.async_stream with max_concurrency: 50 to parallelize JWKS fetches while limiting concurrent HTTP connections.
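
For reference, here's a condensed sketch of what warm_cache/0 can look like. list_active_partners/0 and fetch_and_cache_jwks/1 are stand-ins for the partner-listing and per-partner fetch helpers from earlier in the series; names may differ in the accompanying repo:

  
  
    
# Runs inside the JWKSCache module: fetch all partners' JWKS in parallel,
# bounded to 50 concurrent HTTP connections, and report the outcome.
def warm_cache do
  partners = list_active_partners()

  results =
    partners
    |> Task.async_stream(&fetch_and_cache_jwks/1,
      max_concurrency: 50,   # cap concurrent HTTP connections
      timeout: 10_000,       # don't let one slow partner block startup
      on_timeout: :kill_task
    )
    |> Enum.map(fn
      {:ok, {:ok, _jwks}} -> :ok
      _other -> :error
    end)

  success_count = Enum.count(results, &(&1 == :ok))

  :telemetry.execute(
    [:jwks_cache, :warm_cache_complete],
    %{success_count: success_count, total: length(partners)},
    %{}
  )

  :ok
end

    
  

The single warm_cache_complete event with success_count/total measurements is the same one the recovery-testing checklist later in this post verifies.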

ETS Table Recovery

ETS tables are in-memory and lost when the owning GenServer process crashes. Recovery strategies:

  • Supervised GenServer: Automatic restart, then warm_cache/0 restores state
  • Graceful degradation: Stale-while-revalidate pattern handles missing cache (fetches on demand)
  • Persistent fallback (optional): Store last-known-good JWKS in database for critical partners (sketched below)
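
A minimal sketch of that fallback, assuming a hypothetical jwks_snapshots table (partner_id, jwks_json, fetched_at) and the JWSDemo.Repo shown in the supervision tree below; a real implementation would also feed restored keys back through the normal cache path and add error handling:

  
  
    
defmodule JWSDemo.JWS.JWKSSnapshot do
  import Ecto.Query
  alias JWSDemo.Repo

  # Persist the raw JWKS document after every successful fetch.
  def store(partner_id, jwks_json) do
    Repo.insert_all(
      "jwks_snapshots",
      [
        %{
          partner_id: partner_id,
          jwks_json: jwks_json,
          fetched_at: DateTime.truncate(DateTime.utc_now(), :second)
        }
      ],
      on_conflict: {:replace, [:jwks_json, :fetched_at]},
      conflict_target: [:partner_id]
    )
  end

  # Load the last-known-good JWKS when the endpoint is unreachable and the
  # ETS cache is empty (e.g. right after a node restart).
  def last_known_good(partner_id) do
    from(s in "jwks_snapshots",
      where: s.partner_id == ^partner_id,
      select: %{jwks_json: s.jwks_json, fetched_at: s.fetched_at}
    )
    |> Repo.one()
  end
end

    
  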

Supervision tree with cache warming:

  
  
    
defmodule JWSDemo.Application do
  use Application

  def start(_type, _args) do
    children = [
      JWSDemo.Repo,
      {JWSDemo.JWS.JWKSCache, []},  # Starts GenServer, creates ETS
      {Task, fn -> JWSDemo.JWS.JWKSCache.warm_cache() end}  # Warm after startup
    ]

    opts = [strategy: :one_for_one, name: JWSDemo.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

    
  

GenServer crashes are rare but possible (OOM, ETS corruption, unhandled exception). When it happens:

  1. Supervision tree detects crash and restarts GenServer
  2. GenServer init/1 creates fresh ETS table
  3. Supervised Task calls warm_cache/0
  4. Cache recovered within 30 seconds (RTO target)

Mass Partner Outage Recovery

What if 50+ partners' JWKS endpoints go down simultaneously (cloud provider outage, Border Gateway Protocol incident, regional datacenter failure)?

Automatic recovery procedure:

  1. Grace period activates: Stale cache serves requests (24-hour window), preventing legitimate transaction rejections
  2. Background refresh retries: Each partner's background refresh task retries with exponential backoff (see the sketch at the end of this subsection)
  3. Graduated alerts fire: Ops team notified at WARNING → ERROR → CRITICAL severity levels
  4. Automatic recovery: When partner endpoints return, background refresh succeeds and updates cache
  5. No manual intervention required: System self-heals when connectivity restored

During the outage, requests continue processing with stale keys (risk accepted for availability). Graduated alerting ensures human oversight for security decisions. See Grace Period Security section for complete safeguards.
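
For illustration, a simplified sketch of the retry loop behind step 2 above. refresh_jwks/1 stands in for the per-partner fetch-and-cache helper from earlier in the series, and the delay cap keeps recovery prompt once connectivity returns:

  
  
    
# Assumes `require Logger` in the enclosing module; refresh_jwks/1 stands in
# for the per-partner fetch-and-cache helper used by the background refresher.
defp refresh_with_backoff(partner_id, attempt \\ 0) do
  case refresh_jwks(partner_id) do
    {:ok, _jwks} ->
      :ok

    {:error, reason} ->
      # 1s, 2s, 4s, ... capped at 5 minutes between attempts
      delay = min(:timer.seconds(1) * Integer.pow(2, attempt), :timer.minutes(5))

      Logger.warning(
        "JWKS refresh failed for #{partner_id}: #{inspect(reason)}; retrying in #{delay}ms"
      )

      Process.sleep(delay)
      refresh_with_backoff(partner_id, attempt + 1)
  end
end

    
  

The loop retries indefinitely; the grace period and graduated alerting above provide the human backstop if an outage drags on.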

Emergency Cache Purge (Security Incident)

Already covered in Safeguard 3: Emergency Cache Purge in the Grace Period Security section. Quick reference for disaster recovery context:

  
  
    
# Immediate purge during security incident
JWKSCache.emergency_purge(
  "partner_abc",
  "ops.alice@example.com",
  "INC-2025-001: Partner confirmed private key compromise via phone"
)

# Result:
# - All cached keys for partner_abc removed from ETS
# - Next request triggers fresh JWKS fetch
# - Audit log entry created (regulatory compliance)
# - Telemetry event emitted to monitoring systems

    
  

Recovery Time Objectives (RTO)

Production systems require clear recovery targets:

  • Single partner JWKS down: RTO 0s (stale cache serves requests immediately)
  • Node restart (deployment): RTO 30s (supervised restart + cache warming with 50 parallel fetches)
  • ETS corruption/crash: RTO 30s (GenServer restart + cache warming)
  • Mass partner outage (50+ partners): RTO 0s (grace period + background refresh retry)
  • Circuit breaker open: RTO 60s (automatic reset on successful cache hit)
  • Security incident (key compromise): RTO 5min (manual emergency purge + out-of-band verification)

These RTOs balance availability (minimize downtime) with security (verify incidents before resuming). The 0s RTO for common failures (partner outages) reflects the business priority of not blocking legitimate transactions during temporary infrastructure issues.

Recovery Testing

Test disaster recovery procedures quarterly to verify RTOs and identify gaps:

  • Node crash simulation: Kill BEAM process in staging (kill -9 $PID), verify cache warming completes in < 30s
  • Verify warm cache telemetry: Check the [:jwks_cache, :warm_cache_complete] event fires with success_count/total metrics (automated in the test sketch at the end of this section)
  • Partner endpoint failure: Block partner JWKS with firewall rule, verify stale cache serves requests
  • Background refresh retry: Confirm exponential backoff logged, automatic recovery when firewall removed
  • Security incident drill: Trigger emergency purge, verify audit trail created, cache cleared, telemetry fired
  • Circuit breaker reset: Trigger circuit breaker (5 consecutive unknown kids), wait 60s, verify automatic reset on successful fetch
  • Measure actual vs target RTO: Document deltas, update procedures or targets as needed

Document test results in your operational runbook. Update recovery procedures based on findings. Schedule next drill in calendar (quarterly cadence ensures procedures stay fresh).
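
The warm-cache drill (including the telemetry check) lends itself to automation as a staging smoke test. A minimal ExUnit sketch, assuming warm_cache/0 emits the [:jwks_cache, :warm_cache_complete] event with success_count and total measurements as described above:

  
  
    
defmodule JWSDemo.JWS.CacheWarmingDrillTest do
  use ExUnit.Case, async: false

  test "cache warming completes and reports results via telemetry" do
    parent = self()
    handler_id = "warm-cache-drill"

    # Forward the completion event to the test process
    :telemetry.attach(
      handler_id,
      [:jwks_cache, :warm_cache_complete],
      fn _event, measurements, _metadata, _config ->
        send(parent, {:warm_cache_complete, measurements})
      end,
      nil
    )

    on_exit(fn -> :telemetry.detach(handler_id) end)

    JWSDemo.JWS.JWKSCache.warm_cache()

    # RTO target: warming should finish well within 30 seconds
    assert_receive {:warm_cache_complete, %{success_count: success, total: total}}, 30_000
    assert success == total
  end
end

    
  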

Wrapping Up

Production JWKS is about more than just publishing keys and caching them. It's about:

  • Multi-tenancy: Managing 200+ partners with different configurations, endpoints, and SLAs
  • Resilience: Graceful degradation when partner endpoints fail, with stale-while-revalidate caching
  • Security: Protecting the grace period from compound attacks with graduated alerting and emergency purge capabilities
  • Operations: Monitoring, runbooks, incident response, and regular key rotation automation
  • Testing & Recovery: Load testing at scale, disaster recovery procedures, and 30-second cache warming for rapid failover

Key takeaways from this post:

  1. Multi-tenant architecture requires per-partner configuration - Different algorithms, TTLs, clock skew tolerance
  2. Stale-while-revalidate is essential for availability - But requires security safeguards for financial services
  3. Grace periods need human-in-the-loop verification - Automatic for technical issues, manual override for security incidents
  4. Emergency purge is not over-engineering - It's standard practice for financial services handling irreversible transactions
  5. Monitoring and alerting are critical - You need visibility into all 200 partners' key health
  6. Circuit breakers and rate limiting protect you - One partner's outage shouldn't impact others
  7. Disaster recovery enables rapid restoration - Cache warming, supervised restart, and graceful degradation minimize downtime to 30 seconds or less

From capability to maturity:

The previous post gave you the capability to implement JWKS. This post gave you the operational maturity to run it in production at scale. The difference shows up during:

  • Incidents: Do you wake up at 3 AM to manually fix key issues? Or do automated safeguards handle common failures while alerting on anomalies?
  • Audits: Can you prove you have incident response procedures? Can you show the audit trail of emergency cache purges?
  • Partner onboarding: Does adding a new partner require code changes and deployments? Or database configuration?
  • Security reviews: Can you explain how you protect against compound attacks? Do you have runbooks?

Production checklist:

Before going live with multi-tenant JWKS in financial services:

  • Per-partner configuration with allowed algorithms and clock skew
  • ETS-backed caching for microsecond lookups
  • Stale-while-revalidate with 24-hour grace period
  • Graduated alerting (WARNING → ERROR → CRITICAL → EMERGENCY)
  • Emergency purge function with audit logging
  • Out-of-band partner contact list
  • Incident response runbook documented and tested
  • Telemetry integration (Datadog, PagerDuty, etc.)
  • Per-partner circuit breakers
  • Database indexes on partner lookups
  • Monitoring dashboards for all partners
  • Quarterly incident response drills

What's next:

You now have JWKS running at scale with proper security and operational safeguards. But signatures only prove a message was sent. How do you store that proof so it's still valid years later when a regulator asks?

Our next focus will be on the long game: building audit trails that survive disputes years later. You'll learn how to store JWS signatures for regulatory compliance, handle canonicalization and replay attacks, deal with clock skew, and create the "forever proof" that wins disputes and satisfies auditors.
