JWKS in Production
Multi-Tenancy, Security, and Operations
This article is part of a series on understanding the hows and whys of JSON Web Signatures (JWS).
There's accompanying code: it's referred to and linked throughout the content. But if you'd rather just read raw code, head over here.
This is a comprehensive reference guide covering aspects of production JWKS systems. Some may find it long and tedious, so here are key sections you may want to jump to:
- Architecture: Per-Partner JWKS Cache
- DoS Protection
- Grace Period Security Safeguards
- Operational Best Practices
- Security Considerations (including Replay Attack Prevention)
- Testing, Troubleshooting & Disaster Recovery
We've previously gone over how to implement JWKS and achieve zero-downtime key rotation. You have the fundamentals: publishing keys at /.well-known/jwks.json, the four-phase rotation process, basic TTL-based caching. That's enough to get JWKS running.
But production systems—especially in financial services—face challenges that don't appear in tutorials:
- Multi-tenancy: You're not verifying tokens from one partner. You're verifying from 200 issuer banks, each with their own JWKS endpoint, rotation schedule, and operational quirks.
- Partner outages: When a partner's JWKS endpoint goes down at 3 AM, do you reject all their legitimate transactions? Or use stale cached keys and risk accepting compromised signatures?
- Security incidents: What happens when an attacker steals a partner's private key and DoS their JWKS endpoint simultaneously?
- Operational maturity: How do you monitor 200 partners' key health? Who gets paged when caches go stale? What's the incident response runbook?
This post covers production-grade JWKS architecture: multi-tenant caching, graceful degradation, security safeguards for grace periods, and operational patterns that let you sleep at night.
We'll be building on the fundamentals introduced in the previous post, so if you haven't read it yet, do so now.
Multi-Tenancy: Managing Keys for Many Partners
If you're building a payment scheme API, you're not just verifying signatures from one partner; you're verifying them from dozens or hundreds. Each partner has their own JWKS endpoint, their own key rotation schedule, their own operational quirks. This is where JWKS architecture gets interesting.
The Challenge
Your scheme processes authorization requests from 200 issuer banks. Each issuer:
- Publishes their own JWKS at their own domain
- Rotates keys on their own schedule
- May have 2-5 active keys at any given time
- Has different SLAs for JWKS endpoint availability
Naive approach:
# DON'T - Fetching JWKS on every request
# This is slow, unreliable, and will kill your latency
def verify_partner_signature(jws, partner_id) do
  partner_domain = get_partner_domain(partner_id)
  {:ok, %{body: jwks}} = Req.get("https://#{partner_domain}/.well-known/jwks.json")

  # Extract kid and find matching key
  {:ok, kid} = extract_kid(jws)
  key = find_key_in_jwks(jwks, kid)

  # Verify signature
  JOSE.JWS.verify_strict(key, ["ES256"], jws)
end
Problems:
- Makes a network call on every verification (adds 50-200ms latency)
- If a partner JWKS endpoint goes down, all their requests will fail
- No caching means there's a DoS vulnerability (attacker sends many requests, you make many JWKS fetches)
- Partner endpoints (which aren't under your control) can impact your own availability
Architecture: Per-Partner JWKS Cache
A production-grade JWKS architecture for multi-tenant systems must satisfy three core requirements:
- Isolation: One partner's key issues cannot impact others. Each partner gets their own cache namespace with independent TTL, grace period, and error handling.
- Performance: Key lookups must complete in microseconds, not milliseconds. Cache hits avoid network calls entirely, reducing latency 100-2000x compared to fetching JWKS on every request.
- Graceful degradation: Temporary partner outages should not block legitimate transactions. Stale keys remain usable during endpoint unavailability, with automatic recovery when service resumes.
Cache Structure
The cache uses ETS (Erlang Term Storage) for lock-free concurrent reads, with composite keys combining partner identity and key ID:
# ETS table structure: :jwks_cache
# Entry format: {{partner_id, kid}, jwk, cached_at, ttl}
:ets.insert(:jwks_cache, {
  {"partner_abc", "2025-01-key"}, # Composite key
  jwk,                            # JOSE.JWK structure
  1704470400,                     # Unix timestamp (cached_at)
  900                             # TTL in seconds (15 minutes)
})
Each partner maintains independent metadata tracking fetch history, success rates, and DoS protection state. This enables per-partner tuning and isolated error handling.
Cache States and Decision Logic
Cache lookups follow a three-state model with different response strategies:
case :ets.lookup(:jwks_cache, {partner_id, kid}) do
  [{_, jwk, cached_at, ttl}] ->
    age = now() - cached_at

    cond do
      # FRESH: Within TTL, return immediately (~100μs)
      age < ttl ->
        {:ok, jwk}

      # STALE: Expired but within 24h grace period
      # Return cached key + trigger background refresh
      age < 86_400 ->
        spawn(fn -> refresh_jwks(partner_id) end)
        {:ok, jwk} # Don't block request!

      # TOO STALE: Must refresh, block on HTTP call
      true ->
        fetch_and_cache_jwks(partner_id)
        retry_lookup(partner_id, kid)
    end

  [] ->
    # Unknown kid - apply DoS protection before fetching
    handle_cache_miss_with_protection(partner_id, kid)
end
The stale-while-revalidate pattern (second branch) is critical for availability. When a partner's JWKS endpoint experiences an outage, their cached keys remain usable for up to 24 hours, preventing legitimate transaction rejections. The background refresh ensures automatic recovery when the endpoint returns to service.
Note that when we spawn(fn -> refresh_jwks(partner_id) end) we deliberately use spawn/1 rather than spawn_link/1: we don't want failures in JWKS fetching to bring down the parent process.
Important: The fetch_and_cache_jwks/1 call includes critical safeguards: 5-second HTTP timeout (receive_timeout: 5000), DoS protection layers (circuit breaker, rate limiting, debouncing via handle_cache_miss_with_protection/3), and error handling. The "TOO STALE" path (cache > 24h old) should be rare in healthy systems - it primarily handles cold starts or extended partner outages. See the implementation for complete timeout and retry configuration.
The Unknown Kid Challenge
The cache miss path (final branch) requires special handling. When a JWS references a kid not in cache, we must fetch fresh JWKS to check if the key legitimately exists. However, this creates a DoS vulnerability: an attacker can send JWS signatures with randomly-generated kid values, forcing repeated JWKS fetches. Each malicious request triggers an HTTP call to the partner's endpoint, amplifying the attack 100-1000x.
This vulnerability requires defense-in-depth protections, covered in the next section.
Implementation Reference
See the demo app for complete code including:
- Cache state handling with age-based decisions here
- DoS protection for unknown kid attacks here
- Graduated alerting for stale grace period here
- Emergency cache purge for security incidents here
The implementation demonstrates production patterns including telemetry integration, circuit breakers, rate limiting, and operational safeguards for financial services.
DoS Protection: Safeguarding Against Unknown Kid Attacks
The Vulnerability
Consider what happens when signature verification receives a JWS with
an unknown kid value:
def verify_signature(jws, partner_id) do
  {:ok, kid} = extract_kid(jws)          # Attacker controls this!
  {:ok, jwk} = get_key(partner_id, kid)  # Cache miss = HTTP fetch
  JOSE.JWS.verify_strict(jwk, ["ES256"], jws)
end
The attack: An attacker sends 1,000 requests/second, each with a different randomly-generated kid value like "attack-00001", "attack-00002", etc. Your cache has never seen these keys, so each request triggers a fresh JWKS fetch to the partner's endpoint.
Result: 1,000 outbound HTTP requests/second, overwhelming either your HTTP client connection pool or the partner's JWKS endpoint.
Attack characteristics:
- Amplification: 1 malicious request → 1 HTTP fetch (50-200ms latency penalty)
- Resource exhaustion: Connection pool depletion, partner endpoint overload
- Plausible deniability: Requests look legitimate (valid JWS structure, real partner ID)
- Multi-target: Attacker can rotate through different partner IDs to evade per-partner limits
Three-Layer Defense Strategy
Protection requires multiple safeguards working together. Each layer addresses a different attack dimension:
Layer 1: Debouncing (60-second window)
Once we fetch JWKS for a partner, don't fetch again for 60 seconds regardless of how many unknown kids are requested. This eliminates the amplification factor.
defp check_recent_fetch(partner_id, now) do
  metadata_key = {partner_id, :metadata}

  case :ets.lookup(:jwks_cache, metadata_key) do
    [{^metadata_key, metadata}] ->
      age = now - metadata.last_fetch_at

      if age < 60 do        # Debounce window
        {:debounced, age}   # Too recent, reject fetch
      else
        :ok                 # Enough time passed, allow fetch
      end

    [] ->
      :ok # No prior fetch, allow
  end
end
Effect: Attacker can trigger at most 1 JWKS fetch per 60 seconds per partner, regardless of request volume. 1,000 requests with different kids → 1 HTTP fetch.
Tuning: Increase window to 120s for partners with slow JWKS endpoints. Decrease to 30s for partners with frequent legitimate key rotations.
Layer 2: Rate Limiting (10 attempts/minute)
Track how many times unknown kids are requested per partner per minute.
Block requests exceeding threshold.
defp check_unknown_kid_rate_limit(partner_id, now) do
  rate_limit_key = {partner_id, :rate_limit}
  window = 60 # 1 minute

  case :ets.lookup(:jwks_cache, rate_limit_key) do
    [{^rate_limit_key, count, window_start}] when now - window_start < window ->
      if count >= 10 do # Rate limit threshold
        :telemetry.execute(
          [:jwks_cache, :rate_limit_exceeded],
          %{attempts: count},
          %{partner_id: partner_id}
        )

        {:error, :rate_limited}
      else
        # Increment counter within same window
        :ets.update_counter(:jwks_cache, rate_limit_key, {2, 1})
        :ok
      end

    _ ->
      # Start new window
      :ets.insert(:jwks_cache, {rate_limit_key, 1, now})
      :ok
  end
end
Effect: Even if debouncing allows through some unknown kid requests, we won't process more than 10 per minute per partner. This limits attack impact during the debounce window.
Tuning: For high-volume partners actively rotating keys, increase to 20-30/min. For low-volume partners, decrease to 5/min.
Layer 3: Circuit Breaker (5 consecutive failures)
Track consecutive unknown kid rejections. After threshold, open circuit
and reject all requests for that partner until manual review.
defp check_circuit_breaker(partner_id) do
  metadata_key = {partner_id, :metadata}

  case :ets.lookup(:jwks_cache, metadata_key) do
    [{^metadata_key, metadata}] ->
      if metadata.consecutive_unknown_kids >= 5 do # Threshold
        :telemetry.execute(
          [:jwks_cache, :circuit_breaker_open],
          %{consecutive_unknown_kids: metadata.consecutive_unknown_kids},
          %{partner_id: partner_id}
        )

        Logger.error("""
        Circuit breaker OPEN for #{partner_id}
        Too many consecutive unknown kids (#{metadata.consecutive_unknown_kids})
        Manual review required
        """)

        {:error, :circuit_breaker_open}
      else
        :ok
      end

    [] ->
      :ok # No metadata yet
  end
end
Effect: Sustained attacks trigger circuit breaker, halting all requests until ops investigates. Prevents prolonged resource consumption. Counter resets on any successful cache hit.
Tuning: For partners with frequent key rotation issues, increase to 10-15 consecutive failures. For critical partners, decrease to 3 to trigger investigation faster.
How the Layers Work Together
defp handle_cache_miss_with_protection(partner_id, kid, state) do
  now = System.system_time(:second)

  with :ok <- check_circuit_breaker(partner_id),
       :ok <- check_unknown_kid_rate_limit(partner_id, now),
       :ok <- check_recent_fetch(partner_id, now) do
    # All protections passed - safe to fetch JWKS
    Logger.info("All protection checks passed for #{partner_id}/#{kid}, fetching JWKS")
    handle_cache_miss(partner_id, kid, state, now)
  else
    {:error, :circuit_breaker_open} = error ->
      Logger.error("Circuit breaker OPEN for #{partner_id}, rejecting kid: #{kid}")
      {:reply, error, state}

    {:error, :rate_limited} = error ->
      Logger.error("Rate limit EXCEEDED for #{partner_id}, rejecting kid: #{kid}")
      {:reply, error, state}

    {:debounced, age} ->
      # We just fetched JWKS, kid truly doesn't exist
      Logger.warning("Unknown kid #{kid}, JWKS fetched #{age}s ago")
      increment_unknown_kid_counter(partner_id)
      {:reply, {:error, :kid_not_found_in_jwks}, state}
  end
end
Complete implementation here with full logging and telemetry integration.
Attack scenario walkthrough:
- t=0s: Attacker sends 100 requests with unknown kids
- First request passes all checks and triggers one JWKS fetch
- Requests #2-6 pass the rate limit, but debouncing blocks further fetches; each ends in kid-not-found and increments the consecutive unknown kid counter
- After 5 consecutive unknown kids, the circuit breaker opens
- Requests #7-100 are rejected immediately; even without the circuit breaker, the 10/min rate limit would cap unknown kid handling within the window
- All subsequent requests rejected until manual review
Result: 100 attack requests → 1 JWKS fetch, then complete blocking. Attack neutralized.
Implementation Details
The three-layer defense strategy relies on per-partner state tracking and comprehensive observability. Let's examine how metadata is stored and how telemetry provides visibility into DoS attacks.
Each partner's protection state is tracked in ETS metadata:
# Stored in ETS as: {{partner_id, :metadata}, metadata}
metadata = %{
# Last time we fetched JWKS (unix seconds)
last_fetch_at: 1704470400,
# Whether last fetch succeeded
last_fetch_success: true,
# Circuit breaker counter (resets on cache hit)
consecutive_unknown_kids: 0
}
# Separate rate limit counter:
# {{partner_id, :rate_limit}, count, window_start}
Metadata updates occur at:
- JWKS fetch: Update last_fetch_at and last_fetch_success (code)
- Cache hit: Reset consecutive_unknown_kids to 0 (code)
- Unknown kid: Increment consecutive_unknown_kids (code)
- Unknown kid request: Increment rate limit counter (code)
The DoS protection system emits telemetry events for observability:
# Unknown kid rejected after debouncing
:telemetry.execute(
[:jwks_cache, :unknown_kid_rejected],
%{age_since_fetch: 30}, # Seconds since last fetch
%{partner_id: "partner_abc", kid: "attack-00001"}
)
# Rate limit exceeded
:telemetry.execute(
[:jwks_cache, :rate_limit_exceeded],
%{attempts: 15}, # Total attempts in window
%{partner_id: "partner_abc"}
)
# Circuit breaker opened
:telemetry.execute(
[:jwks_cache, :circuit_breaker_open],
%{consecutive_unknown_kids: 5},
%{partner_id: "partner_abc"}
)
# Counter incremented (helps track trends)
:telemetry.execute(
[:jwks_cache, :unknown_kid_incremented],
%{consecutive_count: 3},
%{partner_id: "partner_abc"}
)
Recommended alerts:
- WARNING: Rate limit exceeded 3+ times in 5 minutes (possible attack)
- ERROR: Circuit breaker opened (requires manual review)
- INFO: Unknown kid counter > 3 (monitor for patterns)
Integrate telemetry with Datadog, New Relic, or your observability platform to visualize attack patterns and tune thresholds.
Tuning Recommendations
Adjust protection thresholds based on partner characteristics:
| Partner Type | Debounce | Rate Limit | Circuit Breaker | Rationale |
|---|---|---|---|---|
| High-volume, stable | 60s | 10/min | 5 consecutive | Standard baseline |
| High-volume, frequent rotation | 30s | 20/min | 10 consecutive | Allow legitimate key churn |
| Low-volume | 120s | 5/min | 3 consecutive | Stricter limits, fewer false positives |
| Slow JWKS endpoint | 180s | 10/min | 5 consecutive | Reduce load on partner infrastructure |
| Development/staging | 10s | 50/min | 20 consecutive | Lenient for testing |
Configuration approach: Store thresholds in partner configuration (database) rather than hardcoding. This enables per-partner tuning and A/B testing of threshold adjustments.
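As a rough sketch of what that per-partner configuration could look like (the schema and field names here are illustrative assumptions, not the demo app's actual model), an Ecto schema holding the protection thresholds alongside the existing partner settings:

# Hypothetical schema - field names are illustrative, adapt to your domain model
defmodule JWSDemo.Partners.PartnerConfig do
  use Ecto.Schema

  schema "partner_configs" do
    field :partner_id, :string
    field :jwks_url, :string
    field :allowed_algorithms, {:array, :string}, default: ["ES256"]
    field :clock_skew_tolerance, :integer, default: 300

    # DoS protection thresholds, tunable per partner
    field :debounce_window_seconds, :integer, default: 60
    field :unknown_kid_rate_limit, :integer, default: 10
    field :circuit_breaker_threshold, :integer, default: 5

    field :active, :boolean, default: true
    timestamps()
  end
end

The protection checks then read these values instead of hardcoded constants, e.g. if count >= config.unknown_kid_rate_limit do rather than if count >= 10 do.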
Implementation Checklist
Before deploying DoS protection to production:
- Configure debounce window per environment (60s production, 10s staging)
- Set rate limit thresholds based on partner risk tier (reference table above)
- Connect telemetry events to monitoring system (Datadog, New Relic, Prometheus)
- Create runbook for circuit breaker alerts (who investigates, escalation path)
- Test DoS protection with load testing tool (send 1000 req/s with random kids)
- Document tuning decisions in ADR (Architecture Decision Record)
- Set up dashboards showing: rate limit hits, circuit breaker events, unknown kid trends
- Configure alerts with appropriate severity levels (warning vs error)
- Test circuit breaker reset procedure (manual counter reset)
- Review protection effectiveness quarterly, adjust thresholds based on data
Key Takeaways
Unknown kid attacks exploit the cache miss path, forcing repeated JWKS fetches. A single protection mechanism isn't sufficient:
- Debouncing eliminates amplification (1000 requests → 1 fetch)
- Rate limiting contains burst attacks within the debounce window
- Circuit breakers halt sustained attacks and trigger human investigation
The three layers combine to provide defense-in-depth: most attacks blocked by debouncing, sophisticated attacks caught by rate limiting, coordinated attacks halted by circuit breaker.
Remember: These protections are distinct from the grace period security safeguards (covered below). DoS protection addresses attack amplification via cache misses, while grace period security addresses compound attacks exploiting stale key usage during partner endpoint outages. Both are essential for production financial services.
Key Lookup Strategy
Database vs. Cache vs. ETS:
For 200+ partners, you need fast lookups:
# Strategy: Tiered lookup
# 1. ETS (microseconds) - Hot partners
# 2. Database (milliseconds) - Configuration
# 3. HTTP (50-200ms) - JWKS fetch
def verify_multi_tenant(jws, partner_id) do
  with {:ok, config} <- get_partner_config(partner_id), # DB/ETS
       true <- config.active,
       {:ok, kid} <- extract_kid(jws),
       {:ok, key} <- PartnerJWKSCache.get_key(partner_id, kid),
       {:ok, claims} <- verify_signature(jws, key, config.allowed_algorithms),
       :ok <- validate_claims(claims, config.clock_skew_tolerance) do
    {:ok, claims}
  else
    {:error, reason} -> {:error, reason}
    false -> {:error, :partner_inactive}
  end
end
Performance characteristics:
- Partner config lookup: ~100μs (ETS) or ~1-2ms (PostgreSQL with connection pool)
- JWKS key lookup: ~100μs (cache hit) or ~50-200ms (cache miss, HTTP fetch)
- Signature verification: ~1-5ms (ES256)
- Total: ~1-7ms for cache hit, ~50-210ms for cache miss
See code for details on cache hit optimization and TTL strategy (fresh cache ~100μs, stale-while-revalidate pattern, DoS protection on cache miss).
Monitoring Multi-Tenant Key Operations
Metrics you need:
# Per-partner metrics
:telemetry.execute([:partner_jwks, :fetch], %{duration: duration}, %{
  partner_id: partner_id,
  result: :success,   # or :failure
  cache_hit: true     # or false
})

:telemetry.execute([:partner_signature, :verify], %{duration: duration}, %{
  partner_id: partner_id,
  kid: kid,
  result: :success,   # or :failure
  failure_reason: reason
})

# Aggregate metrics
:telemetry.execute([:jwks, :cache], %{
  total_partners: 200,
  cached_partners: 185,
  stale_partners: 10,
  failed_partners: 5
})
Dashboard queries:
- Which partners have the highest verification failure rates?
- Which partners' JWKS endpoints are frequently unavailable?
- Which kid values are most commonly used per partner?
- Are any partners using disallowed algorithms?
Partner Onboarding Checklist
Before adding a new partner to production:
- Partner provides JWKS URL: https://auth.issuer-abc.example/.well-known/jwks.json
- Verify JWKS is accessible and returns valid JSON
- Identify which algorithms they use (ES256, RS256, etc.)
- Partner provides sample signed request for testing
- Verify we can validate their sample signature
- Add partner configuration to database:
- partner_id
- domain
- allowed_algorithms
- clock_skew_tolerance (default: 300s)
- Deploy configuration to staging
- Run contract tests with partner's test environment
- Partner validates we can verify requests they sign
- Set up monitoring alerts for this partner
- Document partner-specific operational notes (e.g., "rotates keys monthly")
- Add partner to production with jws_enforcement_mode: :warning
- Monitor for 7 days in warning mode
- Switch to jws_enforcement_mode: :enforced
Scaling Considerations
For 500+ partners:
JWKS fetch parallelization:
# Warm cache for all partners on startup
def warm_cache() do
  Repo.all(PartnerConfig)
  |> Task.async_stream(&fetch_and_cache_jwks/1, max_concurrency: 50, timeout: 10_000)
  |> Stream.run()
end
Full cache warming implementation here with error handling, success/failure tracking, and telemetry integration.
Circuit breakers per partner:
# Don't let one partner's attacks affect others
# Circuit opens after 5 consecutive unknown kids
def check_circuit_breaker(partner_id) do
  metadata = get_partner_metadata(partner_id)

  if metadata.consecutive_unknown_kids >= 5 do
    {:error, :circuit_breaker_open}
  else
    :ok
  end
end
Circuit breaker implementation here (threshold-based, per-partner state, automatic telemetry).
Database indexes:
CREATE INDEX idx_partner_configs_partner_id ON partner_configs(partner_id);
CREATE INDEX idx_partner_configs_active ON partner_configs(active) WHERE active = true;
Essential for fast partner config lookups when warming cache or handling requests for 500+ partners.
ETS table optimization:
# Single ETS table with read_concurrency for multi-core systems
:ets.new(:jwks_cache, [:set, :public, :named_table, read_concurrency: true])
See the code for ETS table initialization with read_concurrency: true for efficient concurrent reads across multiple BEAM schedulers.
What Can Go Wrong
Partner rotates all keys simultaneously
Partner removes old key from JWKS before your cache expires. Requests with old kid fail.
Mitigation:
- Educate partners on zero-downtime rotation (overlap period)
- Use stale cache as fallback (covered in next section)
- Alert when an unknown kid spike occurs
Partner JWKS has 100+ keys
Some large institutions never remove old keys. JWKS becomes 500KB.
Mitigation:
- Limit JWKS fetch size (reject responses >1MB)
- Cache only keys seen in last 90 days
- Work with partner to clean up old keys
Two partners share a JWKS endpoint
Partner A and Partner B are subsidiaries of same company, publish JWKS at same URL.
Solution:
%PartnerConfig{
  partner_id: "partner_a",
  jwks_url: "https://shared.example/.well-known/jwks.json",
  allowed_kids: ["partner-a-2025-01", "partner-a-2025-02"] # Filter by kid
}

%PartnerConfig{
  partner_id: "partner_b",
  jwks_url: "https://shared.example/.well-known/jwks.json",
  allowed_kids: ["partner-b-2025-01", "partner-b-2025-02"]
}
Multi-tenancy adds operational complexity, but it's the reality of payment scheme APIs. The alternative (i.e., managing keys for 200 partners manually) is far worse.
Security Safeguards: Protecting the Grace Period
Why We Need a Grace Period: The Cost of Blocking Legitimate Transactions
Before discussing security risks, let's understand why the 24-hour stale grace period exists and why simply removing it isn't an option for financial services.
The Business Problem: Technical Hiccups Cost Money
In financial services, blocking legitimate transactions has real financial consequences:
- Lost revenue: Partner can't process €10M in daily payment volume during your 2-hour JWKS cache outage
- SLA penalties: Contracts often include 99.9% uptime requirements with financial penalties for breaches
- Regulatory impact: Payment Service Providers must maintain operational continuity under PSD2
- Customer churn: Merchants switch payment providers after repeated authorization failures
- Reputational damage: "Payment gateway DOWN" trending on social media damages both parties
Common Failure Scenarios
JWKS endpoints can become temporarily unavailable for many non-malicious reasons:
- Network partition: ISP routing issue between your datacenter and partner's JWKS endpoint (30 minutes to 2 hours)
- DNS propagation: Partner moves JWKS endpoint, DNS takes time to propagate globally (15-60 minutes)
- Certificate renewal: Partner's TLS certificate expires, they're scrambling to renew (1-4 hours)
- Cloud provider outage: Partner's CDN has regional outage affecting JWKS delivery (2-8 hours)
- DDoS mitigation: Partner's Web Application Firewall blocks legitimate traffic during attack response (variable duration)
- Deployment gone wrong: Partner's infrastructure change breaks JWKS endpoint, rollback needed (30 minutes to 4 hours)
- Rate limiting: Your cache refresh hits partner's rate limits during high traffic (until rate limit window resets)
The Strict Approach: Fail Immediately on Cache Miss
Without a grace period, here's what happens when partner's JWKS endpoint is unreachable:
def get_key(partner_id, kid) do
  with {:ok, jwk, cached_at} <- fetch_from_cache(partner_id, kid),
       true <- fresh?(cached_at) do
    # Fresh cache hit
    {:ok, jwk}
  else
    # Expired or not cached - must fetch fresh JWKS
    _ ->
      case fetch_fresh_jwks(partner_id) do
        {:ok, jwks} -> cache_and_return(jwks, kid)
        # FAIL IMMEDIATELY - no grace period fallback
        {:error, _} -> {:error, :jwks_unavailable}
      end
  end
end
Impact of immediate failure:
- All requests rejected with 401 Unauthorized until JWKS endpoint recovers
- Cached keys that are still cryptographically valid are discarded
- Legitimate transactions blocked due to infrastructure issues unrelated to security
- Partner's business stops until their ops team fixes the endpoint
Example: The 2-Hour Outage
Let's walk through a real scenario:
- 10:00 AM: Partner's infrastructure team deploys a configuration change
- 10:05 AM: JWKS endpoint returns 503 Service Unavailable (misconfigured load balancer)
- 10:15 AM: Your cache expires (15-minute TTL), attempts refresh, gets 503
- Without grace period:
- 10:15-12:00: All partner requests rejected (1h 45m of downtime)
- €300K in payment authorizations blocked
- Customers calling merchants: "Why isn't my payment working?"
- Merchants calling partner: "Your payment gateway is down!"
- Partner calling you: "Why are you rejecting our valid requests?"
- With 24-hour grace period:
- 10:15-12:00: Stale keys still work, all transactions succeed
- Zero customer impact
- Ops team has time to investigate (not emergency page at 3 AM)
- Partner fixes endpoint by 12:00, next cache refresh succeeds
The Grace Period Solution: Stale-While-Revalidate
The grace period implements the stale-while-revalidate caching pattern:
def get_key(partner_id, kid) do
  case fetch_from_cache(partner_id, kid) do
    {:ok, jwk, cached_at} ->
      age = now() - cached_at

      cond do
        # FRESH: Return immediately
        age < 900 -> # 15 minutes
          {:ok, jwk}

        # STALE BUT USABLE: Return cached + trigger background refresh
        age < 86_400 -> # 24 hours
          spawn(fn -> refresh_cache(partner_id) end) # Don't block!
          {:ok, jwk} # Still works during partner outage

        # TOO STALE: Must refresh
        true ->
          case fetch_fresh_jwks(partner_id) do
            {:ok, jwks} -> cache_and_return(jwks, kid)
            {:error, _} -> {:error, :jwks_unavailable}
          end
      end
  end
end
Benefits:
- Zero downtime for temporary partner outages (less than 24 hours)
- Legitimate transactions continue even when JWKS endpoint is unreachable
- Automatic recovery when endpoint comes back online
- Non-blocking refresh attempts in background
- Graceful degradation rather than hard failure
Why 24 hours specifically?
- Covers most infrastructure outages (network issues, deployments, certificate problems)
- Gives ops teams time to respond during business hours (no 3 AM emergency pages for transient issues)
- Accounts for global timezones (partner's ops team might be asleep during your peak traffic)
- Allows for vendor support ticket resolution times
- Balances availability (long grace period) with security (not infinite)
The Security Risk: Compound Attack Scenarios
Now that we understand why the grace period is essential, we must address the security risk it creates.
The grace period assumes that stale keys are still trustworthy. This is true for infrastructure outages, but false during security incidents.
The Compound Attack Vector
A sophisticated attacker can exploit the grace period by executing a compound attack:
- Compromise Partner A's private key
- Phishing attack on partner's ops team
- Malware on partner's key management system
- Insider threat (disgruntled employee)
- Supply chain attack on partner's infrastructure
- Simultaneously launch DoS attack on Partner A's JWKS endpoint
- Volumetric attack (saturate bandwidth)
- Application-layer attack (exhaust server resources)
- DNS attack (make endpoint unreachable)
- Your system behavior:
- JWKS refresh attempts fail (endpoint unreachable)
- Grace period activates (use stale cached keys)
- Cached keys include the now-compromised key's public counterpart
- Attacker signs malicious payloads
- Payment authorizations for attacker-controlled accounts
- Fraudulent transaction instructions
- Account modifications or data exfiltration
- You accept them as valid
- Signature verifies correctly (attacker has the private key!)
- Grace period allows stale keys (can't fetch fresh JWKS to see revocation)
- Attack window: Up to 24 hours
Why This Attack is Realistic
This isn't theoretical - it combines two common attack patterns:
- Key compromise: Happens regularly (phishing, malware, insider threats)
- DoS attacks: Common and easy to execute (botnets, amplification attacks)
A sophisticated attacker (nation-state, organized crime) can execute both.
Attack Window Analysis
| Scenario | Attack Window | Availability Impact | Security Risk |
|---|---|---|---|
| No grace period | ~15 minutes (TTL) | HIGH (blocks legitimate traffic) | LOW (short window) |
| 24-hour grace period | Up to 24 hours | LOW (maintains availability) | HIGH (long window) |
| Grace period + safeguards | Minutes to hours (human response time) | LOW (maintains availability) | LOW (human verification) |
For financial services handling irreversible transactions, a 24-hour attack window is unacceptable without safeguards.
Defense in Depth: Human-in-the-Loop Verification
The solution is not to eliminate the grace period (that sacrifices availability), but to add monitoring, alerting, and manual override capabilities.
Core principle: Use the grace period for technical issues (automatic), but enable human intervention for security incidents (manual override).
Safeguard 1: Graduated Alerting
Alert ops teams when cache enters stale grace period, with severity escalation based on cache age:
| Cache Age | Severity | Response | Rationale |
|---|---|---|---|
| less than 1 hour | WARNING | Ops notified via Slack/email | Likely transient issue, monitor |
| 1-4 hours | ERROR | Escalate to senior ops | Extended outage, needs investigation |
| 4-12 hours | CRITICAL | Page on-call engineer | Unusual duration, possible incident |
| ≥ 12 hours | EMERGENCY | Executive notification | Approaching grace period limit |
Implementation:
# In jwks_cache.ex
defp alert_stale_cache(partner_id, kid, age_seconds, cached_at) do
  severity =
    cond do
      age_seconds < 3600 -> :warning     # < 1 hour
      age_seconds < 14_400 -> :error     # < 4 hours
      age_seconds < 43_200 -> :critical  # < 12 hours
      true -> :emergency                 # >= 12 hours
    end

  # Emit telemetry for monitoring systems
  :telemetry.execute(
    [:jwks_cache, :stale_grace_period],
    %{age_seconds: age_seconds},
    %{
      partner_id: partner_id,
      kid: kid,
      severity: severity,
      cached_at: cached_at
    }
  )

  # Log actionable alert
  if severity in [:critical, :emergency] do
    Logger.log(severity, """
    JWKS CACHE STALE GRACE PERIOD ACTIVE
    Partner: #{partner_id}
    Key ID: #{kid}
    Cache age: #{format_duration(age_seconds)}
    Severity: #{severity}
    Last cached: #{format_timestamp(cached_at)}

    ACTION REQUIRED:
    1. Contact partner security team (out-of-band, use phone)
    2. Verify JWKS endpoint status (technical vs security incident)
    3. If key compromise confirmed: JWKSCache.emergency_purge("#{partner_id}", "your_name", "reason")
    4. Document in incident ticket
    """)
  end
end
Telemetry integration:
# In telemetry.ex - Connect to Datadog, New Relic, etc.
:telemetry.attach(
  "jwks-stale-alert",
  [:jwks_cache, :stale_grace_period],
  &handle_stale_cache_alert/4,
  nil
)

defp handle_stale_cache_alert(_event, measurements, metadata, _config) do
  if metadata.severity in [:critical, :emergency] do
    # Send to PagerDuty
    PagerDuty.trigger_incident(%{
      title: "JWKS cache stale for #{metadata.partner_id}",
      severity: metadata.severity,
      details: %{
        age_seconds: measurements.age_seconds,
        partner_id: metadata.partner_id,
        kid: metadata.kid
      }
    })

    # Send to Datadog
    Datadog.event(%{
      title: "JWKS Stale Grace Period",
      text: "Partner #{metadata.partner_id} cache is #{measurements.age_seconds}s old",
      alert_type: "error",
      tags: ["partner:#{metadata.partner_id}", "severity:#{metadata.severity}"]
    })
  end
end
Safeguard 2: Out-of-Band Verification
When alert fires, ops must verify the situation with the partner out-of-band (outside potentially compromised channels).
Incident Response Runbook:
- Receive alert: "JWKS cache stale for partner_abc (5 hours)"
- Check partner status page: Any announced maintenance or incidents?
- Contact partner security team:
- Method: Phone call (NOT email - email could be compromised)
- Use: Emergency contact list (pre-arranged, verified contacts)
- Verify identity: Challenge questions or callback to known number
- Ask these questions:
- "Are you experiencing JWKS endpoint issues?"
- "Are you aware of any ongoing maintenance or deployment?"
- "Have you detected any security incidents in the last 24 hours?"
- "Have you rotated or revoked any signing keys recently?"
- "Should we continue processing requests or halt?"
- Decision tree:
- Partner confirms technical issue (deployment, DNS, network, etc.)
- Continue processing
- Monitor situation
- Document in ticket
- Follow up in 2h
- Partner confirms security incident (key compromise, breach, attack)
- EMERGENCY PURGE
- Halt processing
- Executive alert
- Incident response
- Unable to reach partner (no answer, out of office, etc.)
- Escalate to senior
- Reduce limits 50%
- Keep trying every 30 minutes
Safeguard 3: Emergency Cache Purge
The emergency_purge/3 function provides a manual override for security incidents where a partner confirms private key compromise or security breach. When invoked, it immediately removes all cached keys for that partner. The next request triggers a fresh JWKS fetch, forcing validation against current keys.
The function requires senior ops approval and creates a complete audit trail including structured logs (for Security Information and Event Management (SIEM) ingestion), telemetry events (for monitoring alerts), and permanent compliance records with timestamp, operator, and business justification. If the partner's JWKS endpoint is still unavailable post-purge, requests fail closed with 401 (i.e.: security over availability). Multi-tenant isolation ensures other partners remain unaffected.
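As a rough sketch of the shape of that function (the demo app's implementation is linked above; the details here, including the telemetry event name, are assumptions), emergency_purge/3 boils down to deleting every cache entry for the partner and recording who did it and why:

# Minimal sketch - the real implementation adds authorization checks and richer error handling
def emergency_purge(partner_id, operator, reason) do
  # Delete cached keys plus the partner's metadata and rate limit entries, nothing else
  purged_count =
    :ets.select_delete(:jwks_cache, [
      {{{partner_id, :_}, :_, :_, :_}, [], [true]},      # cached keys
      {{{partner_id, :metadata}, :_}, [], [true]},       # fetch metadata
      {{{partner_id, :rate_limit}, :_, :_}, [], [true]}  # rate limit counter
    ])

  # Audit trail: structured log + permanent record (see Safeguard 4 below)
  log_cache_purge(partner_id, operator, reason, purged_count)

  :telemetry.execute(
    [:jwks_cache, :emergency_purge],
    %{purged_keys: purged_count},
    %{partner_id: partner_id, operator: operator}
  )

  {:ok, purged_count}
end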
Safeguard 4: Audit Trail
All emergency purges logged for regulatory compliance and post-incident review:
defp log_cache_purge(partner_id, operator, reason, purged_count) do
  # Structured log for SIEM ingestion
  Logger.warning(
    "JWKS cache emergency purge executed",
    event: "jwks_cache_purge",
    partner_id: partner_id,
    operator: operator,
    reason: reason,
    purged_keys: purged_count,
    timestamp: DateTime.utc_now() |> DateTime.to_iso8601()
  )

  # Also write to dedicated audit table
  Audit.log_operational_event(%{
    event_type: "jwks_cache_purge",
    partner_id: partner_id,
    operator: operator,
    reason: reason,
    metadata: %{purged_keys: purged_count}
  })
end
Audit trail captures:
- Timestamp (when purge executed)
- Operator identity (who executed it)
- Partner ID (which partner affected)
- Reason/justification (why it was necessary)
- Number of keys purged (scope of action)
- Incident reference (traceability to incident management system)
Supports:
- Regulatory compliance: SOX, PCI-DSS, PSD2 require audit of security actions
- Post-incident review: Timeline reconstruction for root cause analysis
- Accountability: Clear record of who made what decision when
- Pattern detection: Identify partners with frequent incidents
Additional Mitigations
Risk-Based Grace Periods
Vary grace period duration based on partner transaction volume:
defp get_grace_period(partner_id) do
  case get_partner_risk_profile(partner_id) do
    :high_volume -> 7_200     # 2 hours (€10M+ daily volume)
    :medium_volume -> 28_800  # 8 hours (€1M-10M daily)
    :low_volume -> 86_400     # 24 hours (< €1M daily)
  end
end
Rationale:
- High-volume partners: Shorter window (higher potential fraud amount, more monitoring)
- Low-volume partners: Longer window (lower risk, less need for 24/7 ops response)
- Trade-off: Balances security risk with operational burden
Reduced Limits During Grace Period
Automatically reduce transaction limits when using stale keys:
defp handle_stale_cache(partner_id, age_seconds, jwk) do
  # Reduce transaction limits during grace period
  if age_seconds > @ttl do
    RateLimiter.set_partner_limit(partner_id, :reduced)
    # Reduce limits to 50% of normal
    TransactionLimits.reduce_max_amount(partner_id, 0.5)
  end

  {:ok, jwk}
end
Effect:
- Reduces blast radius if compromise goes undetected
- Automatic restoration when fresh keys loaded
Anomaly Detection
Monitor transaction patterns even when signatures are valid:
- Unusual transaction amounts: Flag amounts >2 standard deviations from partner's average
- Velocity anomalies: Partner suddenly processing 10x normal transaction volume
- Geographic anomalies: Transactions from countries partner doesn't operate in
- Time-of-day anomalies: Transactions at 3 AM when partner normally has zero activity
- Beneficiary patterns: Same beneficiary account receiving many transactions
During grace period: Require manual approval for flagged transactions.
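A minimal sketch of wiring those checks into the grace-period path, assuming hypothetical AnomalyDetector, ManualReview, and RiskEngine modules (none of these are part of the demo app):

# Hypothetical: flag suspicious transactions for manual approval while running on stale keys
def authorize(transaction, partner_id, cache_state) do
  # Returns a list of flags such as :amount_outlier, :velocity_spike, :unusual_geography
  flags = AnomalyDetector.score(transaction, partner_id)

  case {cache_state, flags} do
    # Fresh keys, nothing flagged: straight-through processing
    {:fresh, []} ->
      {:ok, :approved}

    # Stale grace period and at least one flag: park for human review
    {:stale, [_ | _]} ->
      ManualReview.enqueue(transaction, flags)

    # Everything else goes through the normal risk rules
    _ ->
      RiskEngine.evaluate(transaction, flags)
  end
end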
Is This Over-Engineered?
For financial services: Absolutely not.
Why this is appropriate:
- Cost of failure is enormous
- Funds often unrecoverable once transferred
- Transactions are irreversible
- Payment authorizations can't easily be rolled back
- Money movement is permanent
- Cross-border transfers especially hard to reverse
- Regulatory requirements mandate it
- PSD2: Requires strong customer authentication and fraud monitoring
- SOX: Requires controls over financial reporting systems
- PCI-DSS: Requires monitoring of all access to systems
- GDPR: Requires security incident response capabilities
- Infrastructure already exists
- Financial institutions have 24/7 SOC (Security Operations Center)
- Ops teams already on-call for production incidents
- Partner contact lists already maintained for business continuity
- Incident management systems (PagerDuty, ServiceNow) already deployed
- SIEM systems (Splunk, Datadog) already ingesting logs
- Industry standard practice
- Banks have similar protocols for partner certificate revocation
- Payment networks (Visa, Mastercard) have similar incident response
- SWIFT has extensive security controls for messaging
For non-financial services: May be excessive. Consider:
- Transaction reversibility: Can you undo the action if fraud detected?
- Fraud impact: What's the worst-case financial loss?
- Regulatory requirements: Are you subject to financial services regulations?
- Operational maturity: Do you have 24/7 ops teams?
- Risk tolerance: What's your appetite for security vs availability trade-offs?
For a SaaS application with reversible actions and low fraud risk, a simple grace period without human-in-the-loop may suffice.
Implementation Checklist
P0 - Critical (implement before production):
- Telemetry events for stale cache entry
- Emergency purge function with authorization
- Basic audit logging for purge operations
- Incident response runbook documented
P1 - Important (implement within first month):
- Alert routing to ops team (PagerDuty, Datadog, etc.)
- Partner emergency contact list (verified phone numbers)
- Test emergency purge in staging environment
- Quarterly incident response drills
P2 - Enhancement (implement within first quarter):
- Risk-based grace periods by partner volume
- Reduced transaction limits during grace period
- Anomaly detection integration
- Automated partner outage detection (status page monitoring)
Ongoing:
- Review and update partner contact list monthly
- Test incident response quarterly (tabletop exercises)
- Review purge audit logs monthly for patterns
- Update runbook based on real incidents
Key Takeaways
- Grace periods are essential for financial services
- Blocking transactions due to technical hiccups costs real money
- Infrastructure outages are common and usually non-malicious
- Availability requirements often have SLA penalties
- But grace periods create security risk
- Compound attacks (key compromise + endpoint DoS) are realistic
- 24-hour attack window is too long for irreversible transactions
- Stale keys are trustworthy for technical issues, not security incidents
- Defense in depth is the answer
- Automated grace period for technical issues (common case)
- Human verification for potential security incidents (edge case)
- Manual override capability when compromise confirmed
- Not over-engineering for financial services
- Standard practice aligned with industry regulations
- Infrastructure and processes already exist
- Cost of implementing significantly lower than cost of one successful attack
- Test your incident response
- Runbooks only work if practiced
- Quarterly drills identify gaps
- Update procedures based on real incidents
The stale-while-revalidate pattern is excellent engineering when combined with operational discipline. The pattern handles the common case (temporary outages gracefully), while safeguards handle the edge case (active security incidents with manual override).
This is the balance between availability and security that financial services require: automatic grace for technical issues, human judgment for security incidents.
Operational Best Practices
Running JWKS in production requires operational discipline:
1. Monitor Key Usage by kid
Track which keys are actively used (a small counting sketch follows this list):
- Count verifications per kid (exposes when old keys can be safely removed)
- Alert on unknown kid values (potential attack or misconfiguration)
- Expose metrics endpoint showing usage distribution across keys
- Graph key usage over time to visualize rotation progress
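A small counting sketch, assuming the [:partner_signature, :verify] telemetry event shown earlier in this post (the :kid_usage table name is an illustrative choice):

# Count successful verifications per {partner_id, kid} in a dedicated ETS counter table
:ets.new(:kid_usage, [:set, :public, :named_table, write_concurrency: true])

:telemetry.attach(
  "kid-usage-counter",
  [:partner_signature, :verify],
  fn _event, _measurements, metadata, _config ->
    if metadata.result == :success do
      key = {metadata.partner_id, metadata.kid}
      # update_counter/4 with a default object creates the row on first use
      :ets.update_counter(:kid_usage, key, {2, 1}, {key, 0})
    end
  end,
  nil
)

Expose the table's contents on a metrics endpoint; a kid whose counter stops moving is a candidate for removal.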
2. Alert on Verification Failures
Monitor verification failure rates:
- Track success/failure ratio (alert if > 5% failure rate)
- Log detailed failure reasons (invalid_signature, key_not_found, expired)
- Spike in failures often indicates: rotation issues, clock drift, or attacks
- Integrate with alerting systems (PagerDuty, Slack, etc.)
3. Automate Rotation Schedule
Don't wait for incidents - rotate proactively:
- Schedule rotations every 90 days
- Automate the rotation process:
- Phase 1: Generate new key pair
- Phase 2: Publish new key → wait 15 minutes (cache propagation)
- Phase 3: Switch signing → wait for token expiry (typically 24 hours)
- Phase 4: Remove old key
- Log each phase completion for audit trail
- Test rotation in staging first
4. Emergency Rotation Runbook
Have a documented procedure for emergency rotation:
EMERGENCY KEY ROTATION PROCEDURE
Scenario: Private key compromised, must rotate immediately
1. Generate new key pair
$ openssl ecparam -name prime256v1 -genkey -noout -out new-private-key.pem
$ openssl ec -in new-private-key.pem -pubout -out new-public-key.pem
2. Add new key to JWKS (Phase 2)
- kid: emergency-2024-12-05-001
- Deploy to JWKS endpoint
- Verify: curl https://auth.yourcompany.com/.well-known/jwks.json
3. Wait 15 minutes (cache propagation)
4. Switch signing to new key (Phase 3)
- Update signing service configuration
- kid: emergency-2024-12-05-001
- Deploy
5. Remove compromised key from JWKS (Phase 4)
- DO NOT wait for token expiration in emergency
- All tokens signed with compromised key will fail
- Acceptable trade-off vs keeping compromised key active
6. Notify affected parties
- Issuers may need to re-authenticate
- Document incident for post-mortem
Expected Impact:
- Tokens signed with compromised key: Invalid (immediate)
- Users: Must re-authenticate
- Downtime: ~0 seconds (but users will see auth errors until re-auth)
Security Considerations
JWKS improves operational security, but you must protect it:
DO:
- Serve over HTTPS only - Man-in-the-middle attacks on JWKS mean game over
- Set appropriate caching headers - Cache-Control: public, max-age=600 (a serving sketch follows this list)
- Rate limit the endpoint - Protect against DoS (see previous post)
- Monitor JWKS fetches - Unusual spikes might indicate attack
- Version your kid values - Makes rotation tracking easier
- Log key usage by kid - Identify when old keys can be removed
- Implement grace period safeguards - Alerting and emergency purge capabilities (covered above)
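A minimal serving sketch, assuming a Phoenix app (the controller module and public_jwks/0 helper are hypothetical, not the demo app's code):

# Hypothetical Phoenix controller serving the JWKS document with caching headers
defmodule JWSDemoWeb.JWKSController do
  use JWSDemoWeb, :controller

  def show(conn, _params) do
    conn
    |> put_resp_header("cache-control", "public, max-age=600")
    |> json(%{"keys" => JWSDemo.JWS.public_jwks()}) # public keys only, never private material
  end
end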
DON'T:
- Never include private keys in JWKS - Only public keys, ever
- Never disable HTTPS verification - Even in development
- Never trust kid from untrusted tokens - Validate against your JWKS
- Never forget to remove old keys - Keep JWKS lean (fewer than 5 active keys)
- Never embed JWKS URLs in JWTs - Attackers could point to their own JWKS
- Never rely solely on grace period - Must have human-in-the-loop verification for security incidents
Replay Attack Prevention: The jti Claim
JWKS verification proves a signature is valid, but it doesn't prevent replay attacks: a malicious actor intercepts a valid signed authorization and replays it hours later to initiate a duplicate payment. Even though the signature is valid (it's a real signed instruction), replaying it is an attack.
Timestamps (exp) provide time-based protection, but you need per-request uniqueness to detect replays within the expiration window.
Solution: Use jti (JWT ID) claims
When signing a payment instruction, include a unique jti value (typically a UUID) alongside your business identifier (instruction_id) and timestamps (iat, exp). During verification, check if the jti has been seen before. If yes, reject as a replay attack. If no, store the jti in your audit log or a dedicated cache and proceed with verification.
Why instruction_id isn't enough: Partners might legitimately retry a failed authorization with the same instruction_id after a network timeout or temporary error. But they should generate a new jti for each attempt. This lets you distinguish between legitimate retries (different jti, same instruction_id) and replay attacks (duplicate jti).
Implementation considerations:
- Store jti values in your audit log for permanent record-keeping, or use a dedicated replay prevention cache (Redis with TTL) for high-throughput systems
- Clean up expired jti values based on the exp claim - no need to store them forever
- Include jti validation in your verification pipeline before accepting the request
- Coordinate with partners on jti generation (they must generate fresh values for retries)
The demo application generates jti values when signing authorization requests and stores them in the audit log schema. However, it intentionally does not implement replay detection (checking for duplicate jti values) to keep the example focused on JWS signature verification. Production systems should add uniqueness validation before processing requests.
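A minimal sketch of what that uniqueness check could look like, assuming a Postgres-backed audit table with a unique index on jti (the SeenJti schema and both function names are illustrative, not the demo app's code):

# Hypothetical replay check: rely on a unique index on jti, so concurrent duplicates
# are rejected by the database rather than by a race-prone read-then-write
def check_and_record_jti(%{"jti" => jti, "exp" => exp} = claims) do
  changeset = SeenJti.changeset(%SeenJti{}, %{jti: jti, expires_at: DateTime.from_unix!(exp)})

  case Repo.insert(changeset) do
    {:ok, _record} ->
      {:ok, claims}

    # unique_constraint(:jti) in the changeset turns the duplicate-key DB error into a changeset error
    {:error, _changeset} ->
      {:error, :replay_detected}
  end
end

# Periodic cleanup: rows whose exp has passed can be deleted
def prune_expired_jtis do
  import Ecto.Query
  Repo.delete_all(from s in SeenJti, where: s.expires_at < ^DateTime.utc_now())
end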
Testing, Troubleshooting & Disaster Recovery
Production systems require rigorous testing, clear debugging procedures, and rapid recovery capabilities. This section covers production-specific concerns for multi-tenant JWKS caching.
Testing Production JWKS Systems
Production-Specific Test Scenarios
For detailed unit testing, contract testing with partners, and chaos engineering strategies, see the Testing, Migration & Troubleshooting guide. That post covers comprehensive test scenarios including signature verification, algorithm attacks, clock skew, and audit trail testing.
This section focuses on production-specific concerns for multi-tenant JWKS caching with 200+ partners:
- Load testing cache performance: 1000 req/s across 200 partners, measuring cache hit ratio and latency percentiles
- Testing DoS protection triggers: Circuit breaker activation, rate limiting thresholds, debounce window behavior
- Cache warming on deployment: Verify 200 partners load in < 30 seconds with graceful handling of endpoint failures
- Multi-partner failover scenarios: Simulate 50+ partners' JWKS endpoints down, verify stale-while-revalidate behavior
- Telemetry integration: Validate all telemetry events fire with correct metadata and measurements
Critical Test Cases for Multi-Tenant Caching
Test coverage in example app:
- Cache warming tests here
- Circuit breaker testing here
- Rate limiting and debouncing tests here
- Emergency purge tests here
Example: Testing circuit breaker opens after threshold:
# From: test/jws_demo/jws/jwks_cache_test.exs
test "circuit breaker opens after 5 consecutive unknown kids" do
  partner_id = "test_partner"

  # Trigger 5 consecutive unknown kid failures
  for i <- 1..5 do
    {:error, :kid_not_found_in_jwks} =
      JWKSCache.get_key(partner_id, "unknown-kid-#{i}")
  end

  # 6th attempt should hit circuit breaker
  assert {:error, :circuit_breaker_open} =
           JWKSCache.get_key(partner_id, "unknown-kid-6")

  # Verify telemetry event fired
  assert_received {:telemetry_event,
                   [:jwks_cache, :circuit_breaker_open],
                   %{consecutive_unknown_kids: 5},
                   %{partner_id: "test_partner"}}
end
This test validates the circuit breaker threshold (@circuit_breaker_threshold 5 in the implementation) and confirms telemetry events fire for monitoring integration.
Load Testing Recommendations
Performance targets for production scale:
| Scenario | Target (p99) | Measurement |
|---|---|---|
| 200 partners, cache hit | < 5ms | ETS lookup + verification |
| 200 partners, cache miss | < 100ms | HTTP fetch + parse + cache |
| Circuit breaker activation | < 10ms | Counter check + log |
| Cache warming (200 partners) | < 30s | Parallel fetch with max_concurrency: 50 |
Use load testing tools like k6 or Locust for multi-partner scenarios. Simulate realistic traffic patterns: 80% cache hits, 15% cache misses (new kids during rotation), 5% unknown kids (attack simulation).
Troubleshooting Production Issues
For complete debugging protocols, common error patterns, and step-by-step troubleshooting procedures (signature verification failures, timestamp issues, algorithm mismatches), see the Testing, Migration & Troubleshooting guide. That post provides a comprehensive debugging protocol with manual verification steps using OpenSSL.
This section covers production-specific debugging challenges unique to multi-tenant JWKS caching:
Production-Specific Debugging
When managing 200+ partners, debugging becomes more complex. Common production scenarios:
- Partner isolation: Which of 200 partners is causing cache failures? Check partner-specific telemetry metrics.
- Circuit breaker state: Why is the circuit breaker open for "partner_abc"? Inspect the consecutive_unknown_kids counter.
- Cache hit ratio drop: Sudden decrease from 95% to 60% - what changed? Check for partner key rotations or JWKS endpoint changes.
- Rotation confusion: Partner rotated keys but requests failing - old vs new kid mismatch. Verify JWKS fetch succeeded and new kid cached.
- Stale cache not refreshing: Grace period active but background refresh failing silently. Check graduated alerting logs.
Using Telemetry to Diagnose Issues
Attach runtime telemetry handlers to debug cache issues in production without redeployment:
# Attach telemetry handler to debug cache issues in production
:telemetry.attach(
  "debug-cache-issues",
  [:jwks_cache, :get_key],
  fn _event, measurements, metadata, _config ->
    Logger.info("""
    Cache lookup:
      Partner: #{metadata.partner_id}
      Kid: #{metadata.kid}
      Result: #{metadata.result}
      Duration: #{measurements.duration}μs
      Cache age: #{metadata.cache_age}s
      Cache state: #{metadata.cache_state}
    """)
  end,
  nil
)

# Attach to DoS protection events
:telemetry.attach_many(
  "debug-dos-protection",
  [
    [:jwks_cache, :circuit_breaker_open],
    [:jwks_cache, :rate_limit_exceeded],
    [:jwks_cache, :unknown_kid_rejected]
  ],
  fn event, measurements, metadata, _config ->
    Logger.warning("DoS protection triggered: #{inspect(event)}",
      measurements: measurements,
      metadata: metadata
    )
  end,
  nil
)
These handlers provide real-time visibility into cache behavior and DoS protection activations without modifying code or increasing log verbosity globally.
Quick Diagnostic Checklist
When debugging partner-specific JWKS issues:
- Check telemetry dashboards for partner-specific metrics: cache hit ratio, fetch success rate, verification failures
- Inspect ETS cache state: :ets.match_object(:jwks_cache, {{"partner_abc", :_}, :_, :_, :_})
- Verify partner JWKS endpoint: curl https://partner-domain/.well-known/jwks.json - is it accessible? Returns valid JSON?
- Check circuit breaker state: Look for the {partner_id, :metadata} entry in ETS, inspect the consecutive_unknown_kids field
- Review last fetch timestamp: Check metadata.last_fetch_at - when did we last successfully fetch JWKS?
- Examine rate limit counters: Check the {partner_id, :rate_limit} entry for unknown kid attempts
- Check graduated alerting logs: Search logs for WARNING → ERROR → CRITICAL → EMERGENCY severity progression
- Validate cached kid values: List all kids for the partner, compare with the current JWKS endpoint response
Example: Diagnosing why partner_abc requests failing:
# 1. Check what kids we have cached
iex> :ets.tab2list(:jwks_cache)
     |> Enum.filter(fn
       {{"partner_abc", _kid}, _jwk, _cached_at, _ttl} -> true
       _ -> false
     end)
[
  {{"partner_abc", "2024-12-key"}, %JOSE.JWK{}, 1704470400, 900},
  {{"partner_abc", "2025-01-key"}, %JOSE.JWK{}, 1704556800, 900}
]

# 2. Fetch current JWKS to compare
iex> {:ok, resp} = Req.get("https://partner-abc.example/.well-known/jwks.json")
iex> resp.body["keys"] |> Enum.map(& &1["kid"])
["2025-01-key", "2025-02-key"] # New key! Old 2024-12-key removed

# 3. Check if circuit breaker open
iex> :ets.lookup(:jwks_cache, {"partner_abc", :metadata})
[{{"partner_abc", :metadata}, %{consecutive_unknown_kids: 5, ...}}]
# Circuit breaker threshold reached!

# 4. Diagnosis: Partner rotated keys, removed old key before our cache expired.
#    Requests with kid=2024-12-key triggered unknown kid, hit circuit breaker.
#    Solution: Emergency purge or wait for circuit breaker timeout (60s)
This debugging session reveals the partner removed an old key before our cache TTL expired, triggering circuit breaker activation. Solution: wait 60 seconds for automatic reset, or perform emergency purge to force immediate refresh.
Disaster Recovery & Business Continuity
Production systems must recover quickly from failures. This section covers cache warming, ETS recovery, and emergency procedures specific to multi-tenant JWKS caching.
Cache Warming on Startup
When nodes restart (deployment, crash, scaling event), the ETS cache is empty. Cache warming prevents a thundering herd of JWKS fetches when the first requests arrive for 200 partners simultaneously.
Call the warm_cache/0 function from Application.start/2 or as a supervised task in your supervision tree. The function uses Task.async_stream with max_concurrency: 50 to parallelize JWKS fetches while limiting concurrent HTTP connections.
ETS Table Recovery
ETS tables are in-memory and lost when the owning GenServer process crashes. Recovery strategies:
- Supervised GenServer: Automatic restart, then warm_cache/0 restores state
- Graceful degradation: Stale-while-revalidate pattern handles missing cache (fetches on demand)
- Persistent fallback (optional): Store last-known-good JWKS in database for critical partners
Supervision tree with cache warming:
defmodule JWSDemo.Application do
  def start(_type, _args) do
    children = [
      JWSDemo.Repo,
      {JWSDemo.JWS.JWKSCache, []},                          # Starts GenServer, creates ETS
      {Task, fn -> JWSDemo.JWS.JWKSCache.warm_cache() end}  # Warm after startup
    ]

    opts = [strategy: :one_for_one, name: JWSDemo.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
GenServer crashes are rare but possible (OOM, ETS corruption, unhandled exception). When it happens:
- Supervision tree detects crash and restarts GenServer
- GenServer init/1 creates fresh ETS table
- Supervised Task calls warm_cache/0
Mass Partner Outage Recovery
What if 50+ partners' JWKS endpoints go down simultaneously (cloud provider outage, Border Gateway Protocol incident, regional datacenter failure)?
Automatic recovery procedure:
- Grace period activates: Stale cache serves requests (24-hour window), preventing legitimate transaction rejections
- Background refresh retries: Each partner's background refresh task retries with exponential backoff
- Graduated alerts fire: Ops team notified at WARNING → ERROR → CRITICAL severity levels
- Automatic recovery: When partner endpoints return, background refresh succeeds and updates cache
- No manual intervention required: System self-heals when connectivity restored
During the outage, requests continue processing with stale keys (risk accepted for availability). Graduated alerting ensures human oversight for security decisions. See Grace Period Security section for complete safeguards.
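A minimal sketch of that retry behaviour, assuming the refresh_jwks/1 function from the cache shown earlier (the backoff schedule is an illustrative choice, not the demo app's exact values):

# Hypothetical background refresh with exponential backoff, capped at 15 minutes between attempts
defp refresh_with_backoff(partner_id, attempt \\ 0) do
  case refresh_jwks(partner_id) do
    :ok ->
      :ok

    {:error, reason} ->
      delay = min(:timer.seconds(30) * Integer.pow(2, attempt), :timer.minutes(15))
      Logger.warning("JWKS refresh failed for #{partner_id}: #{inspect(reason)}, retrying in #{delay}ms")
      Process.sleep(delay)
      refresh_with_backoff(partner_id, attempt + 1)
  end
end

Run inside the spawned refresh process, this keeps retrying until the endpoint recovers; the graduated alerts described in the Grace Period Security section provide the human escalation path if it never does.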
Emergency Cache Purge (Security Incident)
Already covered in Safeguard 3: Emergency Cache Purge in the Grace Period Security section. Quick reference for disaster recovery context:
# Immediate purge during security incident
JWKSCache.emergency_purge(
"partner_abc",
"ops.alice@example.com",
"INC-2025-001: Partner confirmed private key compromise via phone"
)
# Result:
# - All cached keys for partner_abc removed from ETS
# - Next request triggers fresh JWKS fetch
# - Audit log entry created (regulatory compliance)
# - Telemetry event emitted to monitoring systems
Recovery Time Objectives (RTO)
Production systems require clear recovery targets:
| Failure Scenario | RTO | Recovery Mechanism |
|---|---|---|
| Single partner JWKS down | 0s | Stale cache serves requests immediately |
| Node restart (deployment) | 30s | Supervised restart + cache warming (50 parallel) |
| ETS corruption/crash | 30s | GenServer restart + cache warming |
| Mass partner outage (50+) | 0s | Grace period + background refresh retry |
| Circuit breaker open | 60s | Automatic reset on successful cache hit |
| Security incident (key compromise) | 5min | Manual emergency purge + out-of-band verification |
These RTOs balance availability (minimize downtime) with security (verify incidents before resuming). The 0s RTO for common failures (partner outages) reflects the business priority of not blocking legitimate transactions during temporary infrastructure issues.
Recovery Testing
Test disaster recovery procedures quarterly to verify RTOs and identify gaps:
- Node crash simulation: Kill BEAM process in staging (kill -9 $PID), verify cache warming completes in < 30s
- Verify warm cache telemetry: Check [:jwks_cache, :warm_cache_complete] event fires with success_count/total metrics
- Background refresh retry: Confirm exponential backoff logged, automatic recovery when firewall removed
- Security incident drill: Trigger emergency purge, verify audit trail created, cache cleared, telemetry fired
- Circuit breaker reset: Trigger circuit breaker (5 consecutive unknown kids), wait 60s, verify automatic reset on successful fetch
- Measure actual vs target RTO: Document deltas, update procedures or targets as needed
Document test results in your operational runbook. Update recovery procedures based on findings. Schedule next drill in calendar (quarterly cadence ensures procedures stay fresh).
Wrapping Up
Production JWKS is about more than just publishing keys and caching them. It's about:
- Multi-tenancy: Managing 200+ partners with different configurations, endpoints, and SLAs
- Resilience: Graceful degradation when partner endpoints fail, with stale-while-revalidate caching
- Security: Protecting the grace period from compound attacks with graduated alerting and emergency purge capabilities
- Operations: Monitoring, runbooks, incident response, and regular key rotation automation
- Testing & Recovery: Load testing at scale, disaster recovery procedures, and 30-second cache warming for rapid failover
Key takeaways from this post:
- Multi-tenant architecture requires per-partner configuration - Different algorithms, TTLs, clock skew tolerance
- Stale-while-revalidate is essential for availability - But requires security safeguards for financial services
- Grace periods need human-in-the-loop verification - Automatic for technical issues, manual override for security incidents
- Emergency purge is not over-engineering - It's standard practice for financial services handling irreversible transactions
- Monitoring and alerting are critical - You need visibility into all 200 partners' key health
- Circuit breakers and rate limiting protect you - One partner's outage shouldn't impact others
- Disaster recovery enables rapid restoration - Cache warming, supervised restart, and graceful degradation minimize downtime to 30 seconds or less
From capability to maturity:
The previous post gave you the capability to implement JWKS. This post gave you the operational maturity to run it in production at scale. The difference shows up during:
- Incidents: Do you wake up at 3 AM to manually fix key issues? Or do automated safeguards handle common failures while alerting on anomalies?
- Audits: Can you prove you have incident response procedures? Can you show the audit trail of emergency cache purges?
- Partner onboarding: Does adding a new partner require code changes and deployments? Or database configuration?
- Security reviews: Can you explain how you protect against compound attacks? Do you have runbooks?
Production checklist:
Before going live with multi-tenant JWKS in financial services:
- Per-partner configuration with allowed algorithms and clock skew
- ETS-backed caching for microsecond lookups
- Stale-while-revalidate with 24-hour grace period
- Graduated alerting (WARNING → ERROR → CRITICAL → EMERGENCY)
- Emergency purge function with audit logging
- Out-of-band partner contact list
- Incident response runbook documented and tested
- Telemetry integration (Datadog, PagerDuty, etc.)
- Per-partner circuit breakers
- Database indexes on partner lookups
- Monitoring dashboards for all partners
- Quarterly incident response drills
What's next:
You now have JWKS running at scale with proper security and operational safeguards. But signatures only prove a message was sent. How do you store that proof so it's still valid years later when a regulator asks?
Our next focus will be on the long game: building audit trails that survive disputes years later. You'll learn how to store JWS signatures for regulatory compliance, handle canonicalization and replay attacks, deal with clock skew, and create the "forever proof" that wins disputes and satisfies auditors.
This article is part of a series on understanding the hows and whys of JSON Web Signatures (JWS).
There's accompanying code: it's referred to and linked throughout the content. But if you'd rather just read raw code, head over here.