Circuit Breaker Pattern: State Machine Simulator with Threshold Tuning

May 25, 2026 · 16 min read

When a downstream service fails, the worst thing your application can do is keep calling it. Every failed request consumes a thread, a connection, and time that could serve a healthy request. Failed requests pile up, threads exhaust, connection pools drain, and your application cascades into a full outage triggered by a single unhealthy dependency. The circuit breaker pattern solves this by detecting failure accumulation and short-circuiting requests to the failing service, returning an immediate fallback response instead of waiting for a timeout.

This guide provides an interactive state machine simulator that lets you configure the three critical thresholds (failure threshold, success threshold, and timeout duration), simulate requests with a configurable failure rate, and watch the circuit breaker transition between its three states in real time. Below the simulator, you will find a detailed comparison of circuit breaker vs retry vs bulkhead patterns, production-ready code examples, and a threshold tuning guide based on real-world failure scenarios.

Interactive Circuit Breaker Simulator

Configure the circuit breaker thresholds below. Set the failure threshold (how many consecutive failures before the circuit opens), success threshold (how many consecutive successes in Half-Open before the circuit closes), and timeout duration (how long the circuit stays open before testing recovery). Then use the buttons to simulate successful and failed requests, or run an automated simulation with a configurable failure rate.

The Three States Explained

The circuit breaker is a state machine with three states. Each state has a specific purpose in the failure detection and recovery cycle, and the transitions between states are governed by the thresholds you configure.

Closed (Normal Operation)

The Closed state is normal operation. All requests pass through to the downstream service. The circuit breaker monitors each response and counts consecutive failures. As long as requests succeed or failures stay below the threshold, the breaker remains closed. When the failure count reaches the configured threshold (for example, 5 consecutive failures), the breaker transitions to the Open state. The failure counter resets whenever a successful response is received, so intermittent failures that resolve quickly do not trip the breaker.

Open (Circuit Tripped)

The Open state is the protection mode. All requests are immediately rejected without calling the downstream service. Instead of waiting for a timeout on a service that is known to be failing, the breaker returns a fallback response instantly. This saves resources (threads, connections, CPU), prevents timeout accumulation, and gives the downstream service time to recover without being hammered by retry traffic. The breaker stays open for the configured timeout duration (for example, 30 seconds), then transitions to Half-Open.

Half-Open (Recovery Testing)

The Half-Open state is the recovery probe. A limited number of test requests are allowed through to check if the downstream service has recovered. If the configured number of consecutive successes is reached (for example, 3 in a row), the breaker transitions back to Closed and normal traffic resumes. If any test request fails, the breaker immediately transitions back to Open for another timeout period. This prevents premature recovery: a service that is still flapping will fail the Half-Open test and stay protected.

State      Requests              Transitions To  Trigger
Closed     All pass through      Open            Failure threshold reached
Open       All rejected          Half-Open       Timeout expires
Half-Open  Limited test traffic  Closed or Open  Success threshold (close) or any failure (open)
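
The transition table above can also be read as a pure function from (state, event) to next state. The following is a minimal sketch under that reading, not the simulator's actual source; the `nextState` helper and the event names `success`, `failure`, and `timeoutElapsed` are hypothetical.

```javascript
// Hypothetical sketch of the transition table as a pure function.
// `breaker` holds the state plus two counters; `config` holds the thresholds.
function nextState(breaker, event, config) {
  const { state, failures, successes } = breaker;

  if (state === 'CLOSED') {
    if (event === 'success') return { state: 'CLOSED', failures: 0, successes: 0 };
    if (event === 'failure') {
      const count = failures + 1;
      return count >= config.failureThreshold
        ? { state: 'OPEN', failures: 0, successes: 0 }
        : { state: 'CLOSED', failures: count, successes: 0 };
    }
  }

  if (state === 'OPEN') {
    // Requests are rejected while open; only the timer moves the breaker on.
    if (event === 'timeoutElapsed') return { state: 'HALF_OPEN', failures: 0, successes: 0 };
  }

  if (state === 'HALF_OPEN') {
    // Any failed probe reopens the circuit.
    if (event === 'failure') return { state: 'OPEN', failures: 0, successes: 0 };
    if (event === 'success') {
      const count = successes + 1;
      return count >= config.successThreshold
        ? { state: 'CLOSED', failures: 0, successes: 0 }
        : { state: 'HALF_OPEN', failures: 0, successes: count };
    }
  }

  // Any other event leaves the breaker unchanged.
  return breaker;
}

// Example: five failures trip the breaker, the timer expires, three probes succeed.
const config = { failureThreshold: 5, successThreshold: 3 };
const events = [
  'failure', 'failure', 'failure', 'failure', 'failure',
  'timeoutElapsed', 'success', 'success', 'success',
];
let breakerState = { state: 'CLOSED', failures: 0, successes: 0 };
const trace = [];
for (const event of events) {
  breakerState = nextState(breakerState, event, config);
  trace.push(breakerState.state);
}
console.log(trace.join(' -> '));
```

Keeping the transitions in one pure function makes each row of the table easy to unit test without timers or network calls.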

Threshold Tuning Guide

The three thresholds determine how sensitive your circuit breaker is to failures and how quickly it recovers. Setting them too low causes false trips on transient errors. Setting them too high lets cascading failures propagate before the breaker activates. Here are guidelines based on service characteristics.

Service Type            Failure Threshold  Success Threshold  Timeout
Fast REST API (<100ms)  5 failures         3 successes        15-30s
Database connection     3 failures         2 successes        30-60s
Third-party API         5-10 failures      3 successes        60-120s
Message queue           3 failures         2 successes        30s
Serverless function     10 failures        5 successes        15s
Legacy system (slow)    3 failures         3 successes        120-300s

The failure threshold should account for your service's normal error rate. If the service returns occasional 500 errors during deployments (1-2 per minute), set the threshold higher than that baseline to avoid false trips. A rolling window approach (50% failure rate over 10 seconds) is often better than consecutive failure counting for services with variable error rates.
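
As an illustration of the rolling window approach, here is a minimal sketch of a time-based window that reports a failure rate. The `RollingWindow` class and its options are hypothetical, and the `minSamples` guard is an added assumption so that one failure out of one call does not count as a 100% failure rate.

```javascript
// Sketch of a time-based rolling window; all names here are illustrative.
class RollingWindow {
  constructor({ windowMs = 10000, minSamples = 10, now = Date.now } = {}) {
    this.windowMs = windowMs;     // how far back to look
    this.minSamples = minSamples; // assumption: ignore windows with too few calls
    this.now = now;               // injectable clock, handy for tests
    this.samples = [];            // entries of { time, ok }
  }

  record(ok) {
    this.samples.push({ time: this.now(), ok });
    this._prune();
  }

  // Drop samples that have aged out of the window.
  _prune() {
    const cutoff = this.now() - this.windowMs;
    while (this.samples.length > 0 && this.samples[0].time <= cutoff) {
      this.samples.shift();
    }
  }

  failureRate() {
    this._prune();
    if (this.samples.length === 0) return 0;
    const failed = this.samples.filter((s) => !s.ok).length;
    return failed / this.samples.length;
  }

  // True when enough calls were seen and the failure rate meets the limit.
  shouldTrip(rateThreshold = 0.5) {
    this._prune();
    return this.samples.length >= this.minSamples && this.failureRate() >= rateThreshold;
  }
}

// Example with a fake clock: 20 calls, every fifth one failing (20% failure rate).
let fakeTime = 0;
const win = new RollingWindow({ windowMs: 10000, minSamples: 10, now: () => fakeTime });
for (let i = 0; i < 20; i++) {
  fakeTime += 100;
  win.record(i % 5 !== 0);
}
console.log(win.failureRate(), win.shouldTrip()); // 0.2 false
```

A breaker built on this would call `record` after every request and open when `shouldTrip` returns true, instead of comparing a consecutive-failure counter against a threshold.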

The timeout duration should match how long the downstream service takes to recover. Database connection pool exhaustion typically recovers in 10-30 seconds. A crashed container takes 30-60 seconds to restart. A third-party API outage may last minutes to hours. Start with 30 seconds and adjust based on observed recovery times in your monitoring data.

The success threshold prevents premature closure. A service that returns one success followed by five failures would oscillate between states without a success threshold. Requiring 3 consecutive successes in Half-Open ensures the service is genuinely recovered, not just serving a cached response from a partial restart.

Circuit Breaker vs Retry vs Bulkhead

Three resilience patterns dominate distributed systems design. Each solves a different failure mode, and production systems typically combine all three. Understanding when to apply each pattern and how they interact is critical for building truly resilient services.

Pattern Comparison

Circuit Breaker

Purpose: Failure detection
Protects against: Cascading failure
Mechanism: Stop requests
Scope: Per-dependency
Recovery: Automatic (probe)
Overhead: Minimal
Use when: Service outages

Retry with Backoff

Purpose: Transient recovery
Protects against: Brief glitches
Mechanism: Repeat requests
Scope: Per-request
Recovery: Immediate retry
Overhead: 2-4x requests
Use when: Network blips

Bulkhead

Purpose: Resource isolation
Protects against: Resource exhaustion
Mechanism: Pool limits
Scope: Per-service pool
Recovery: Inherent (bounded)
Overhead: Memory for pools
Use when: Multi-dependency

The correct combination for most production services is: bulkhead per dependency to isolate resource pools, retry with exponential backoff inside each pool for transient failures (2-3 retries, max 5 seconds), and circuit breaker wrapping the retry logic to detect when retries are consistently failing and the dependency is down. The circuit breaker trips when retries are exhausted N times in a row, preventing retry amplification from overwhelming a recovering service.

Request flow with all three patterns:

  Client Request
    |
    v
  [Bulkhead] -- rejects if pool exhausted (503)
    |
    v
  [Circuit Breaker] -- rejects if open (fallback)
    |
    v
  [Retry w/ Backoff] -- retries 2-3x on failure
    |
    v
  Downstream Service
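
The layering in the diagram could be wired up roughly as follows. This is a sketch under stated assumptions: `Bulkhead`, `retryWithBackoff`, and `guardedCall` are hypothetical helpers, the breaker is any object with an async `call(fn)` method, and "max 5 seconds" is read here as a cap on the backoff delay.

```javascript
// Sketch only: Bulkhead, retryWithBackoff, and guardedCall are hypothetical helpers.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Bulkhead: caps concurrent calls to one dependency and rejects the overflow.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
  }

  async run(fn) {
    if (this.active >= this.maxConcurrent) {
      throw new Error('Bulkhead full');
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
    }
  }
}

// Retry: re-runs fn with exponential backoff, then rethrows the last error.
async function retryWithBackoff(fn, { retries = 2, baseDelayMs = 100, maxDelayMs = 5000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        await sleep(Math.min(baseDelayMs * 2 ** attempt, maxDelayMs));
      }
    }
  }
  throw lastError;
}

// Order matches the diagram: bulkhead outside, breaker in the middle, retry inside.
// `breaker` is any object with an async call(fn) method.
function guardedCall(bulkhead, breaker, fn, retryOptions) {
  return bulkhead.run(() => breaker.call(() => retryWithBackoff(fn, retryOptions)));
}
```

Because the breaker wraps the retry loop, it records one failure per exhausted retry sequence rather than one per attempt, which is what lets it trip when retries are exhausted several times in a row.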

Production Implementation

Here is a compact circuit breaker implementation in JavaScript that you can use as a starting point. It handles the state machine, threshold counting, timeout management, and event emission for monitoring integration. The implementation counts consecutive failures, matching the simulator above; for services with variable error rates, a rolling window for failure counting is more robust.

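class CircuitBreaker {
  constructor(options = {}) {
    // ?? falls back only when an option is missing, so explicit values are kept
    this.failureThreshold = options.failureThreshold ?? 5; // consecutive failures before opening
    this.successThreshold = options.successThreshold ?? 3; // consecutive Half-Open successes before closing
    this.timeout = options.timeout ?? 30000;               // ms to stay OPEN before probing
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.successCount = 0;
    this.lastFailureTime = null;
    this.listeners = {};
  }

  // Runs fn through the breaker; rejects immediately while the circuit is OPEN.
  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime >= this.timeout) {
        // Timeout elapsed: let this request through as a recovery probe
        this._transition('HALF_OPEN');
      } else {
        this._emit('rejected');
        throw new Error('Circuit is OPEN');
      }
    }

    try {
      const result = await fn();
      this._onSuccess();
      return result;
    } catch (error) {
      this._onFailure();
      throw error;
    }
  }

  _onSuccess() {
    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      if (this.successCount >= this.successThreshold) {
        this._transition('CLOSED');
      }
    }
    // Any success breaks the run of consecutive failures
    this.failureCount = 0;
    this._emit('success');
  }

  _onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    this.successCount = 0;

    // Any failure while probing reopens the circuit immediately;
    // in CLOSED the circuit opens once the threshold is reached.
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      this._transition('OPEN');
    }
    this._emit('failure');
  }

  _transition(newState) {
    const prev = this.state;
    // In-flight requests can fail after the circuit has already opened;
    // skip the duplicate transition so listeners see each change once.
    if (prev === newState) return;
    this.state = newState;
    if (newState === 'CLOSED') {
      this.failureCount = 0;
      this.successCount = 0;
    }
    if (newState === 'HALF_OPEN') {
      this.successCount = 0;
    }
    this._emit('stateChange', { from: prev, to: newState });
  }

  // Registers a listener for 'stateChange', 'success', 'failure', or 'rejected'.
  on(event, fn) {
    (this.listeners[event] = this.listeners[event] || []).push(fn);
  }

  _emit(event, data) {
    (this.listeners[event] || []).forEach(fn => fn(data));
  }
}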

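// Usage
const breaker = new CircuitBreaker({
  failureThreshold: 5,
  successThreshold: 3,
  timeout: 30000
});

// Matches the gauge encoding used in the monitoring section
const STATE_VALUES = { CLOSED: 0, OPEN: 1, HALF_OPEN: 2 };

breaker.on('stateChange', ({ from, to }) => {
  console.log(`Circuit: ${from} -> ${to}`);
  // `metrics` stands for your metrics client; it is not defined here
  metrics.gauge('circuit_breaker_state', STATE_VALUES[to]);
});

let response;
try {
  response = await breaker.call(async () => {
    const res = await fetch('/api/users');
    // fetch rejects only on network errors, so count HTTP error statuses as failures too
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return res;
  });
} catch (e) {
  // Return fallback or cached data
}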

Common Anti-Patterns

The circuit breaker pattern seems simple, but implementation mistakes are common. Here are the anti-patterns that cause the most production incidents.

Sharing a circuit breaker across unrelated services. If your API calls both a payment service and a notification service, they must have separate circuit breakers. A single shared breaker means notification failures could trip the circuit and block payment processing. Each downstream dependency gets its own breaker with its own thresholds tuned to that service's characteristics.
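
One way to keep breakers separate is a small registry keyed by dependency name. This is a hypothetical sketch: `BreakerRegistry` is not part of the implementation above, the threshold numbers are placeholders rather than recommendations, and a plain object stands in for the breaker so the example runs on its own.

```javascript
// Hypothetical registry: one breaker per dependency name, created on first use.
class BreakerRegistry {
  constructor(createBreaker) {
    this.createBreaker = createBreaker; // factory: (name) => breaker
    this.breakers = new Map();
  }

  get(name) {
    if (!this.breakers.has(name)) {
      this.breakers.set(name, this.createBreaker(name));
    }
    return this.breakers.get(name);
  }
}

// Thresholds differ per dependency; these numbers are placeholders.
const settings = {
  payment: { failureThreshold: 3, successThreshold: 3, timeout: 60000 },
  notification: { failureThreshold: 10, successThreshold: 2, timeout: 15000 },
};
const defaults = { failureThreshold: 5, successThreshold: 3, timeout: 30000 };

// A plain object stands in for a breaker here; with the class from this
// article the factory would return `new CircuitBreaker(options)` instead.
const registry = new BreakerRegistry((name) => ({
  name,
  options: settings[name] ?? defaults,
  state: 'CLOSED',
}));

// Tripping the notification breaker leaves the payment breaker untouched.
registry.get('notification').state = 'OPEN';
console.log(registry.get('payment').state); // CLOSED
```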

No fallback behavior. When the circuit is open, your application must do something useful instead of just throwing an error. Options include returning cached data (even if stale), returning a degraded response (partial data), queuing the request for later processing, or returning a user-friendly error with an estimated recovery time. The worst fallback is an unhandled exception that crashes the request.
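
The "return cached data" option could look like the sketch below. `withCachedFallback` is a hypothetical helper, and `breaker` is assumed to be any object with an async `call(fn)` method that throws when the circuit is open or the call fails.

```javascript
// Hypothetical helper: serve the last good value when the call fails or is rejected.
function withCachedFallback(breaker, fn, { defaultValue = null } = {}) {
  let lastGood = { hasValue: false, value: undefined, at: 0 };

  return async function call() {
    try {
      const value = await breaker.call(fn);
      // Remember the most recent successful result and when it was fetched.
      lastGood = { hasValue: true, value, at: Date.now() };
      return { value, stale: false };
    } catch (error) {
      if (lastGood.hasValue) {
        // Degraded response: stale data, flagged so callers can tell.
        return { value: lastGood.value, stale: true, cachedAt: lastGood.at };
      }
      // Nothing cached yet: fall back to a caller-supplied default.
      return { value: defaultValue, stale: true, error };
    }
  };
}
```

Returning a `stale` flag alongside the value lets the caller decide whether to show a "data may be out of date" notice instead of silently serving old data.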

Setting thresholds too aggressively. A failure threshold of 1 means a single timeout trips the circuit. In any real system, individual request failures are normal. Network blips, deployment rolling restarts, and garbage collection pauses all cause transient failures that resolve immediately. A threshold of 1 turns every transient error into a full circuit trip with 30+ seconds of rejected requests. Start with 5 and adjust downward only if you have evidence that lower thresholds prevent real outages.

No timeout on the downstream call itself. The circuit breaker monitors failures but does not enforce timeouts. If your HTTP client has a 30-second default timeout and the downstream service is hanging (not failing, just not responding), each request consumes a thread for 30 seconds before the circuit breaker sees a failure. Set aggressive HTTP timeouts (1-5 seconds for synchronous calls) so the circuit breaker gets failure signals quickly. For latency characteristics of various service types, see the API latency simulator.
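
A timeout can be layered under the breaker with a small wrapper. The sketch below is one possible shape, assuming a runtime with a global `AbortController` (browsers and recent Node.js versions); `withTimeout` is a hypothetical helper name.

```javascript
// Hypothetical helper: reject if fn does not settle within `ms` milliseconds.
// fn receives an AbortSignal so it can cancel its own work
// (fetch accepts one through its `signal` option).
function withTimeout(fn, ms) {
  return async function timed() {
    const controller = new AbortController();
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => {
        // Reject first so callers see the timeout error, then cancel the work.
        reject(new Error(`Timed out after ${ms} ms`));
        controller.abort();
      }, ms);
    });
    try {
      return await Promise.race([fn(controller.signal), timeout]);
    } finally {
      clearTimeout(timer);
    }
  };
}

// Usage with a breaker (not executed here):
// const getUsers = withTimeout((signal) => fetch('/api/users', { signal }), 2000);
// await breaker.call(getUsers);
```

With this in place, a hanging dependency produces a failure after `ms` milliseconds, so the breaker receives its signal quickly instead of waiting on the HTTP client's default.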

Monitoring Circuit Breakers

A circuit breaker that trips silently is almost as dangerous as having no circuit breaker at all. Every state transition should emit a metric and, for Open transitions, trigger an alert. The key metrics to track are state transitions per minute, rejection rate during Open state, time-to-recovery (how long the circuit stays open), and the Half-Open success rate (percentage of probe requests that succeed).

# Prometheus metrics for circuit breaker monitoring
circuit_breaker_state{service="payment"}       # 0=closed, 1=open, 2=half_open
circuit_breaker_trips_total{service="payment"} # counter: times tripped
circuit_breaker_rejections_total{service="payment"}  # counter: rejected requests
circuit_breaker_recovery_seconds{service="payment"}  # histogram: time to recover

# Alert on circuit trip
- alert: CircuitBreakerOpen
  expr: circuit_breaker_state == 1
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Circuit breaker OPEN for {{ $labels.service }}"

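# Alert on frequent trips (flapping)
# rate() is per second, so use increase() to count trips inside the window;
# 3 trips in 5 minutes is a starting point to tune
- alert: CircuitBreakerFlapping
  expr: increase(circuit_breaker_trips_total[5m]) >= 3
  labels:
    severity: warning
  annotations:
    summary: "Circuit breaker flapping for {{ $labels.service }}"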
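
On the application side, the values behind these metrics can be derived from breaker events. The following is a library-free sketch: `attachMetrics` is hypothetical, it works with any object exposing `on(event, handler)` and emitting `stateChange` and `rejected` events like the CircuitBreaker class above, and exporting the numbers to Prometheus is left to whichever client library you use.

```javascript
// Hypothetical collector: derives state, trips, rejections, and recovery time
// from breaker events.
const STATE_VALUES = { CLOSED: 0, OPEN: 1, HALF_OPEN: 2 };

function attachMetrics(breaker, now = Date.now) {
  const metrics = { state: 0, trips: 0, rejections: 0, recoverySeconds: [] };
  let openedAt = null;

  breaker.on('stateChange', ({ to }) => {
    metrics.state = STATE_VALUES[to];
    if (to === 'OPEN') {
      // Counts every transition into OPEN, including failed Half-Open probes.
      metrics.trips++;
      // Keep the first open time across OPEN -> HALF_OPEN -> OPEN cycles.
      if (openedAt === null) openedAt = now();
    }
    if (to === 'CLOSED' && openedAt !== null) {
      metrics.recoverySeconds.push((now() - openedAt) / 1000);
      openedAt = null;
    }
  });

  breaker.on('rejected', () => {
    metrics.rejections++;
  });

  return metrics;
}
```

Measuring recovery from the first trip rather than the latest one keeps a flapping dependency from looking like a series of short, unrelated outages.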

Real-World Failure Scenarios

Understanding how circuit breakers behave in real failure scenarios helps you tune thresholds and fallback strategies. Here are four common scenarios and how the circuit breaker should respond to each.

Scenario 1: Database connection pool exhaustion. Under heavy load, the database connection pool fills up. New queries wait for a connection, then timeout after 5 seconds. The circuit breaker sees 5 consecutive timeouts and trips open. During the open period, the application serves cached data or returns a "service busy" response. The database pool drains, connections become available, and the Half-Open probe succeeds. Total user impact: 30-60 seconds of degraded (but not broken) service instead of 5+ minutes of cascading timeouts.

Scenario 2: Third-party API rate limiting. A payment processor starts returning 429 (Too Many Requests) with Retry-After headers. The circuit breaker counts these as failures and trips after the threshold. During the open period, requests are queued for later processing instead of being rejected outright. When the timeout expires and the Half-Open probe returns 200, the circuit closes and the queue drains. This prevents retry storms that would keep hitting the rate limit.
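
When a 429 carries a Retry-After header, the server's hint can inform how long to stay open. The sketch below is an optional idea rather than part of the implementation above; `retryAfterMs` and `openPeriodMs` are hypothetical helpers, and the 30-second default and 5-minute cap are assumed values. The header carries either a number of seconds or an HTTP date.

```javascript
// Hypothetical helper: turn a Retry-After header value into a delay in milliseconds.
// Returns null when the header is missing or cannot be parsed.
function retryAfterMs(headerValue, nowMs = Date.now()) {
  if (headerValue == null) return null;
  const text = String(headerValue).trim();
  // Form 1: delay in whole seconds, for example "120"
  if (/^\d+$/.test(text)) {
    return Number(text) * 1000;
  }
  // Form 2: an HTTP date, for example "Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(text);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}

// Pick the open period: the server's hint if present, bounded by a cap.
function openPeriodMs(headerValue, defaultMs = 30000, capMs = 300000, nowMs = Date.now()) {
  const hinted = retryAfterMs(headerValue, nowMs);
  return hinted === null ? defaultMs : Math.min(hinted, capMs);
}
```

The cap guards against a misbehaving upstream sending a very large value and keeping the circuit open far longer than intended.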

Scenario 3: Dependency deployment rollout. A microservice dependency deploys a new version with a 30-second rolling restart. During the restart, 20% of requests hit terminating pods and fail. The circuit breaker should NOT trip here, because 4 of 5 pods are healthy and the failures are interleaved with successes. With consecutive-failure counting and a threshold of 5, each success resets the counter, so the breaker trips only if five failures happen to land in a row (roughly a 0.03% chance for any given run of five requests, if failures are independent at 20%). A rolling-window failure rate (50%+ over 10 seconds) also stays closed, since 20% is well below the limit, and it is the more predictable choice when failures arrive in bursts rather than evenly spread.

Scenario 4: DNS resolution failure. A DNS change propagates incorrectly and the downstream service hostname resolves to the wrong IP. Every request fails immediately with a connection refused error. The circuit breaker trips after 5 failures (within 1-2 seconds since the failures are instant). The open timeout gives DNS caches time to refresh. If the DNS is still wrong after the timeout, the Half-Open probe fails and the circuit reopens. This is one of the few scenarios where a longer timeout (60-120 seconds) is appropriate because DNS propagation is slow. For testing how your system handles these error scenarios, use the mock API response generator to simulate various failure modes.

Frequently Asked Questions

What is the circuit breaker pattern in software?

The circuit breaker pattern is a resilience mechanism that prevents an application from repeatedly calling a failing service. Like an electrical circuit breaker, it monitors for failures and trips open when failures reach a threshold, immediately rejecting subsequent requests without attempting the call. After a timeout period, it enters a Half-Open state to test if the service has recovered. If enough consecutive test requests succeed to meet the success threshold, the circuit closes and normal traffic resumes. If any test request fails, the circuit opens again. This prevents cascading failures across distributed systems and gives the failing service time to recover without being overwhelmed by retry traffic.

What are the three states of a circuit breaker?

The three states are Closed, Open, and Half-Open. In the Closed state (normal operation), all requests pass through to the downstream service and failures are counted. When failures exceed the configured threshold, the breaker transitions to Open. In the Open state (protection mode), all requests are immediately rejected without calling the downstream service, returning a fallback response or error. After a configured timeout, the breaker transitions to Half-Open (recovery testing). In Half-Open, a limited number of test requests are allowed through. If they succeed consecutively, the breaker returns to Closed. If any fail, it returns to Open.

How do I choose circuit breaker threshold values?

Start with a failure threshold of 5 consecutive failures or 50% failure rate over a rolling 10-second window for most APIs. Set the success threshold in Half-Open to 3 consecutive successes to confirm genuine recovery. Set the timeout duration to 30-60 seconds for standard services, shorter (10-15s) for fast-recovering services like serverless functions, and longer (2-5 minutes) for slow-starting services like legacy systems. Monitor and adjust based on your actual failure patterns: increase the failure threshold if false trips occur during deployments, decrease it if cascading failures propagate before the breaker activates.

What is the difference between circuit breaker and retry pattern?

The retry pattern retries individual failed requests with increasing delays (exponential backoff), assuming the failure is transient and will resolve within seconds. The circuit breaker pattern stops all requests to a dependency when failures accumulate, assuming the service is systemically unhealthy and needs minutes to recover. They serve different failure modes: retry handles network blips and brief glitches, circuit breaker handles outages and cascading failures. In practice, combine both: retry handles the first 2-3 failures with backoff, and if retries are consistently exhausted, the circuit breaker trips to prevent retry amplification from hammering a recovering service.

When should I use a circuit breaker vs a bulkhead?

Use a circuit breaker when you need to detect downstream service failures and stop calling a failing service to prevent cascading failures and allow recovery. Use a bulkhead when you need to isolate failures so that one slow or failing dependency cannot consume all shared resources (threads, connections, memory) and take down unrelated functionality. They solve complementary problems: circuit breaker is about failure detection and recovery, bulkhead is about resource isolation and containment. A production system should use both: bulkheads give each dependency its own bounded resource pool, and circuit breakers within each pool detect failures and short-circuit requests.

Michael Lip

Solo developer building free tools at Zovo. Kappafy helps developers work with JSON and APIs faster. No tracking, no accounts, no data collection. Learn more.