lanyxp

Posted on Jun 17

Your circuit breaker stops at the service layer. Slow SQL needs one too.

#springboot #database #java #opensource


A single slow query can take down an entire service in seconds. This post starts with a production cascade failure and walks through how I built a Spring Boot Starter that does circuit breaking at the MyBatis interceptor layer, keyed by SQL type + SQL fingerprint — plus a few design trade-offs worth talking about. The SDK is open source, on Maven Central, and takes one dependency + a bit of YAML to wire in.

🔗 Source: GitHub · Gitee (a Star ⭐ helps if you find it useful)

1. A short war story: how one slow query took down a service

One evening at peak traffic, the order service started timing out everywhere and alerts went off. The root cause was mundane: a list query had picked up a new filter condition that wasn't covered by an index, turned into a full table scan, and took 30+ seconds per call.

The thing is, it doesn't stop at "slow once":

One slow query holds a DB connection hostage for 30 seconds.
Under load, more of the same requests keep coming in, eating connections one by one.
The connection pool drains → every other query (including perfectly healthy ones) can't get a connection, queues up, times out.
Upstream threads all block waiting for connections → endpoints time out → callers retry → it gets worse.

One slow query ends up dragging down the whole database, and the whole service.

In the post-mortem the first reaction was: "Don't we already have a circuit breaker?"

2. Why the usual circuit breakers aren't enough

Resilience4j, Hystrix, Sentinel — all mature. But they operate at the endpoint / RPC / method-call level. They can tell you "this endpoint's failure rate is high, trip it", but they don't understand SQL:

They don't know which query is slow — they can only trip the whole endpoint, even though 90% of the SQL behind it might be perfectly healthy.
They aggregate by endpoint, but the actual fault lives at the level of one class of SQL. Within a single endpoint, select * from order where ... being slow says nothing about whether the insert is fine.
They can't catch SQL that doesn't sit on an obvious boundary (scheduled jobs, Mapper calls buried deep in a call chain).

What I actually wanted was a breaker whose granularity lands exactly at "SQL type + SQL shape":

When one class of SQL goes bad, fast-fail only that class — leave the rest alone. And the trip has to happen before the request is actually sent to the DB, so the connection pool stays protected.

The MyBatis / MyBatis-Plus Interceptor (plugin) is exactly the right seam: every CRUD statement passes through it, you can read the SQL, measure the execution time, and intercept before it runs. So I built this.

3. What it looks like: two steps to wire in

The punchline first; the rest is just the "why". Integration is zero-touch for your business code:

1. Add the dependency (Spring Boot 2.x; for 3.x use -spring-boot3-starter):

<dependency>
    <groupId>io.github.showingdata.starter.framework</groupId>
    <artifactId>sql-circuit-breaker-spring-boot-starter</artifactId>
    <version>2.1.5</version>
</dependency>

2. Configure the YAML (per SQL type; all four are required):

sql-circuit-breaker:
  enabled: true
  select:
    timeout-ms: 10000          # SELECT timeout threshold
    failure-threshold: 3       # consecutive timeouts that trip the breaker
    circuit-open-ms: 60000     # how long the breaker stays open (60s)
    cache-max-size: 10000      # max entries in the breaker-state cache
  insert:    { timeout-ms: 5000, failure-threshold: 1, circuit-open-ms: 30000, cache-max-size: 5000 }
  update:    { timeout-ms: 5000, failure-threshold: 1, circuit-open-ms: 30000, cache-max-size: 5000 }
  delete:    { timeout-ms: 5000, failure-threshold: 1, circuit-open-ms: 30000, cache-max-size: 5000 }

Restart and you're done — not a single line of business code changes. From then on, when a class of SQL times out enough times in a row, it trips: while open, that class of SQL fast-fails locally and never reaches the DB, giving the database room to breathe; it recovers automatically when the window expires.

Why split SELECT and DML config? Because their risk profiles are completely different. DML holds locks and has a wide blast radius — often "one timeout should trip it". SELECT comes in many shapes and is more tolerant — you can afford "three in a row before tripping". A single one-size-fits-all threshold just doesn't make sense.

4. The design, and the parts worth talking about

4.1 The matching unit: SQL fingerprint, not the full SQL

If you key on the full SQL text, then where user_id = 123 and where user_id = 456 become two different statements with separate counters — which is wrong: they share the same shape, and they're slow because of the query pattern, not the specific parameter.

So the matching unit is the SQL fingerprint: normalize the parameters away, keep the structure.

-- raw (two calls, different params)
SELECT * FROM order WHERE user_id = 123 AND status = 1
SELECT * FROM order WHERE user_id = 456 AND status = 2

-- fingerprint (identical)
select * from order where user_id = ? and status = ?

The rule is simple: replace parameter placeholders (? / #{xxx}) with ?, collapse whitespace, lowercase, take the MD5 as the key.

That way one trip protects the entire class of SQL, instead of the count getting "diluted" across parameter values and never reaching the threshold.

The final breaker key is:

datasource_id : sql_type : fingerprint_md5
e.g.  default:SELECT:a3f2c1...

The datasource prefix is for multi-datasource setups (more on that later), the middle is the SQL type, the tail is the fingerprint MD5 (to avoid absurdly long keys).
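A minimal sketch of the fingerprint rule and key layout described above. This is not the SDK's actual code: `SqlFingerprint`, `normalize`, `md5Hex`, and `breakerKey` are hypothetical names chosen for illustration. It implements only the stated placeholder rule; folding inline literals such as `123` into `?` would need a further pass that is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical helper for illustration; not the SDK's API. */
final class SqlFingerprint {

    private SqlFingerprint() {}

    /** Replace #{...} placeholders with ?, collapse whitespace, lowercase. */
    static String normalize(String sql) {
        return sql.replaceAll("#\\{[^}]*\\}", "?")   // #{userId} -> ?
                  .replaceAll("\\s+", " ")            // newlines, tabs, runs of spaces -> one space
                  .trim()
                  .toLowerCase(Locale.ROOT);
    }

    /** Hex-encoded MD5, used only to keep the key short (not for security). */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every conforming Java platform ships MD5, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** datasource_id : sql_type : fingerprint_md5 */
    static String breakerKey(String datasourceId, String sqlType, String sql) {
        return datasourceId + ":" + sqlType + ":" + md5Hex(normalize(sql));
    }
}
```

Two statements that differ only in case, whitespace, or placeholder names end up with the same key, so their timeouts land on the same counter.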

4.2 State machine: two states, no half-open

Many circuit breakers are three-state: CLOSED → OPEN → HALF_OPEN (probing). I deliberately kept it to two:

            consecutive timeouts >= failureThreshold
  CLOSED ──────────────────────────────────────────→ OPEN
    ↑                                                  │
    └──────── auto-reset when circuitOpenMs elapses ───┘

Why drop half-open?

Half-open means managing "probe permits" (let a few requests through to test the waters) and reconciling those permits under concurrency — non-trivial complexity.
And the SQL breaker's failure-threshold is tiny to begin with (default 3, DML even 1). After the open window expires, the breaker simply resets to CLOSED and lets all traffic through again; even if the fault isn't fixed, it re-trips within a couple of failures. The cost of "re-trip fast" is entirely acceptable here.
Two states are simple to reason about and easy on ops: "either it's open or it isn't", no mysterious middle state.

One engineering detail: while OPEN, requests that were already in flight before the trip and only time out late do not add to the count or refresh the window — so the open window stays exactly circuit-open-ms and never gets silently extended by stragglers.
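The two-state machine, including the straggler rule, can be sketched roughly like this. It is an illustration under assumptions, not the SDK's implementation: `TwoStateBreaker` and its method names are hypothetical, the clock is injected only so the behaviour can be tested, and resetting the counter on a success is my reading of "consecutive timeouts".

```java
import java.util.function.LongSupplier;

/** Hypothetical two-state breaker for a single key; not the SDK's class. */
final class TwoStateBreaker {
    private final int failureThreshold;
    private final long circuitOpenMs;
    private final LongSupplier clockMs;   // injectable clock, e.g. System::currentTimeMillis

    private int consecutiveTimeouts;
    private long openUntilMs = -1;        // -1 means CLOSED

    TwoStateBreaker(int failureThreshold, long circuitOpenMs, LongSupplier clockMs) {
        this.failureThreshold = failureThreshold;
        this.circuitOpenMs = circuitOpenMs;
        this.clockMs = clockMs;
    }

    /** Call before the SQL is sent; false means fast-fail locally. */
    synchronized boolean allowRequest() {
        resetIfWindowElapsed();
        return openUntilMs < 0;
    }

    /** Call when a statement finished within its timeout threshold. */
    synchronized void recordSuccess() {
        resetIfWindowElapsed();
        if (openUntilMs < 0) {
            consecutiveTimeouts = 0;      // assumption: a success breaks the streak
        }
    }

    /** Call when a statement exceeded its timeout threshold. */
    synchronized void recordTimeout() {
        resetIfWindowElapsed();
        if (openUntilMs >= 0) {
            return;                       // straggler while OPEN: neither counted nor extends the window
        }
        if (++consecutiveTimeouts >= failureThreshold) {
            openUntilMs = clockMs.getAsLong() + circuitOpenMs;
        }
    }

    /** OPEN goes straight back to CLOSED once the window elapses; no half-open probing. */
    private void resetIfWindowElapsed() {
        if (openUntilMs >= 0 && clockMs.getAsLong() >= openUntilMs) {
            openUntilMs = -1;
            consecutiveTimeouts = 0;
        }
    }
}
```

Because a late timeout returns early while the breaker is OPEN, the window ends exactly `circuitOpenMs` after the trip, no matter how many in-flight statements report in afterwards.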

4.3 A misconfiguration I designed away: cache "expire-after-access" isn't configurable

Each SQL fingerprint's breaker state lives in a Guava Cache (four independent caches, one per SQL type). Each cache has two eviction policies:

Policy	Source	Purpose
LRU size cap	`cache-max-size` (per type)	Hard memory ceiling, prevents unbounded growth
expire-after-access	derived from `circuit-open-ms` (20× and at least 5 min), not configurable	Cleans up long-idle SQL

The point is that second row: I deliberately don't let you configure expire-after-access.

The reason is a hard constraint tied to the open window: the access-expiry must be significantly larger than circuit-open-ms. Otherwise you get a subtle bug. A query trips and is sitting in OPEN, but during that window no new requests come in (it's fast-failing, and callers may have backed off), so its state gets evicted by access-expiry prematurely. The next request finds no state, treats the query as brand-new CLOSED, and lets it through. Your protection just quietly weakened.

By deriving it from circuit-open-ms, the "two values configured backwards" mistake becomes impossible. The memory ceiling is still enforced by cache-max-size; access-expiry only handles idle cleanup. If a constraint can eliminate a misconfiguration, don't leave it as a config knob.
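The derivation itself is small enough to show. The sketch below assumes the rule is exactly "20 × circuit-open-ms, with a 5-minute floor" as stated in the table; `AccessExpiry` and `deriveFrom` are hypothetical names, and how the SDK actually wires the result into its cache is not shown.

```java
import java.time.Duration;

/** Hypothetical helper; mirrors the "20x, at least 5 min" rule from the table above. */
final class AccessExpiry {
    static final int MULTIPLIER = 20;
    static final Duration FLOOR = Duration.ofMinutes(5);

    private AccessExpiry() {}

    static Duration deriveFrom(long circuitOpenMs) {
        if (circuitOpenMs <= 0) {
            throw new IllegalArgumentException("circuitOpenMs must be positive");
        }
        Duration scaled = Duration.ofMillis(circuitOpenMs).multipliedBy(MULTIPLIER);
        // Never expire idle entries sooner than the floor, however short the open window is.
        return scaled.compareTo(FLOOR) < 0 ? FLOOR : scaled;
    }
}
```

With the YAML from section 3, the SELECT cache (60 s open window) would expire idle entries after 20 minutes and the DML caches (30 s) after 10 minutes; a 10 s window would hit the 5-minute floor. In every case the expiry is at least 20 times the open window, so an OPEN entry cannot be evicted for idleness before its window ends.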

4.4 It trips on timeouts only, not on exceptions

This is a deliberate boundary: exceptions thrown by SQL execution — connection errors, syntax errors, constraint violations — are never counted toward the breaker. Only "execution time exceeded the threshold" counts as a failure.

Why? Because the breaker protects against one specific failure mode: slow SQL exhausting the connection pool. A syntax error is a code bug; a constraint violation is a data problem. Neither holds connections hostage or drags down the DB, and counting them would only cause false trips (a temporarily-erroring query gets tripped, which actually masks the real business problem). Single responsibility keeps it predictable.
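In code, the boundary looks something like the sketch below. It is a simplified stand-in for the interceptor's timing logic, not the real thing: `TimeoutOnlyRecorder` is a hypothetical name, the statement is modelled as a plain `Supplier`, and the clock is injected for testability.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch: only elapsed time over the threshold counts as a failure. */
final class TimeoutOnlyRecorder {
    private final long timeoutMs;
    private final LongSupplier clockMs;
    private int timeouts;

    TimeoutOnlyRecorder(long timeoutMs, LongSupplier clockMs) {
        this.timeoutMs = timeoutMs;
        this.clockMs = clockMs;
    }

    <T> T execute(Supplier<T> statement) {
        long start = clockMs.getAsLong();
        // If the statement throws (syntax error, constraint violation, connection error),
        // the exception propagates from here and the counting below is never reached.
        T result = statement.get();
        long elapsed = clockMs.getAsLong() - start;
        if (elapsed > timeoutMs) {
            timeouts++;               // the only path that feeds the breaker
        }
        return result;
    }

    int timeoutCount() {
        return timeouts;
    }
}
```

Note that a statement which runs long and then throws is not counted in this sketch either, which matches the "exceptions are never counted" rule as stated.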

4.5 A performance detail under load: the exception doesn't fill in its stack trace

The SqlCircuitBreakerException thrown on fast-fail overrides fillInStackTrace() to skip stack capture entirely.

The reason: during a high-concurrency trip this exception may be thrown thousands of times per second, and fillInStackTrace() is expensive because it walks the entire call stack. Skipping it saves a lot of CPU and memory.
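For readers who haven't seen the trick, here is what a stackless exception can look like. `StacklessFastFailException` is a hypothetical stand-in, not the SDK's `SqlCircuitBreakerException`; only the `fillInStackTrace()` override is taken from the description above.

```java
/** Hypothetical stand-in showing the technique; not the SDK's exception class. */
final class StacklessFastFailException extends RuntimeException {
    private final String circuitKey;

    StacklessFastFailException(String circuitKey) {
        super("SQL circuit open: " + circuitKey);
        this.circuitKey = circuitKey;
    }

    String getCircuitKey() {
        return circuitKey;
    }

    /** Skip the stack walk: construction becomes cheap, but getStackTrace() is empty. */
    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }
}
```

The JDK offers a second route to the same effect: the protected `RuntimeException(String, Throwable, boolean enableSuppression, boolean writableStackTrace)` constructor with `writableStackTrace` set to `false`.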

But that buys you a gotcha you must know about (next section).

4.6 Heads up: you can't catch this breaker exception

This follows from the above: the exception carries no stack trace of its own, and on the way out MyBatis re-wraps it in a MyBatisSystemException. So a catch on the breaker's own exception type never matches:

// ❌ Writing this in your Service / Controller won't catch it!
try {
    orderMapper.queryByUser(param);
} catch (SqlCircuitBreakerException e) {   // never matches: what arrives here is a MyBatisSystemException
    ...
}

The right way is to catch it centrally in a global exception handler — and, importantly, log using the wrapper MyBatisSystemException, because its stack trace contains the full business call chain (Controller → Service → Mapper):

@RestControllerAdvice
public class GlobalExceptionHandler {

    @ExceptionHandler(MyBatisSystemException.class)
    public ResponseEntity<?> handle(MyBatisSystemException ex) {
        SqlCircuitBreakerException cb = findCircuitBreaker(ex);   // walk the cause chain
        if (cb != null) {
            // key point: log with the wrapper `ex` — it has the business line numbers
            log.error("[SqlCircuitBreaker] fast-fail | key={} | business stack below", cb.getCircuitKey(), ex);
            return ResponseEntity.status(503).body(...);
        }
        ...
    }
}

This way you get both the "no stack fill" performance and the ability to pinpoint which Service/Controller triggered it. Best of both worlds — as long as you know the mechanism exists.
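The handler above calls a `findCircuitBreaker` helper without showing it. One possible shape, written generically so it runs without MyBatis on the classpath, is below. `CauseChains` and `findCause` are hypothetical names; in the handler you would pass the wrapper exception and `SqlCircuitBreakerException.class`.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical helper: find the first exception of a given type in a cause chain. */
final class CauseChains {

    private CauseChains() {}

    static <T extends Throwable> T findCause(Throwable top, Class<T> type) {
        // Identity-based "seen" set guards against cyclic cause chains.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable t = top; t != null && seen.add(t); t = t.getCause()) {
            if (type.isInstance(t)) {
                return type.cast(t);
            }
        }
        return null;   // no exception of that type anywhere in the chain
    }
}
```

Walking the whole chain, rather than checking only `ex.getCause()`, keeps the handler working even if there is more than one wrapping layer between the interceptor and your controller advice.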

4.7 Config priority: from one-size-fits-all to fine-grained

Four layers of config, higher overrides lower:

ThreadLocal (programmatic)  >  method annotation  >  interface annotation  >  global YAML

Global YAML is the only place you can configure per SQL type (fine-grained).
Annotations / ThreadLocal are coarse overrides — they apply uniformly to all SQL types under the annotated Mapper / method.

Typical usage:

@SqlCircuitBreaker(timeoutMs = 5000)                          // interface level: 5s for this whole Mapper
public interface OrderMapper extends BaseMapper<Order> {

    @SqlCircuitBreaker(timeoutMs = 2000, circuitOpenMs = 30000)   // method-level override
    List<Order> complexQuery(QueryParam param);

    @SqlCircuitBreaker(disableCircuitBreaker = true)             // admin query, skip the breaker
    List<Order> adminQuery(AdminParam param);
}
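The four-layer precedence boils down to "first layer that sets a value wins". A minimal sketch for a single setting, assuming each override layer is represented as a nullable value (`TimeoutResolver` and its parameters are hypothetical, and the SDK's real resolution code may differ):

```java
/** Hypothetical sketch of the precedence: ThreadLocal > method > interface > global YAML. */
final class TimeoutResolver {

    private TimeoutResolver() {}

    /** Each override layer is null when it does not set a value; the global YAML always has one. */
    static long resolveTimeoutMs(Long threadLocal, Long methodAnnotation,
                                 Long interfaceAnnotation, long globalYaml) {
        if (threadLocal != null) {
            return threadLocal;
        }
        if (methodAnnotation != null) {
            return methodAnnotation;
        }
        if (interfaceAnnotation != null) {
            return interfaceAnnotation;
        }
        return globalYaml;
    }
}
```

For `complexQuery` in the Mapper above, the method-level 2000 ms beats the interface-level 5000 ms, which in turn beats the 10000 ms SELECT default from the YAML.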

ThreadLocal fits the "loosen/disable just for this request" case — e.g. a scheduled data-repair job that you know will be slow but don't want to trip the breaker:

try {
    SqlCircuitBreakerContext.disableCircuitBreaker();
    orderMapper.batchFixData(ids);
} finally {
    SqlCircuitBreakerContext.clear();   // must clear, or thread-pool reuse pollutes the next request
}

A trade-off here: the interceptor does not auto-clear. We want a Service to set it once and have it apply across the several Mapper calls it makes; if the interceptor cleared after the first SQL, it'd be lost from the second onward. The cost is putting clear() on the caller (in a finally), in exchange for "set once, applies to the whole block" semantics.
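If putting `clear()` on every caller feels fragile, you could wrap the pattern once in your own code. The sketch below is an assumption about how that might look, not something the SDK provides: `BreakerOverride` is a hypothetical class with its own `ThreadLocal`, standing in for `SqlCircuitBreakerContext`.

```java
/** Hypothetical wrapper illustrating "set once, applies to the whole block, always cleared". */
final class BreakerOverride {
    private static final ThreadLocal<Boolean> DISABLED = new ThreadLocal<>();

    private BreakerOverride() {}

    static boolean isDisabled() {
        return Boolean.TRUE.equals(DISABLED.get());
    }

    /** Runs the block with the breaker disabled and clears the flag even if the block throws. */
    static void runWithBreakerDisabled(Runnable block) {
        DISABLED.set(Boolean.TRUE);
        try {
            block.run();              // every Mapper call inside sees the same override
        } finally {
            DISABLED.remove();        // a pooled thread must not carry the flag into the next task
        }
    }
}
```

The override stays in place across every call inside the block, and the `finally` guarantees the thread goes back to the pool clean.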

5. Production essential: observability

The scariest thing about a circuit breaker is "it's quietly working, but you don't know". With spring-boot-actuator on the classpath, the SDK auto-exposes 5 Micrometer metrics with zero extra config:

Metric	Type	Meaning
`sql.circuit.breaker.intercept.total`	Counter	total SQL intercepted
`sql.circuit.breaker.timeout`	Counter	timeouts
`sql.circuit.breaker.open`	Counter	trips (CLOSED→OPEN)
`sql.circuit.breaker.fast.fail`	Counter	fast-fails
`sql.circuit.breaker.open.count`	Gauge	breakers currently OPEN (real-time)

The most useful one is that last Gauge. A single alert rule is enough:

# any breaker currently open → alert immediately; back to zero means everything auto-recovered
sql_circuit_breaker_open_count > 0

There's also a trap that only shows up at scale: timeout / open / fast.fail carry a mapper_id label by default (handy for pinpointing a specific Mapper). But that label explodes your time series — per service it's roughly (# Mapper methods) × 4 (types) × 3 (metrics). With a few hundred Mappers across multiple replicas, Prometheus chokes (and series-billed backends cost you real money).

So there's a switch to drop the mapper_id label, collapsing the series count from N×12 to a fixed 12; you then locate the specific Mapper via logs instead:

sql-circuit-breaker:
  metrics:
    include-mapper-id: false   # turn off when you're at scale or sensitive to series cardinality
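To make the cardinality concrete, here is the estimate from the formula above as a few lines of Java. `SeriesEstimate` is a hypothetical name and the result is a rough upper bound per instance, exactly as the formula gives it.

```java
/** Hypothetical back-of-the-envelope estimate based on the formula in the text. */
final class SeriesEstimate {
    static final int SQL_TYPES = 4;          // select, insert, update, delete
    static final int LABELLED_METRICS = 3;   // timeout, open, fast.fail

    private SeriesEstimate() {}

    static long perInstance(int mapperMethods, boolean includeMapperId) {
        long base = (long) SQL_TYPES * LABELLED_METRICS;   // 12 series without the label
        return includeMapperId ? mapperMethods * base : base;
    }
}
```

With 300 Mapper methods (a made-up number), that is up to 3,600 series per instance with the label on and 12 with it off; multiply by the replica count for the service-wide total.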

When you build infrastructure-type components, you have to think ahead to "what happens once this is at scale". Labels-on-by-default makes the out-of-the-box experience nice, but you must give people an escape hatch for when it grows.

6. Other "production-grade" details I bolted on

A few, all forged by production:

Notifications fire once: implement the MessageCenterClient interface to push breaker events to Slack / Teams / whatever. But notifications fire only when the breaker first opens, not on the fast-fail path — otherwise a high-concurrency trip would spam thousands of messages a second.
Multi-datasource isolation: the breaker key includes a datasource identifier, so a slow query in DB-A won't trip DB-B. With a runtime-routing framework, implement a DataSourceKeyResolver that returns the current datasource key; single-datasource needs zero config.
SELECT ... FOR UPDATE misclassification: MyBatis types SQL by its XML tag, so SELECT ... FOR UPDATE is treated as a SELECT (the loose threshold) — even though it holds locks and behaves more like DML. Tighten such methods with a dedicated annotation.
Config validated at startup: all four SQL-type blocks are required and illegal values (e.g. timeout-ms <= 0) fail the boot — surfacing errors at startup rather than when some query hits it in production.
Unified log prefix [SqlCircuitBreaker]: easy to filter in ELK, with a ready-made example for a dedicated logback appender (don't forget additivity="false", or you haven't actually isolated anything).

7. Its boundaries: know what it is not

Being honest about boundaries matters more than hyping features:

It's an "after-the-fact, statistical" breaker — it does not interrupt in-flight SQL. A query already sent to the DB and running won't be killed by the breaker (that's the job of JDBC / driver / pool timeouts). The breaker stops the subsequent requests of the same class from piling on.
State lives in each instance's memory, not shared across instances. In a multi-replica deployment each counts on its own, so treat the thresholds as "per-instance". Under uneven traffic you can lower them so a single instance converges faster.
Timeouts only, not other exceptions (covered above — deliberate).

These aren't shortcomings, they're explicit design boundaries — a component that's clear about what it won't do is one that won't get misused under the wrong expectations.

8. Wrapping up

Back to the original question: a single slow query shouldn't have the power to take down an entire service.

The idea behind this SDK is actually plain — push circuit breaking down to the MyBatis interceptor layer, use the SQL fingerprint as the matching unit, configure per SQL type, and fast-fail before the request ever reaches the DB. It fills exactly the "SQL layer" that Resilience4j / Hystrix-style frameworks can't reach.

But what really decides whether a component survives in production is rarely the core algorithm — it's the small trade-offs:

why two states and not half-open;
why access-expiry is derived instead of configurable;
why it trips on timeouts only, not exceptions;
why the exception skips its stack trace, and the "can't catch it" gotcha that follows;
why metrics need an escape hatch to drop a label.

A good piece of infrastructure bakes the "traps you've hit" and the "boundaries you've thought through" into its defaults and constraints, so its users hit fewer of them. Hope this is useful next time you're building something similar.

The SDK is open source and on Maven Central, with both Spring Boot 2.x and 3.x support, running in production across several systems. One dependency + a bit of YAML to integrate. Try it, break it, file issues.

📦 Repo:

GitHub: https://github.com/showingdata/sql-circuit-breaker

Gitee: https://gitee.com/LanyXP/sql-circuit-breaker

(If this helped, a Star ⭐ helps more devs who've been burned by slow SQL find it.)
