Seiya Izumi

Posted on Jun 16

Curing a Transcription Worker That Kept Dying on Cloud Run with Streaming

#architecture #software #performance

An audio transcription worker kept getting killed by OOM on Cloud Run.

Bumping the memory allocation calmed it down for a while, but that only inflated the bill without fixing anything. Crash, raise the limit, crash again — that was the steady state.

In the end, changing how the processing was structured improved memory, cost, transcription accuracy, and testability all at once. This post is a record of the path I took from treating symptoms to curing the root cause.

The punchline up front: what actually worked wasn't a clever algorithm. It was simply dropping "load everything into memory, then process" in favor of a streaming design: "read and forward as you go." That's it. But getting there required articulating clearly why the worker was dying, and identifying where the existing design was structurally bound to fail.

What the System Did

It transcribed the audio from user-uploaded mp4 files by sending it to a third-party transcription Web API. The slightly awkward part: that API only accepts wav. Uploads come in as mp4, the API wants wav. The conversion had to happen somewhere.

The runtime was GCP Cloud Run, the language was Go. The architecture was an asynchronous job: an upload enqueued a task to Cloud Tasks, which dispatched a worker that ran the transcription.

The Setup I Inherited, and Its Limits

By the time I took it over, the inside of the worker worked like this:

Launch ffmpeg from the application process
Split the mp4 into 15-second wav files
Load each wav into memory and send it to the transcription API

It worked. But the worker was constantly being OOM-killed.

Why? Breaking it down calmly, two memory-hungry factors were stacked on top of each other:

ffmpeg ran as a separate process, carrying its own resident memory
wav is uncompressed, so the same audio took up far more memory as wav than it did inside the mp4, and the design held that data in memory
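A rough back-of-the-envelope calculation shows how large the gap between uncompressed and compressed audio can be. The sample rate, channel count, and bitrate here are illustrative assumptions, not values from the actual system.

```go
package main

import "fmt"

// pcmBytes returns the size of uncompressed PCM audio.
func pcmBytes(sampleRate, channels, bytesPerSample, seconds int) int {
	return sampleRate * channels * bytesPerSample * seconds
}

// compressedBytes approximates the size of audio encoded at a constant bitrate.
func compressedBytes(bitsPerSecond, seconds int) int {
	return bitsPerSecond / 8 * seconds
}

func main() {
	const seconds = 15
	wav := pcmBytes(44100, 2, 2, seconds)   // assumed: 44.1 kHz, stereo, 16-bit
	aac := compressedBytes(128000, seconds) // assumed: 128 kbps AAC audio track
	fmt.Printf("wav: %d bytes, aac: %d bytes, ratio: %.1fx\n",
		wav, aac, float64(wav)/float64(aac))
	// prints "wav: 2646000 bytes, aac: 240000 bytes, ratio: 11.0x"
}
```

Under these assumptions a 15-second chunk is about 2.6 MB as wav versus about 0.24 MB as compressed audio, before counting any copies made while building the request.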

And Cloud Run can route multiple requests to a single container instance at once, up to the configured concurrency. So when several transcription tasks landed in the same container, this heavy processing multiplied accordingly. The moment single-task peak memory × concurrent tasks exceeded the container's memory limit, the whole container was OOM-killed. A single large mp4 could push it over on its own, and overlapping tasks made it worse. In short, it was a design guaranteed to fall over eventually.
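The multiplication can be written down directly. The sketch below is only arithmetic, with made-up numbers for the memory limit, baseline, and per-task peak; it shows how quickly a heavy per-task peak eats the headroom that a high concurrency setting assumes.

```go
package main

import "fmt"

const mib = 1 << 20

// maxSafeConcurrency returns how many tasks fit under the container limit
// when each task can peak at perTaskPeak bytes on top of a fixed baseline.
func maxSafeConcurrency(limit, baseline, perTaskPeak int64) int64 {
	if perTaskPeak <= 0 || limit <= baseline {
		return 0
	}
	return (limit - baseline) / perTaskPeak
}

func main() {
	limit := int64(1024 * mib)   // assumed container memory limit
	baseline := int64(64 * mib)  // assumed idle footprint of the worker
	perTask := int64(300 * mib)  // assumed peak for one task, ffmpeg plus wav buffers
	fmt.Println(maxSafeConcurrency(limit, baseline, perTask)) // prints 3
}
```

With these numbers, any concurrency setting above 3 is a bet that tasks never peak at the same time.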

What the Number "15" Was Telling Me

What snagged my attention while reading the code was why the split unit was 15 seconds.

Digging in, it turned out to be a compromise the previous engineer had found within the memory constraints. Make the chunks longer and each wav gets heavier, pressuring memory and increasing OOMs. Make them shorter and memory eases up, but now utterances get chopped at chunk boundaries, hurting transcription accuracy — and the request count goes up too.

15 seconds was the equilibrium point of that tug-of-war.

And here's what struck me: the number 15 wasn't a solution to the problem — it was a constraint born out of being shackled to the memory limit. Memory and accuracy, two genuinely separate concerns, were pushing against each other on a single parameter: the split length. Raise one and the other falls. The previous engineer had been carefully balancing on top of that, but if the tradeoff didn't exist in the first place, nobody would have to agonize over it.
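The tug-of-war on that single parameter is easy to tabulate. This sketch assumes 16 kHz mono 16-bit PCM and a 10-minute recording, both illustrative choices of mine; it only counts requests, chunk boundaries, and bytes per chunk for a given split length.

```go
package main

import "fmt"

// chunkPlan describes what a fixed split length costs for one recording.
type chunkPlan struct {
	Requests   int // API calls needed
	Boundaries int // places where an utterance can be cut in two
	ChunkBytes int // size of one full chunk as 16 kHz mono 16-bit PCM (assumed format)
}

// planFor computes the plan for a recording of durationSeconds split into
// chunks of chunkSeconds.
func planFor(durationSeconds, chunkSeconds int) chunkPlan {
	if durationSeconds <= 0 || chunkSeconds <= 0 {
		return chunkPlan{}
	}
	requests := (durationSeconds + chunkSeconds - 1) / chunkSeconds // ceiling division
	return chunkPlan{
		Requests:   requests,
		Boundaries: requests - 1,
		ChunkBytes: 16000 * 2 * chunkSeconds,
	}
}

func main() {
	for _, chunkSeconds := range []int{5, 15, 60} {
		fmt.Printf("%2ds chunks: %+v\n", chunkSeconds, planFor(600, chunkSeconds))
	}
}
```

Shorter chunks shrink `ChunkBytes` but raise `Requests` and `Boundaries`; longer chunks do the opposite. No value of `chunkSeconds` improves all three at once, which is the constraint the number 15 was encoding.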

Another Parameter, Set Carelessly

The same structure showed up in the concurrency setting. The worker's concurrency showed no sign of deliberate tuning; it looked like a ballpark value. My guess is it came from a cost/ops instinct: "I don't want to spin up a lot of instances."

But this interacted with the memory problem in the worst possible way. To save on instances, you set concurrency high. Multiple tasks then pile into one container, the heavy memory processing multiplies, and OOM kills the container.

The setting meant to reduce instances was, instead, killing the containers. Raise memory and the cost climbs; concurrency tugs against instance count; split length wavers between memory and accuracy. Small tradeoffs tangled everywhere, and the whole thing was stuck at an equilibrium nobody was happy with.

Don't Let the Diagnosis Stay a Guess

Before touching anything, I first confirmed that memory really was the cause. Settling on a cause by hunch wastes time on misdirected fixes.

I used Cloud Trace to follow, run after run, which segments of a request drove resource usage up and how that lined up with container OOM terminations. That narrowed it down to the wav memory loading and the ffmpeg process. Once the source is pinned down, the direction of the fix follows naturally: reduce what gets loaded into memory.
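As a complement to tracing, the in-process half of a suspicion like this can be checked cheaply from Go itself. The following is not the Cloud Trace workflow described above, just a minimal sketch using `runtime.ReadMemStats`; the helper names are hypothetical. It also only sees the Go heap, so memory held by a child process such as ffmpeg is invisible to it.

```go
package main

import (
	"fmt"
	"runtime"
)

// liveHeap forces a collection and returns the bytes of live heap objects.
func liveHeap() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// heapGrowth runs load, keeps its result alive, and reports how much the
// live heap grew while that result was held.
func heapGrowth(load func() []byte) int64 {
	before := liveHeap()
	data := load()
	after := liveHeap()
	runtime.KeepAlive(data) // keep data reachable until after the second measurement
	return int64(after) - int64(before)
}

func main() {
	// Stand-in for "load one wav chunk into memory": 32 MiB of PCM.
	growth := heapGrowth(func() []byte { return make([]byte, 32<<20) })
	fmt.Printf("live heap grew by about %d MiB while the chunk was held\n", growth>>20)
}
```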

The Shift in Thinking: Stream While Reading, Not Load Then Forward

The essential waste in the old setup was right here: expanding the entire file (or the entire split wav) into memory before sending it to the API. But looking closely, the transcription API actually supported streaming input.

So this became possible: read the mp4 from storage a little at a time, convert it into audio chunks on the fly, and pipe it straight to the API. There's no need to buffer the whole thing in memory anywhere. Only the small chunk currently being processed sits in memory, and whatever has already been streamed is released.

Concretely: read the mp4 from storage incrementally with a section reader, parse the container with a stable third-party demuxer to extract audio chunks, decode those with fdk-aac, and pipe the result to the transcription API as a stream.

The crux of this design is that peak memory stops depending on file size. Ten minutes of audio or an hour, what sits in memory at once is only the chunk being processed. The peak becomes roughly constant.
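That claim can be demonstrated in a few lines. The sketch below forwards synthetic input of two very different sizes through one fixed buffer and records the largest chunk that was ever in flight. The types and sizes are illustrative assumptions; the real pipeline's working set also includes decoder state, but the principle is the same.

```go
package main

import (
	"fmt"
	"io"
)

// zeroReader yields an endless stream of zero bytes without storing them.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) {
	for i := range p {
		p[i] = 0
	}
	return len(p), nil
}

// peakWriter discards data but remembers the largest chunk it was handed.
type peakWriter struct {
	total int64
	peak  int
}

func (w *peakWriter) Write(p []byte) (int, error) {
	w.total += int64(len(p))
	if len(p) > w.peak {
		w.peak = len(p)
	}
	return len(p), nil
}

// streamThrough forwards size bytes through one fixed buffer and reports
// the total forwarded and the largest chunk that was in flight at once.
func streamThrough(size int64, bufSize int) (total int64, peak int, err error) {
	w := &peakWriter{}
	src := io.LimitReader(zeroReader{}, size)
	_, err = io.CopyBuffer(w, src, make([]byte, bufSize))
	return w.total, w.peak, err
}

func main() {
	for _, size := range []int64{1 << 20, 64 << 20} {
		total, peak, err := streamThrough(size, 32<<10)
		fmt.Println(total, peak, err)
	}
	// Both lines report a peak of 32768 bytes, regardless of input size.
}
```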

What to Do with ffmpeg — Isolation, Not Elimination

Here I had a call to make. The old setup used ffmpeg for the mp4 → wav conversion. What to do with it in the new one?

I wanted to avoid launching ffmpeg as a separate process on every transcription. The reasons stacked up:

Spawning a process per task means processes proliferate with concurrency, and resident memory piles up
Process startup/teardown has latency, and handing data over stdin/stdout copies it across the process boundary
Above all, an external process's memory usage can't be controlled or observed from the application side

Even if streaming keeps the application's peak memory down, if I can't hold the reins on what an external process consumes, I can't guarantee memory by design. On Cloud Run, where the memory allocation maps directly onto the bill, being able to bound the peak is exactly what breaks the cost-versus-stability tradeoff.

That said, I couldn't eliminate ffmpeg entirely. The mp4 files users upload vary in codec and container layout. The demuxer and fdk-aac can cleanly handle only known, standard formats. Push arbitrary input straight into the hot path and it breaks easily on edge cases.

So I split it like this: run ffmpeg exactly once at upload time, re-encoding into a form fdk-aac can handle while normalizing the container and metadata. From there on, the pipeline assumes only normalized mp4 flows through it.

In other words, ffmpeg wasn't eliminated. The heavy, potentially-unstable re-encoding work was pushed out of the repeatedly-executed transcription path and into a one-time pre-processing step at upload. No matter how many times transcription runs, no matter how many retries fire, the re-encode never runs again. The hot path only ever deals with clean, standard input, free from edge cases.
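A one-time normalization step along these lines might look like the sketch below. The exact flags the real pipeline uses are not given in the post, so every flag here is an assumption: `-vn` to keep audio only, `-c:a aac` to re-encode with ffmpeg's built-in AAC encoder, and `-movflags +faststart` to place the index at the front of the file. The function names are hypothetical too.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// normalizeArgs builds a hypothetical ffmpeg invocation that rewrites an
// upload into one predictable shape for the hot path.
func normalizeArgs(inputPath, outputPath string) []string {
	return []string{
		"-i", inputPath,
		"-vn",          // assumed: only the audio track is needed downstream
		"-c:a", "aac",  // re-encode audio with ffmpeg's built-in AAC encoder
		"-movflags", "+faststart", // put the moov atom at the front of the file
		outputPath,
	}
}

// normalize runs ffmpeg once for an upload. It is meant to be called from
// the upload path, never from the transcription worker.
func normalize(ctx context.Context, inputPath, outputPath string) error {
	out, err := exec.CommandContext(ctx, "ffmpeg", normalizeArgs(inputPath, outputPath)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ffmpeg normalize: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Printing the arguments keeps the sketch runnable without ffmpeg installed.
	fmt.Println(normalizeArgs("upload.mp4", "normalized.mp4"))
}
```

Because this runs at upload time, a retried transcription task reads the already-normalized file and never reaches `normalize` again.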

Looking back, this was the same idea as severing the memory multiplication: push heavy and unstable work out of the repeatedly-executed path into a one-time pre-process, and keep the hot path light and predictable. The same principle ran through both the memory problem and the input-handling problem.

What Paid Off

Memory and Cost

Peak memory stopped depending on file size, and memory spikes vanished in the vast majority of cases. It wasn't just that OOM went away — because the peak became constant, the container stopped dying even with a lower memory allocation, which translated straight into cost savings. The "raise memory ⇄ crash" tug-of-war itself disappeared.

Transcription Accuracy

This was a side effect I hadn't aimed for, and it was a big one. By dropping fixed-length splitting and sending the audio as one continuous stream, the problem of utterances getting chopped at chunk boundaries disappeared by construction. The API can process with context intact, so the dropped or misrecognized words that used to straddle boundaries went away. The accuracy problem the previous engineer had been desperately balancing with the number 15 simply resolved itself as a byproduct of the design change.

The Parameters Disappeared

Both split length and concurrency had been parameters demanding tightrope tuning. With the ceiling on peak memory lowered, the three-way tradeoff of "instances I want to run ⇄ concurrency ⇄ memory safety" loosened all at once, and neither needed nervous adjustment anymore. Rather than tweaking values as a symptomatic fix, I changed the structure of the problem — so the very things needing adjustment ceased to exist.

Tests Became Writable

Modest but meaningful: testability. The old setup depended on ffmpeg as an external process, so verifying it meant actually launching that process — effectively only possible via E2E. Slow, fragile, environment-dependent.

In the new setup, both demux and fdk-aac decoding became in-process library calls, so I could verify units like "feed this input, get this chunk, get this decode result" with unit tests. In practice it was still E2E-only at the time, and expanding unit test coverage remained on the to-do list — but at least the structure now allowed it.

Lined up side by side — control over memory, ease of mid-stream resumption, testability — these were all consequences of one thing: pulling state that had been held by an external process back into my own code. A single decision brought several good things along with it.

Design Calls and the Tradeoffs I Accepted

It wasn't all a tidy success story. Some things I consciously settled for.

fdk-aac as the decoder. It was what I landed on after searching for an AAC decoder usable from cgo at the time. Pure-Go implementations exist, but many cap out at the AAC-LC profile, and I had no confidence they'd cover the variety of input. cgo complicates the build, so I prepared the native dependency as a Docker image and pulled only the build artifact in via multi-stage build at Cloud Run deploy time, eliminating the need to recompile fdk-aac on every build.
The normalization phase has its own cost. Re-encoding involves audio codec conversion, so if it's lossy the audio quality degrades once (whether that's within a range that doesn't affect transcription accuracy is worth examining). Storing the normalized file means holding it alongside the original, and it inserts an extra step between upload and transcription-ready. I judged the payoff of a stable hot path to be worth more.
No support for formats other than mp4, by design from the start. The input domain was restricted to mp4, so I avoided over-abstraction (YAGNI).
Not writing the demuxer from scratch was deliberate too. Container parsing is a minefield of edge cases; rolling my own would shoulder all the breakage risk. Riding on a mature third-party library sidestepped that minefield. Working at a low level wasn't the goal — running stably was.

Closing

The lesson from this work is simple: if a parameter has to be tuned endlessly to stay stable, it may not be a value to adjust but a structure to remove.

The 15-second split length, the carelessly-set concurrency, the up-and-down of the memory allocation — all of it was tug-of-war happening on top of one foundation: "load everything into memory, then process." The instant I changed the foundation to streaming, the tug-of-war itself vanished.

And the idea of "pushing heavy, unstable work out of the repeatedly-executed path into a one-time pre-process" applies well beyond this one incident. Keep the hot path light and predictable. Reclaim, into your own code, the state an external process was holding. Do that, and memory, cost, and testability tend to follow.
