Nazar Boyko

Posted on Jun 16 • Originally published at nazarboyko.com

Building AI APIs With Node.js

#node #openaiapi #streaming #sse

Here's an endpoint that looks completely fine:

routes/chat.ts

import OpenAI from "openai";

const openai = new OpenAI();

app.post("/api/chat", async (req, res) => {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: req.body.messages,
  });
  res.json(completion.choices[0].message);
});

It compiles. It works in the demo. Your PM clicks the button, the answer shows up a few seconds later, everyone claps.

Then it ships, and the cracks show up one at a time. Users stare at a spinner for eight seconds because nothing streams. A rate-limit blip on OpenAI's side turns into a 500 on your side. Finance asks how much the feature costs per user and nobody can answer. And one day a request quietly hangs for ten minutes because that's the SDK's default timeout and nobody changed it.

None of those are AI problems. They're backend problems wearing an AI costume. The model call is the easy 10%. The other 90% is the same work you'd do wrapping any flaky, expensive, slow upstream service. You've just never had an upstream that bills you per word and answers one token at a time.

This is about that 90%: streaming the response to the browser through your own server, making retries actually trustworthy, and tracking tokens so the numbers mean something. Code's in TypeScript with the official openai SDK, but the ideas port to any runtime.

The Model Call Is An Upstream Service, Treat It Like One

Before any of the fancy stuff, internalize one thing: openai.chat.completions.create() is an HTTP call to a server you don't control. It can be slow. It can rate-limit you. It can return a 500. It can hang. Every instinct you've built wrapping payment gateways and third-party APIs applies here.
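One way to act on that instinct is to decide up front how upstream failures translate into your own API's responses, so a rate-limit blip doesn't surface as a bare 500. This is a minimal sketch, not something the SDK prescribes: the name toClientResponse, the error codes, and the specific status choices are all assumptions for illustration. It relies only on the upstream HTTP status, which the SDK's API errors expose as status (connection failures have none).

```typescript
// Hypothetical mapping from an upstream failure to what our own API returns.
interface ClientResponse {
  status: number;
  code: string;
  retryAfterSeconds?: number;
}

function toClientResponse(upstreamStatus: number | undefined): ClientResponse {
  // No status at all: the connection failed or timed out before a response.
  if (upstreamStatus === undefined) {
    return { status: 504, code: "upstream_timeout" };
  }
  // Upstream rate limit: ask our caller to back off instead of reporting a crash.
  // The 5-second hint is an assumed value, not one taken from the upstream.
  if (upstreamStatus === 429) {
    return { status: 503, code: "upstream_busy", retryAfterSeconds: 5 };
  }
  // Upstream outage: a bad gateway, not a bug in our handler.
  if (upstreamStatus >= 500) {
    return { status: 502, code: "upstream_error" };
  }
  // Anything else means the request we built was rejected.
  return { status: 400, code: "upstream_rejected_request" };
}
```

In a route handler you'd call this from the catch block, before any bytes have been written, and use the result to set the status and JSON body.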

The SDK gives you two surfaces. The older chat.completions.create(), the one everybody knows, and the newer responses.create(), the Responses API, which OpenAI now recommends for new work because it was designed around streaming and tool calls from the start and gives you typed, semantic events instead of raw deltas. I'll show both where they differ, because most existing code is still on Chat Completions and you'll meet it in the wild.

Start the client once, not per request:

lib/openai.ts

import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment by default.
export const openai = new OpenAI({
  timeout: 30_000,   // 30s, not the 10-minute default — more on that below
  maxRetries: 2,     // this is also the default; being explicit documents intent
});

Two options on that constructor quietly decide how your API behaves under stress. Let's earn the right to set them by understanding what they do.

Stream The Answer, Don't Make People Wait For It

The single biggest perceived-quality win for an AI feature isn't a better model. It's streaming. A response that starts appearing in 300ms feels faster than one that lands complete in 3 seconds, even if the streamed one finishes no sooner, or slightly later. You're trading total time for time-to-first-token, and humans care far more about the second number.
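If you want to see those two numbers for your own endpoint, you can wrap any stream of text chunks in a small timer. This is a rough sketch under assumptions: timeStream and StreamTimings are made-up names, and the source is any async iterable of strings rather than a specific SDK type.

```typescript
// Timings for one streamed answer, measured from the start of iteration.
interface StreamTimings {
  firstChunkMs: number | null; // null if the stream produced nothing
  totalMs: number;
  chunks: number;
}

// Consumes the stream, hands each chunk to onChunk, and records when the
// first chunk arrived and when the stream finished.
async function timeStream(
  source: AsyncIterable<string>,
  onChunk: (chunk: string) => void,
): Promise<StreamTimings> {
  const startedAt = Date.now();
  let firstChunkMs: number | null = null;
  let chunks = 0;

  for await (const chunk of source) {
    if (firstChunkMs === null) firstChunkMs = Date.now() - startedAt;
    chunks += 1;
    onChunk(chunk);
  }

  return { firstChunkMs, totalMs: Date.now() - startedAt, chunks };
}
```

Log firstChunkMs and totalMs side by side and you can see which one your users are actually feeling.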

Under the hood, when you ask for a stream the API doesn't hand you JSON. It opens a text/event-stream and pushes server-sent events: data-only SSE frames, one small chunk at a time, until it sends a terminal marker. The SDK wraps that raw stream in an async iterable so you can just loop over it.
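To make the wire format less abstract, here's a minimal sketch of what "wrapping the raw stream" involves: buffering bytes until a blank line ends a frame, then pulling out the data payload. The class name SseDataParser is made up, and it assumes frames use \n line endings and that you only care about data: lines. The SDK's own parser handles more than this.

```typescript
// Incremental parser for SSE data payloads. Network chunks can split a frame
// anywhere, so we buffer until we see the blank line that terminates a frame.
class SseDataParser {
  private buffer = "";

  // Feed one network chunk; returns the data payload of every frame it completed.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const payloads: string[] = [];

    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const frame = this.buffer.slice(0, boundary);
      this.buffer = this.buffer.slice(boundary + 2);

      // Keep only data: lines, dropping the single optional space after the colon.
      const dataLines = frame
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, ""));

      if (dataLines.length > 0) payloads.push(dataLines.join("\n"));
      boundary = this.buffer.indexOf("\n\n");
    }

    return payloads;
  }
}
```

The important behavior is that a half-delivered frame returns nothing yet; the payload only appears once the terminating blank line has arrived.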

With Chat Completions:

Streaming chat completions

const stream = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}

The Responses API does the same thing with named events instead of you fishing through choices[0].delta:

Streaming the Responses API

const stream = await openai.responses.create({
  model: "gpt-4o-mini",
  input: question,
  stream: true,
});

for await (const event of stream) {
  if (event.type === "response.output_text.delta") {
    process.stdout.write(event.delta);
  }
}

That event.type switch is the whole pitch for the Responses API: you get response.output_text.delta for text, separate events for tool calls, and a terminal response.completed event, instead of one undifferentiated firehose you have to pattern-match by hand.

Relaying The Stream Through Your Own Server

Here's the part the tutorials skip. You almost never want the browser talking to OpenAI directly: your API key would be sitting in client code, and you'd have no place to enforce auth, rate limits, or logging. So your Node server sits in the middle: it consumes the OpenAI stream and re-emits it to the browser as its own SSE stream. A relay race, where your server is the runner in the middle who never gets to stop.

routes/chat.ts - SSE relay

app.post("/api/chat", async (req, res) => {
  // 1. Open an SSE response to the browser.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  try {
    // 2. Consume the upstream stream. The call sits inside the try because
    //    the headers are already sent: a failure here has to become an error
    //    frame too, not an unhandled rejection that leaves the response open.
    const stream = await openai.responses.create({
      model: "gpt-4o-mini",
      input: req.body.messages,
      stream: true,
    });

    for await (const event of stream) {
      if (event.type === "response.output_text.delta") {
        // 3. Re-emit each chunk as our own SSE frame.
        res.write(`data: ${JSON.stringify({ text: event.delta })}\n\n`);
      }
    }
    res.write("data: [DONE]\n\n");
  } catch (err) {
    // Log the real error server-side; the browser only gets a generic code.
    console.error("chat stream failed", err);
    res.write(`data: ${JSON.stringify({ error: "stream_failed" })}\n\n`);
  } finally {
    res.end();
  }
});

Three things in there are non-negotiable in production and easy to forget:

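The client-disconnect case. If the user closes the tab halfway through a long answer, your for await loop keeps pulling tokens from OpenAI, and you keep paying for them. Listen for the response's close event with res.on("close", ...), check that you haven't already ended the response yourself, and abort the upstream request (the SDK accepts an AbortSignal via the signal request option) so a bored user doesn't run up your bill. Listen on the response rather than the request: since Node 16, the request's close event fires once the request has completed, not when the socket drops.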

The error mid-stream case. Once you've sent 200 OK and started writing frames, you can't suddenly send a 500: the headers are already gone. So errors that happen after the first byte have to be communicated inside the stream, as a data: frame your client knows how to interpret. That catch block isn't optional politeness; it's the only way to tell the browser something broke.

The flush case. Behind a reverse proxy or some compression middleware, your tiny SSE frames can get buffered until they're "worth" sending, which defeats the entire point of streaming. Disable compression on this route and make sure nothing between you and the user is holding chunks hostage.
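The disconnect case is the one that's easiest to show in a few lines. Here's a sketch of a helper that turns "the browser went away" into an AbortSignal you can hand to the upstream call. The name abortSignalFor and the ClosableResponse interface are assumptions for illustration; the interface describes only the two members the helper touches, which Express's res and Node's http.ServerResponse both provide.

```typescript
// The minimal shape of a response object this helper relies on.
interface ClosableResponse {
  writableEnded: boolean;
  on(event: "close", listener: () => void): unknown;
}

// Returns a signal that aborts when the client disconnects before we finish.
function abortSignalFor(res: ClosableResponse): AbortSignal {
  const controller = new AbortController();

  res.on("close", () => {
    // "close" also fires after a normal res.end(); only abort on an early drop.
    if (!res.writableEnded) controller.abort();
  });

  return controller.signal;
}

// Usage inside the route, passing the signal as a request option:
//   const signal = abortSignalFor(res);
//   const stream = await openai.responses.create({ ...params, stream: true }, { signal });
```

When the signal fires, the for await loop ends with an error, so your catch block should treat an abort as a normal outcome rather than something worth alerting on.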

Retries: The SDK Already Does More Than You Think (And Less)

Now the unglamorous reliability work. Good news first: the SDK retries for you. By default it retries failed requests 2 times, with a short exponential backoff, on exactly the errors that are worth retrying: connection errors, 408 Request Timeout, 409 Conflict, 429 Rate Limit, and any 5xx. It reads the Retry-After header when one's present instead of guessing. You can tune or kill that behavior:

Tuning retries

// Globally, on the client:
const openai = new OpenAI({ maxRetries: 3 });

// Or per request, when one call deserves different treatment:
await openai.responses.create(
  { model: "gpt-4o-mini", input },
  { maxRetries: 5 },
);
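If "exponential backoff that honors Retry-After" feels hand-wavy, this is roughly the shape of such a policy. To be clear, this is an illustration and not the SDK's actual implementation: the function names and the 500ms base and 8s cap are assumed values, and production implementations typically add random jitter on top, which is left out here to keep the output predictable.

```typescript
// Illustrative retry policy. Not the SDK's code.
const RETRYABLE_STATUSES = new Set([408, 409, 429]);

// undefined models a connection error: no HTTP response was received at all.
function isRetryable(status: number | undefined): boolean {
  if (status === undefined) return true;
  return RETRYABLE_STATUSES.has(status) || status >= 500;
}

// attempt is zero-based: 0 for the first retry, 1 for the second, and so on.
function retryDelayMs(attempt: number, retryAfterSeconds?: number): number {
  // Prefer the server's own hint over guessing.
  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0) {
    return retryAfterSeconds * 1000;
  }
  const baseMs = 500; // assumed starting delay
  const capMs = 8_000; // assumed upper bound
  return Math.min(baseMs * 2 ** attempt, capMs);
}
```

A 400 or 401 never shows up in the retryable set, which is the point: retrying a request that was rejected for being wrong just burns time.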

That covers transient failures better than the hand-rolled try/catch most people would write. So where's the catch?

The catch is that retries and streaming don't mix the way you'd hope. The automatic retry happens during connection setup, before the first byte arrives. Once a stream has started flowing and dies in the middle, the SDK can't transparently retry it, because it would have to replay the half-delivered response. Half a token stream is gone. If resilience mid-stream matters to you, you own that: catch the error, and either restart the whole generation or accept the partial answer. There's no free lunch on a connection that's already talking.
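Owning it can look like the sketch below: restart the whole generation a bounded number of times, and fall back to the longest partial answer if every attempt dies. The names generateWithRestart and StartGeneration are made up for illustration, and the generation is modelled as any async iterable of text chunks rather than a specific SDK stream.

```typescript
// A generation is anything that yields text chunks. The factory is called
// again for every restart, so each attempt is a fresh upstream request.
type StartGeneration = () => AsyncIterable<string>;

interface GenerationResult {
  text: string;
  complete: boolean; // false means we gave up and returned a partial answer
  attempts: number;
}

async function generateWithRestart(
  start: StartGeneration,
  maxAttempts: number,
): Promise<GenerationResult> {
  let longestPartial = "";

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let text = "";
    try {
      for await (const chunk of start()) text += chunk;
      return { text, complete: true, attempts: attempt };
    } catch {
      // The stream died mid-flight. Remember the longest partial in case
      // we run out of attempts and decide to accept it.
      if (text.length > longestPartial.length) longestPartial = text;
    }
  }

  return { text: longestPartial, complete: false, attempts: maxAttempts };
}
```

Two caveats. This version buffers the answer, so if you've already relayed partial text to the browser you also need a way to tell the client to discard it. And every restart is a new generation you pay for in full.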

The second catch is subtler and more dangerous. Retries are only safe on idempotent operations, and an LLM call usually isn't one, especially once it can call tools. If your model invocation triggers a tool that charges a card or sends an email, an automatic retry on a 409 or a timeout can fire that side effect twice. The request "failed" from the SDK's point of view, but the tool already ran. And the SDK won't save you here: it doesn't send an idempotency key automatically. There's an optional idempotencyKey request option, but you have to set it yourself, so nothing dedupes your retries unless you wire it up. The rule from regular backend work holds exactly: make the side effects idempotent, or don't let them auto-retry. Streaming and tools make this easy to forget; the bill and the duplicate emails will remind you.
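Making a side effect idempotent usually comes down to keying it on something stable and refusing to run it twice. Here's a minimal sketch, assuming a made-up helper runToolOnce and an in-memory Map as the store. A real service would use a database table with a unique constraint on the key so the guarantee survives restarts and multiple instances.

```typescript
// In-memory store for illustration only. It maps an idempotency key to the
// in-flight or finished result of the side effect.
const toolCallResults = new Map<string, Promise<unknown>>();

async function runToolOnce<T>(
  idempotencyKey: string,
  sideEffect: () => Promise<T>,
): Promise<T> {
  // A retry (or a concurrent duplicate) gets the original result back.
  const existing = toolCallResults.get(idempotencyKey);
  if (existing) return existing as Promise<T>;

  const pending = sideEffect();
  toolCallResults.set(idempotencyKey, pending);

  try {
    return await pending;
  } catch (err) {
    // The side effect failed, so allow a later attempt to run it again.
    toolCallResults.delete(idempotencyKey);
    throw err;
  }
}
```

The key has to come from your own business identifiers, something like an order ID plus the action, not from anything generated fresh on each attempt. Otherwise every retry looks like a brand-new operation.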

Warning
The SDK's default request timeout is 10 minutes (600,000 ms). That's a sane default for a batch job and a terrible one for a user-facing endpoint. A wedged request will hold a connection, a worker slot, and the user's patience for ten full minutes before giving up. Set timeout to something humane (20-60s for interactive calls) on day one. When a request does time out, the SDK throws APIConnectionTimeoutError and, yes, retries it twice by default.

Token Tracking: The Number That's Null Until It Isn't

Every call costs money measured in tokens, split into input (your prompt) and output (the model's answer). If you want per-user cost, per-feature cost, or just an alert before someone's runaway loop spends your quarterly budget, you have to capture token usage on every call. This is where streaming sets a trap.

On a normal, non-streamed call, usage is right there in the response:

Usage on a non-streamed call

const res = await openai.responses.create({ model: "gpt-4o-mini", input });
console.log(res.usage);
// { input_tokens, output_tokens, total_tokens }

Easy. Now stream a Chat Completions call and reach for chunk.usage, and you'll find nothing there. On every chunk. The counterintuitive bit that bites everyone exactly once: when you stream Chat Completions, usage isn't reported by default at all, and even when you turn it on, it stays null on every chunk except a final extra one sent after the content is done, a chunk whose choices array is empty. You have to opt in:

Getting usage out of a Chat Completions stream

const stream = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
  stream: true,
  stream_options: { include_usage: true }, // <- the line everyone forgets
});

let usage;
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) res.write(`data: ${JSON.stringify({ text: delta })}\n\n`);

  // usage is null on every chunk EXCEPT the final one.
  if (chunk.usage) usage = chunk.usage;
}

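// Now you can log it. Chat Completions names the fields prompt_tokens and
// completion_tokens, so map them to the input/output shape recordUsage expects.
// usage stays undefined if the stream died before the final chunk.
if (usage) {
  await recordUsage({
    userId,
    model: "gpt-4o-mini",
    usage: {
      input_tokens: usage.prompt_tokens,
      output_tokens: usage.completion_tokens,
      total_tokens: usage.total_tokens,
    },
  });
}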

The Responses API is friendlier here: its terminal response.completed event carries the finished response object, usage block included. But the underlying truth is the same: usage is a property of the completed generation, not of any individual token, so it can't show up until the stream is over. Once you've got that mental model, the nulls stop being surprising.
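For the Responses API side, collecting usage looks something like the sketch below: forward text deltas as they arrive, then read the usage block off the terminal event. The names relayAndCollectUsage, StreamEvent, and TokenUsage are assumptions; the interfaces are deliberately minimal structural types covering only the fields this code reads, not the SDK's full event types.

```typescript
// Only the fields this sketch reads.
interface TokenUsage {
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
}

interface StreamEvent {
  type: string;
  delta?: string;
  response?: { usage?: TokenUsage | null };
}

// Forwards text deltas and resolves with the usage from the terminal event,
// or null if the stream ended without one.
async function relayAndCollectUsage(
  events: AsyncIterable<StreamEvent>,
  onText: (delta: string) => void,
): Promise<TokenUsage | null> {
  let usage: TokenUsage | null = null;

  for await (const event of events) {
    if (event.type === "response.output_text.delta" && event.delta !== undefined) {
      onText(event.delta);
    } else if (event.type === "response.completed") {
      usage = event.response?.usage ?? null;
    }
  }

  return usage;
}
```

A null result is worth logging in its own right: it means the generation never reached its terminal event, so you were likely billed for tokens you have no count for.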

What you do with the numbers is the actual feature:

lib/usage.ts - turning tokens into money and limits

// Prices are per 1M tokens and change often — keep them in config, not code.
const PRICING = {
  "gpt-4o-mini": { input: 0.15, output: 0.60 }, // example shape, not live rates
};

export async function recordUsage({ userId, model, usage }) {
  const p = PRICING[model];
  const cost =
    (usage.input_tokens / 1_000_000) * p.input +
    (usage.output_tokens / 1_000_000) * p.output;

  await db.usage.insert({ userId, model, ...usage, cost, at: new Date() });

  // Cheap guardrail: stop a runaway user before the invoice does.
  const spentToday = await db.usage.sumCostSince(userId, startOfDay());
  if (spentToday > DAILY_LIMIT) {
    throw new SpendLimitError(userId);
  }
}

Don't hardcode prices in the middle of business logic: they change, and you don't want a deploy every time they do. And log the raw token counts, not just the dollar figure: when you switch models or renegotiate pricing, you'll want to re-cost history, and you can only do that if you kept the tokens.
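Re-costing history is a few lines once the raw counts are stored. A sketch, assuming made-up names (UsageRow, PriceTable, recost) and a price table expressed in dollars per 1M tokens like the one above; the model name and prices in the usage example are placeholders, not real rates.

```typescript
// One stored row per call: the raw counts, not just the dollar figure.
interface UsageRow {
  model: string;
  input_tokens: number;
  output_tokens: number;
}

// Prices in USD per 1M tokens, keyed by model name.
type PriceTable = Record<string, { input: number; output: number }>;

function costOf(row: UsageRow, prices: PriceTable): number {
  const price = prices[row.model];
  if (!price) throw new Error(`No price configured for model ${row.model}`);
  return (
    (row.input_tokens / 1_000_000) * price.input +
    (row.output_tokens / 1_000_000) * price.output
  );
}

// Total cost of a set of historical rows under any price table.
function recost(rows: UsageRow[], prices: PriceTable): number {
  return rows.reduce((sum, row) => sum + costOf(row, prices), 0);
}

// Usage with placeholder numbers: the same history under two price tables.
const history: UsageRow[] = [
  { model: "example-model", input_tokens: 1_000_000, output_tokens: 500_000 },
];
const oldTotal = recost(history, { "example-model": { input: 2, output: 4 } });
const newTotal = recost(history, { "example-model": { input: 1, output: 2 } });
```

Throwing on an unknown model is deliberate: a silent zero would make an unpriced model look free in every report.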

Putting It Together

Strip away the AI and what's left is a checklist you already know how to read. Treat the model as a flaky upstream and give it a real timeout. Stream through your own server so you control auth, logging, and the bill, then handle disconnects and mid-stream errors, because those will happen. Lean on the SDK's built-in retries, but remember they stop at the first byte and that retrying a tool-calling request can double a side effect. Capture token usage on every call, knowing it only shows up at the end of a stream and only if you ask for it.

The demo endpoint at the top of this post isn't wrong, exactly. It's just unfinished. It's the 10%. The gap between that and something you'd put your name on is ordinary, careful backend engineering. The model is the new part. Everything that makes it survive contact with real users is work you've done a hundred times before.


Top comments (1)

Diya • Jun 16

The relay race analogy clicked immediately. The mid-stream error section is the part most tutorials completely skip — once you've sent 200 OK you can't go back, so errors have to live inside the stream itself. Hit that exact wall building a WebSocket streaming gateway. The stream_options: { include_usage: true } trap is also real; token counts silently null for weeks is a painful way to discover it. The "model call is 10%, backend is 90%" framing should be at the top of every AI integration tutorial.