Farouq Seriki

Posted on • Originally published at fasthedev.netlify.app

How I Took a Client's Full-Stack Platform From Zero Infrastructure to 100K-User Ready — Starting at ~$25/Month

I recently took a client's full-stack platform from "it works on my laptop" to production-grade infrastructure that can handle 100,000 users — across a Node.js backend, Next.js web app, and React Native mobile app.

The client was an AWS-first developer. The instinct was to solve everything with AWS services. But the reality of a startup is that you need to move fast, keep costs near zero at launch, and scale without re-architecting. I convinced him to go multi-vendor — picking the best tool for each job instead of forcing everything into one vendor's ecosystem and paying for features he didn't need.

This isn't a "use these 7 services" listicle. This is the real decision-making process behind each choice, the actual code I shipped, and the cost math that made it work.


🔍 The Starting Point

The platform had:

  • Backend: Express 5 + TypeScript + PostgreSQL (Drizzle ORM), deployed on AWS App Runner
  • Web: Next.js (App Router) + React 19 + Tailwind, deployed on Vercel
  • Mobile: React Native (Expo) + TypeScript

What it didn't have:

  • ❌ No caching layer (every request hit the database)
  • ❌ No error tracking (errors disappeared into CloudWatch logs nobody checked)
  • ❌ No uptime monitoring (I'd learn about downtime from users)
  • ❌ No analytics (zero insight into what users actually do)
  • ❌ No CDN for assets (S3 presigned URLs generated per-request)
  • ❌ No graceful shutdown (deploys killed active connections)
  • ❌ No API resilience (one server hiccup = failed request)
  • ❌ N+1 query patterns in critical paths
  • ❌ No pagination on endpoints returning entire tables

The backend was doing real work — 14 database tables, JWT auth with OTP, real-time messaging via Socket.IO, payment processing, S3 file uploads. But it was held together with hope. One traffic spike and the database pool would exhaust, the server would run out of memory and crash, and nobody would know until a user complained on WhatsApp.


⚖️ Scale Up vs. Scale Out: I Chose Both

Before touching any code, I had to answer the fundamental question: when traffic grows 10x, do you make the boxes bigger (scale up) or add more boxes (scale out)?

The answer: I architected for scale-out, but the database scales up first.

| Layer | Strategy | Why |
| --- | --- | --- |
| 🖥️ Backend (App Runner) | Scale out | Auto-adds containers based on concurrent requests. No instance to resize. |
| 🌐 Web (Vercel) | Scale out | Serverless functions: each request gets isolated compute. |
| ⚡ Cache (Upstash Redis) | Scale out | Serverless, per-request pricing. No nodes to manage. |
| 🌍 CDN (CloudFront) | Scale out | 400+ edge locations. Distributed by design. |
| 🗄️ Database (PostgreSQL) | Scale up, then out | Bigger instance first, then read replicas when query load demands it. |
| 📱 Mobile | N/A | Runs on user devices. I optimize, not scale. |

The principle: make the stateless layers horizontally scalable, and protect the single stateful layer (the database) with caching and query optimization so it doesn't need to scale until much later.

This is why caching was task #1, not #5.


🗄️ Layer 1: Database — The Foundation Nobody Sees

Before adding any external service, I fixed the database layer. The fastest cache is a query that doesn't run.

Indexes Matched to Actual Query Shapes

The listings table is the most-read table in the system. Every homepage load, every search, every filter hits it. But there were zero indexes beyond the primary key.

Naively, you'd slap a single-column index on every WHERE clause column. I did that first — then looked at the actual queries and realized most of those indexes would never be used. A status index on a column with only 4 possible values? Postgres will seq-scan that anyway. A city index when every query uses LIKE '%lagos%'? B-tree can't help there.

Instead, I matched indexes to the actual query shapes:

-- Partial index: only indexes published listings (80% of queries)
-- Eliminates both the filter AND the sort step for the public feed
CREATE INDEX listings_published_feed_idx
  ON listings (published_at DESC)
  WHERE status = 'published';

-- Composite indexes: lead column for equality, second for sort
-- "Show me host 7's bookings, newest first" → one index scan, no sort
CREATE INDEX bookings_host_date_idx
  ON bookings (host_id, start_date DESC);

CREATE INDEX bookings_user_date_idx
  ON bookings (user_id, start_date DESC);

The partial index is the big win. The public feed query (WHERE status = 'published' ORDER BY published_at DESC LIMIT 20) went from scanning every row in the table to reading exactly 20 index entries. I benchmarked on a 20K-row synthetic dataset: buffer reads dropped from 649 to 22 (the database touches 97% fewer pages), and the sort step disappears entirely — Postgres walks the index in order instead of sorting in memory. Wall-clock times went from 242ms to 0.14ms, though in production the public feed is cached with a 2-minute TTL so most requests never hit this query at all. The real win isn't a raw timing multiplier — it's that the query stops scaling with table size. The composite booking indexes showed the same pattern: 903 → 22 buffer reads for host bookings, 900 → 22 for guest bookings.

These 3 indexes replace what would otherwise be 8 single-column indexes (one per filtered column). The design avoids redundant indexes rather than piling them on — every index costs storage and a per-row write penalty, so the skill is matching the index to the query shape, not maximizing index count.

💰 Cost: $0 on the bill, but not free in overhead — every index adds storage and a write penalty per row insert/update. That's exactly why I used 3 targeted indexes instead of 8 redundant single-column ones. For a read-heavy marketplace the tradeoff is worth it, especially since the highest-write table (messages) is insert-only — the most index-friendly write pattern there is.

Connection Pool Tuning

The default pg pool ships with max: 10 connections. For a server handling concurrent API requests and Socket.IO connections, that's a bottleneck waiting to happen.

export const pool = new Pool({
  connectionString: config.db.url,
  max: 20,                        // doubled from default
  idleTimeoutMillis: 30_000,      // release idle connections after 30s
  connectionTimeoutMillis: 5_000, // fail fast if pool is exhausted
});

Why 20 and not 100? Because PostgreSQL itself has a max_connections limit (typically 100 on small managed instances, configurable higher on larger ones). If you have 3 App Runner containers each holding 20 connections, you're at 60, safely under the limit with headroom for admin connections. Going higher risks connection exhaustion at the database level, which is much harder to debug than a pool timeout. Note that five containers at 20 connections each would already reach a 100-connection limit, so before scaling out that far, add a connection pooler like PgBouncer or Supabase's built-in pooler to multiplex connections (or move to a larger instance with a higher max_connections).
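The arithmetic above can be turned into a quick budget check. This is a minimal sketch, not code from the project: the helper names and the default reserve of 10 connections for admin and migration sessions are assumptions made for illustration.

```typescript
// Hypothetical helper: total connections the fleet can hold open at once.
function fleetConnections(poolMax: number, instances: number): number {
  return poolMax * instances;
}

// Hypothetical helper: the largest per-container pool that keeps the whole
// fleet under PostgreSQL's max_connections, after setting aside a reserve.
function maxPoolPerInstance(
  maxConnections: number, // PostgreSQL max_connections
  instances: number,      // containers at peak
  reserved = 10,          // assumed headroom for admin/migration sessions
): number {
  if (instances < 1) throw new RangeError('instances must be >= 1');
  return Math.max(Math.floor((maxConnections - reserved) / instances), 0);
}
```

With three containers and pools of 20, `fleetConnections(20, 3)` gives the 60 mentioned above. At five containers, `maxPoolPerInstance(100, 5)` drops to 18, which is the point where the pool size or the pooling strategy has to change.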

Pagination: Stop Returning Entire Tables

The public listings endpoint was doing SELECT * FROM listings WHERE status = 'published'. For 50 listings, fine. For 5,000? You're serializing megabytes of JSON per request. 😬

router.get('/', async (req, res) => {
  const limit = Math.min(Math.max(Number(req.query.limit) || 20, 1), 50);
  const offset = Math.max(Number(req.query.offset) || 0, 0);

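  // Count total for pagination metadata.
  // count(*) is a bigint in Postgres, which node-postgres returns as a string,
  // so cast to int to get a real number in the JSON response.
  const [{ count: total }] = await db
    .select({ count: sql<number>`count(*)::int` })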
    .from(listings)
    .where(eq(listings.status, 'published'));

  const rows = await db
    .select()
    .from(listings)
    .where(eq(listings.status, 'published'))
    .orderBy(desc(listings.publishedAt))
    .limit(limit)
    .offset(offset);

  res.json({
    listings: rows,
    total,
    limit,
    offset,
    hasMore: offset + rows.length < total,
  });
});

Why limit/offset and not cursor-based? For a marketplace listing page, users jump to arbitrary pages ("show me page 7"). Cursor-based pagination is better for infinite scroll (which the mobile app does), but the web app needs random access. I cap at 50 items per page — that's the safety valve.

Caveat: OFFSET pagination degrades on deep pages — OFFSET 10000 means the database still scans and discards 10,000 rows. For a marketplace with <10K published listings, this is fine. If listings grow past that, I'd switch the mobile app's infinite scroll to cursor-based (WHERE published_at < :lastSeen ORDER BY published_at DESC LIMIT 20) while keeping offset for the web's page-jump UI. The count(*) also runs on every uncached request — for now the Redis cache absorbs this, but at scale I'd consider a materialized count or remove exact totals in favor of "has more" pagination.
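To make the cursor-based alternative concrete, here is a minimal sketch of keyset pagination. It is not the project's code: `getPage`, the `Cursor` shape, and the injected `fetchAfter` function are hypothetical, and the `id` tiebreaker is an addition on top of the `published_at < :lastSeen` query above (it keeps paging stable when two listings share a timestamp).

```typescript
// Row shape assumed for illustration; only id and publishedAt matter here.
type Listing = { id: number; publishedAt: string }; // ISO-8601 timestamp
type Cursor = { publishedAt: string; id: number };
type Page = { listings: Listing[]; nextCursor: Cursor | null };

// `fetchAfter` stands in for the real query, e.g.
//   WHERE status = 'published'
//     AND (published_at, id) < (:publishedAt, :id)
//   ORDER BY published_at DESC, id DESC
//   LIMIT :take
async function getPage(
  fetchAfter: (cursor: Cursor | null, take: number) => Promise<Listing[]>,
  cursor: Cursor | null,
  limit = 20,
): Promise<Page> {
  // Ask for one extra row: its presence signals another page,
  // so no count(*) is needed.
  const rows = await fetchAfter(cursor, limit + 1);
  const listings = rows.slice(0, limit);
  const last = listings[listings.length - 1];
  const hasMore = rows.length > limit;
  return {
    listings,
    nextCursor: hasMore && last ? { publishedAt: last.publishedAt, id: last.id } : null,
  };
}
```

The client sends back `nextCursor` to get the following page and stops when it is `null`. Because each page starts from an indexed position instead of skipping rows, the cost per page stays flat no matter how deep the user scrolls.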

🐛 Killing the N+1 in Conversations

The conversations endpoint had two problems. First, it loaded every message for every conversation into memory just to find the last message and compute unread counts — for a user with 30 conversations averaging 50 messages each, that's 1,500 rows loaded to produce 30 numbers. Second, after that was partially fixed, the unread count still ran one query per conversation in a loop (N+1).

Fix #1: Last message per conversation — replaced "load all messages, deduplicate in JS" with a single DISTINCT ON query:

-- One query returns exactly one row per conversation: the latest non-null message
SELECT DISTINCT ON (conversation_id)
  conversation_id, body
FROM messages
WHERE conversation_id IN (6, 7, 8, 9, ...)
  AND body IS NOT NULL
ORDER BY conversation_id, created_at DESC

Fix #2: Unread counts — replaced per-conversation loop with a single JOIN + GROUP BY:

// readAtCol resolves to the correct column based on caller's role
const readAtCol = auth.entityType === 'user'
  ? conversations.userLastReadAt
  : auth.entityType === 'host'
    ? conversations.hostLastReadAt
    : conversations.adminLastReadAt;

const unreadResults = await db
  .select({
    conversationId: messages.conversationId,
    count: sql<number>`count(*)::int`,
  })
  .from(messages)
  .innerJoin(conversations, eq(conversations.id, messages.conversationId))
  .where(and(
    inArray(messages.conversationId, conversationIds),
    sql`${messages.createdAt} > coalesce(${readAtCol}, '1970-01-01'::timestamp)`
  ))
  .groupBy(messages.conversationId);

The key design: unread tracking uses a lastReadAt timestamp per role on the conversation row (not a per-message readAt flag). The COALESCE to epoch handles the "never read" case — if lastReadAt is NULL, every message counts as unread.

Both fixes together: N queries → 2 queries, regardless of conversation count. On 2K conversations with 100K messages, each query runs in under 1ms.
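Once the two queries return, their rows still have to be stitched back onto the conversation list. A minimal sketch of that merge step follows; the type names and the `summarize` function are hypothetical, and the row shapes are assumed from the two queries above.

```typescript
type LastMessage = { conversationId: number; body: string };
type UnreadCount = { conversationId: number; count: number };
type Summary = { conversationId: number; lastMessage: string | null; unread: number };

// Joins the results of the two queries in memory with O(1) lookups per
// conversation, instead of issuing a query per conversation.
function summarize(
  conversationIds: number[],
  lastMessages: LastMessage[],
  unreadCounts: UnreadCount[],
): Summary[] {
  const lastById = new Map<number, string>();
  for (const m of lastMessages) lastById.set(m.conversationId, m.body);

  const unreadById = new Map<number, number>();
  for (const u of unreadCounts) unreadById.set(u.conversationId, u.count);

  return conversationIds.map((id) => ({
    conversationId: id,
    lastMessage: lastById.get(id) ?? null,
    // GROUP BY returns no row for conversations with zero unread messages
    unread: unreadById.get(id) ?? 0,
  }));
}
```

The default of 0 matters: a conversation with nothing unread is simply absent from the GROUP BY result, so the merge has to fill that gap.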

💰 Cost: $0. Better queries are free.


⚡ Layer 2: Caching — Upstash Redis Over ElastiCache

This was the first architecture decision where I diverged from the client's AWS instinct.

The AWS answer: ElastiCache (managed Redis). Minimum cost: ~$12/month for a cache.t4g.micro running 24/7, even if you serve 10 requests a day.

What I chose: Upstash Redis. Cost: free for the first 500K commands/month, then $0.2 per 100K commands. For a startup doing 500 requests/day (~15K commands/month), that's well within the free tier indefinitely.

Why Upstash over ElastiCache:

| Factor | ElastiCache | Upstash |
| --- | --- | --- |
| 💰 Cost at 0 traffic | ~$12/mo | $0 |
| 💰 Cost at 100K req/day | ~$12/mo | ~$6/mo |
| 🔧 Provisioning | VPC, security groups, subnet groups | One API key |
| 🔌 Connection model | Persistent TCP (needs VPC Connector) | HTTP REST (works from anywhere) |
| 📈 Scaling | Manual instance resize | Automatic |

The HTTP-based model is the key insight. App Runner runs in AWS's managed VPC — connecting it to ElastiCache requires setting up a VPC Connector (additional config, and your ElastiCache must be in a VPC with compatible subnets). Upstash uses REST over HTTPS, so it works from anywhere with zero networking config.

The Cache Implementation

The cache utility is 53 lines and handles its own failures gracefully:

import { Redis } from '@upstash/redis';

let redis: Redis | null = null;

const getRedis = () => {
  if (redis) return redis;
  if (!config.cache.redisUrl || !config.cache.redisToken) return null;
  redis = new Redis({ url: config.cache.redisUrl, token: config.cache.redisToken });
  return redis;
};

export const cacheGet = async <T>(key: string): Promise<T | null> => {
  const client = getRedis();
  if (!client) return null;
  try {
    return await client.get<T>(key);
  } catch {
    return null; // cache miss on error — DB still serves the request
  }
};

export const cacheSet = async (
  key: string, value: unknown, ttlSeconds: number
): Promise<void> => {
  const client = getRedis();
  if (!client) return;
  try {
    await client.set(key, value, { ex: ttlSeconds });
  } catch {
    // cache write failed: not critical, the next miss repopulates it
  }
};

export const cacheDel = async (...keys: string[]): Promise<void> => {
  const client = getRedis();
  if (!client || keys.length === 0) return;
  try {
    await client.del(...keys);
  } catch {
    // invalidation failure — stale data expires via TTL
  }
};

Three design decisions worth calling out:

  1. 🔄 Lazy initialization: Redis client is created on first use, not at import time. If credentials aren't set, the app works normally — just without caching.

  2. 🛡️ Silent failure: Every cache operation is wrapped in try/catch that returns null or no-ops. The cache is an optimization, not a dependency. If Upstash has an outage, your app slows down but doesn't crash.

  3. ⏰ TTL-based expiry as a safety net: Even if cacheDel fails (network blip during a write), stale data expires in 2-3 minutes via TTL. This is the cache-aside pattern — you don't need perfect invalidation if you have TTLs.
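These three decisions combine naturally into a small read helper. The sketch below is an illustration rather than the project's code: `withCache` and the `CacheOps` type are hypothetical names, and the cache functions are passed in so the helper can be exercised without a Redis instance.

```typescript
// The two operations the helper needs; cacheGet/cacheSet above fit this shape.
type CacheOps = {
  get<T>(key: string): Promise<T | null>;
  set(key: string, value: unknown, ttlSeconds: number): Promise<void>;
};

// Cache-aside read: return the cached value if present, otherwise load from
// the source of truth and populate the cache with a TTL.
async function withCache<T>(
  cache: CacheOps,
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  const hit = await cache.get<T>(key);
  if (hit !== null) return hit; // explicit null check so cached 0 or '' still counts as a hit
  const fresh = await load();   // miss, or cache unavailable: fall through to the database
  await cache.set(key, fresh, ttlSeconds);
  return fresh;
}
```

A route could then call something like `withCache({ get: cacheGet, set: cacheSet }, cacheKey, 120, loadListings)`, where `loadListings` is whatever function runs the database query. Because `cacheGet` already returns null on errors, an outage simply looks like a permanent miss.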

Cache-Aside in Practice

// In public-listings.router.ts
const cacheKey = `public-listings:${JSON.stringify({ limit, offset, city, minPrice, maxPrice })}`;

// 1️⃣ Try cache first
const cached = await cacheGet(cacheKey);
if (cached) return res.json(cached);

// 2️⃣ Cache miss: query database
const rows = await db.select()...;
const response = { listings: rows, total, limit, offset, hasMore };

// 3️⃣ Populate cache for next request (2 min TTL)
await cacheSet(cacheKey, response, 120);

return res.json(response);

And on the write side, I invalidate all cached listing pages when data changes. Since Redis DEL doesn't support glob patterns, I use Upstash's SCAN + DEL approach:

// In listing.router.ts — PATCH, DELETE, publish, draft
// Invalidate all public-listing and home-feed cache entries
const keys = await scanKeys('public-listings:*', 'home-feed:*');
if (keys.length > 0) await cacheDel(...keys);

The scanKeys helper uses Upstash's scan command to find keys matching a pattern, then cacheDel removes them in one batch. This is important because Redis's DEL command only accepts exact key names — you can't pass DEL public-listings:* and expect it to work like a glob. The TTL-based expiry acts as a safety net if SCAN misses anything.
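The article does not show scanKeys itself, so here is a minimal sketch of how such a helper could look. It is an assumption, not the shipped code: unlike the call above, this version takes the client as an explicit first argument, and `ScanClient` is a hypothetical minimal interface modelled on the Redis SCAN command (cursor in, next cursor plus a batch of keys out) rather than a specific SDK version.

```typescript
// Minimal slice of a Redis client, so the sketch does not depend on a
// particular SDK release. Treat this shape as an assumption.
type ScanClient = {
  scan: (
    cursor: string | number,
    opts: { match: string; count: number },
  ) => Promise<[string | number, string[]]>;
};

async function scanKeys(client: ScanClient, ...patterns: string[]): Promise<string[]> {
  const found = new Set<string>(); // SCAN may return the same key more than once
  for (const match of patterns) {
    let cursor: string | number = 0;
    do {
      const [next, keys] = await client.scan(cursor, { match, count: 100 });
      for (const key of keys) found.add(key);
      cursor = next;
    } while (String(cursor) !== '0'); // a returned cursor of 0 ends the iteration
  }
  return Array.from(found);
}
```

SCAN walks the keyspace in batches, so cost grows with the total number of keys, not just the matching ones. That is fine for a small cache like this one; with a very large keyspace, tracking listing keys in a Redis set would be an alternative worth considering.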

📊 Result: Public listing pages that took 200-400ms on first load now serve in <50ms on cache hit. The home feed (the most-hit endpoint) went from 350ms to 30ms. That's roughly a 12x improvement for zero monthly cost.


🚨 Layer 3: Error Tracking — Sentry Across Three Platforms

Before this, errors went to console.error and CloudWatch. CloudWatch is where errors go to die — the interface is hostile, there's no grouping, no stack traces from minified code, and you can't see which user was affected.

Why Sentry over AWS CloudWatch/X-Ray:

| Factor | CloudWatch/X-Ray | Sentry |
| --- | --- | --- |
| 🔍 Error grouping | Manual log filtering | Automatic fingerprinting |
| 📄 Stack traces (minified) | Raw text | Source map support |
| 👤 User context | DIY | Built-in |
| 🔔 Alert rules | CloudWatch Alarms (complex) | 2 clicks |
| 💰 Cost | Pay per log GB ingested | Free (5K errors/mo) |
| ⏱️ Setup time | Hours (IAM, log groups, alarms) | Minutes |

I set up three Sentry projects — one per platform — all under one organization.

🖥️ Backend: Express Error Handler

// instrument.ts — must be the FIRST import in server.ts
import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN ?? '',
  environment: process.env.NODE_ENV ?? 'development',
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.2 : 1.0,
  sendDefaultPii: false,
});

// app.ts — after all routes, before any other error-handling middleware
Sentry.setupExpressErrorHandler(app);

That's it. Two lines of config, one line in the Express pipeline. Every unhandled exception now shows up in Sentry with the request URL, headers, user context, and a full stack trace. ✨

🌐 Web: Next.js with Source Maps

// sentry.client.config.ts
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: process.env.NODE_ENV === "production" ? 0.2 : 1.0,
});

// next.config.ts
import { withSentryConfig } from "@sentry/nextjs";

export default withSentryConfig(nextConfig, {
  silent: true,
  org: "your-sentry-org",
  project: "your-web-project",
});

The withSentryConfig wrapper uploads source maps during build so that production errors show original TypeScript line numbers, not minified garbage.

The global-error.tsx boundary catches React rendering errors:

"use client";
import * as Sentry from "@sentry/nextjs";
import { useEffect } from "react";

export default function GlobalError({ error, reset }: {
  error: Error & { digest?: string };
  reset: () => void;
}) {
  useEffect(() => { Sentry.captureException(error); }, [error]);

  return (
    <html><body>
      <h2>Something went wrong</h2>
      <button onClick={() => reset()}>Try again</button>
    </body></html>
  );
}

📱 Mobile: React Native

// _layout.tsx
import * as Sentry from '@sentry/react-native';

Sentry.init({
  dsn: process.env.EXPO_PUBLIC_SENTRY_DSN,
  tracesSampleRate: __DEV__ ? 1.0 : 0.2,
});

// Wrap root component
export default Sentry.wrap(RootLayout);

The tracesSampleRate tradeoff: I set it to 0.2 in production (sample 20% of transactions for performance monitoring) and 1.0 in development. At 100K users, 20% sampling still gives you statistically significant performance data without burning through Sentry's free tier quota. If you're debugging a specific performance issue, temporarily bump it to 1.0, fix it, and drop it back.

💰 Cost: $0 (free tier: 5K errors/month shared across all projects, 30-day retention). Note: errors and performance transactions are billed against separate quotas — 5K errors plus a distinct 10K performance-units allowance. The tracesSampleRate setting governs the performance-units quota, so sampling traces protects your transaction allowance without ever eating into your 5K error capacity.


📊 Layer 4: Analytics — PostHog Over Mixpanel/Amplitude

The client wanted heatmaps and session replays to understand user behavior. Also needed feature flags to push promo badges to mobile without app store updates.

Why PostHog:

| Feature | Mixpanel | Amplitude | PostHog |
| --- | --- | --- | --- |
| 🔥 Heatmaps | ✅ (added 2025) | ❌ | ✅ |
| 🎥 Session replay | ✅ (10K free/mo) | ✅ (limited) | ✅ (5K free/mo) |
| 🚩 Feature flags | ❌ | ✅ (free tier) | ✅ |
| 📱 Mobile SDK | ✅ | ✅ | ✅ |
| 💰 Free tier | 20M events | 50K MTUs | 1M events |
| 🏠 Self-host option | ❌ | ❌ | ✅ |
| 🎯 Single tool for everything | ❌ | ❌ | ✅ |

The landscape has shifted — Mixpanel and Amplitude both added session replay recently. But PostHog still wins here because it bundles all of this (analytics, heatmaps, replay, feature flags, A/B testing) under one SDK and one dashboard with no separate billing per product. Mixpanel still doesn't do feature flags, Amplitude's replay is limited, and neither offers self-hosting. For a startup that needs all these capabilities without managing multiple vendor contracts, PostHog is the only single-vendor option.

🌐 Web: Next.js App Router Integration

Next.js App Router uses client-side navigation — the browser doesn't do full page loads when you click links. This means the PostHog SDK's automatic pageview capture misses most navigation. You need a custom PageViewTracker:

"use client";
import posthog from "posthog-js";
import { PostHogProvider as PHProvider, usePostHog } from "posthog-js/react";
import { Suspense, useEffect } from "react";
import type { ReactNode } from "react";
import { usePathname, useSearchParams } from "next/navigation";

if (typeof window !== "undefined" && process.env.NEXT_PUBLIC_POSTHOG_KEY) {
  posthog.init(process.env.NEXT_PUBLIC_POSTHOG_KEY, {
    api_host: process.env.NEXT_PUBLIC_POSTHOG_HOST,
    person_profiles: "identified_only",
    capture_pageview: false,   // disable auto: pageviews are captured manually below
    capture_pageleave: true,
  });
}

function PageViewTracker() {
  const pathname = usePathname();
  const searchParams = useSearchParams();
  const ph = usePostHog();

  useEffect(() => {
    if (pathname && ph) {
      let url = window.origin + pathname;
      const query = searchParams.toString();
      if (query) url += `?${query}`;
      ph.capture("$pageview", { $current_url: url });
    }
  }, [pathname, searchParams, ph]);

  return null;
}

export function PostHogProvider({ children }: { children: ReactNode }) {
  if (!process.env.NEXT_PUBLIC_POSTHOG_KEY) return <>{children}</>;
  return (
    <PHProvider client={posthog}>
      {/* useSearchParams() needs a Suspense boundary; without one Next.js can
          fail the production build or force the route into client rendering */}
      <Suspense fallback={null}>
        <PageViewTracker />
      </Suspense>
      {children}
    </PHProvider>
  );
}

Key decisions:

  • person_profiles: "identified_only": don't create person profiles for anonymous visitors. Anonymous events are still captured, but PostHog bills events without a person profile at a lower rate, so this cuts cost rather than event volume. Full profiles are only built for users who've logged in and been identified.
  • capture_pageleave: true: know when users bounce. Combined with pageviews, this gives you session duration without session replay.
  • 🛡️ Graceful degradation: if the PostHog key isn't set, the provider renders children without the SDK. No errors, no broken UI.

📱 Mobile: React Native with Autocapture

import { PostHogProvider } from 'posthog-react-native';

export function AppProviders({ children }: { children: React.ReactNode }) {
  return (
    <GestureHandlerRootView style={styles.root}>
      <PostHogProvider
        apiKey={process.env.EXPO_PUBLIC_POSTHOG_KEY}
        options={{ host: process.env.EXPO_PUBLIC_POSTHOG_HOST }}
        autocapture
      >
        <QueryProvider>
          <SafeAreaProvider>
            {children}
          </SafeAreaProvider>
        </QueryProvider>
      </PostHogProvider>
    </GestureHandlerRootView>
  );
}

The autocapture prop is the magic — it automatically captures screen views, button taps, and navigation events without any manual instrumentation. For the 80% of analytics questions ("which screen do users visit most?", "where do they drop off in onboarding?"), autocapture is enough.

🚩 Feature Flags for Mobile Promo Badges

This was a specific client requirement: push promotional badges to the mobile app without going through App Store review. 🎁

In PostHog, create a feature flag promo-summer-sale with a JSON payload:

{ "badge": "20% OFF", "color": "#FF6B35", "expires": "2026-08-31" }

In the mobile app:

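import { View } from 'react-native';
import { useFeatureFlagWithPayload } from 'posthog-react-native';

type PromoPayload = { badge: string; color: string; expires: string };

function ListingCard({ listing }) {
  // Returns [flagValue, payload]; both are undefined until flags have loaded
  const [showPromo, payload] = useFeatureFlagWithPayload('promo-summer-sale');
  const promo = payload as PromoPayload | undefined;

  return (
    <View>
      {/* Badge is the app's own component; guard on promo so a flag
          without a payload can't crash the card */}
      {showPromo && promo && (
        <Badge color={promo.color}>{promo.badge}</Badge>
      )}
      {/* rest of card */}
    </View>
  );
}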

Toggle the flag in PostHog's dashboard — the badge appears or disappears for all users within minutes. No app update, no App Store review, no deployment. That's the kind of operational leverage that separates "I shipped a feature" from "I can run a business." 🎯
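The payload above carries an expires date, and since a flag payload is just JSON typed into a dashboard, it is worth validating before rendering. A minimal sketch, assuming a hypothetical `activePromo` helper and treating the promo as valid through the end of the expires day in UTC (that cutoff is an assumption, not something the article specifies):

```typescript
type Promo = { badge: string; color: string; expires: string };

// Returns the promo only if the payload has the expected shape and has not
// expired. Anything else (flag off, malformed payload, past date) yields null.
function activePromo(payload: unknown, now: Date = new Date()): Promo | null {
  if (typeof payload !== 'object' || payload === null) return null;
  const { badge, color, expires } = payload as Record<string, unknown>;
  if (typeof badge !== 'string' || typeof color !== 'string' || typeof expires !== 'string') {
    return null;
  }
  // Assumed cutoff: the last millisecond of the `expires` day, in UTC.
  const end = Date.parse(`${expires}T23:59:59.999Z`);
  if (Number.isNaN(end) || now.getTime() > end) return null;
  return { badge, color, expires };
}
```

With this in place the component can render the badge only when `activePromo(payload)` returns a value, so a forgotten flag stops showing a stale "20% OFF" badge once the date passes, even if nobody turns the flag off.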

💰 Cost: Free (1M events/month, 5K session recordings/month).


🔄 Layer 5: API Resilience — Client-Side Retry

Server errors happen. Deployments cause brief 503s. Network blips happen on mobile (especially on African 3G networks). The question is: does the user see an error, or does the client silently retry?

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

const SAFE_METHODS = new Set(['GET', 'HEAD', 'OPTIONS']);

const fetchWithRetry = async (
  url: string,
  options: RequestInit,
  retries = 3,
): Promise<Response> => {
  const method = (options.method ?? 'GET').toUpperCase();

  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url, options);
      // Only retry on server errors AND only for safe (read-only) methods
      if (res.status >= 500 && attempt < retries && SAFE_METHODS.has(method)) {
        await sleep(Math.pow(2, attempt) * 1000); // 1s, 2s, 4s
        continue;
      }
      return res;
    } catch (error) {
      // Network errors: retry safe methods only
      if (attempt === retries || !SAFE_METHODS.has(method)) throw error;
      await sleep(Math.pow(2, attempt) * 1000);
    }
  }
  throw new Error('fetch failed after retries');
};

This wraps every API call in the web app. The retry logic is invisible to the rest of the codebase — apiFetch calls fetchWithRetry instead of fetch, and everything else stays the same.

Why only safe methods? Retrying a POST /bookings or POST /payments on a 503 could create duplicate charges. GET, HEAD, and OPTIONS are defined as safe (read-only) methods, so retrying them should not cause side effects. PUT and DELETE are idempotent by spec too, but I leave them out to keep the rule simple: for every mutation, the client surfaces the error immediately and lets the user decide.

Why exponential backoff? If the server is overloaded, retrying immediately makes it worse. Waiting 1s, then 2s, then 4s gives the server time to recover. If it doesn't recover in ~7 seconds total, the error surfaces to the user.
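The "~7 seconds" figure falls straight out of the delay formula used in fetchWithRetry. A small sketch that makes the arithmetic explicit; the two function names are hypothetical and exist only for this illustration:

```typescript
// Delays slept between attempts: baseMs * 2^attempt, one entry per retry.
function backoffSchedule(retries = 3, baseMs = 1000): number[] {
  return Array.from({ length: retries }, (_, attempt) => baseMs * Math.pow(2, attempt));
}

// Worst-case time spent sleeping before the error reaches the user
// (request time itself comes on top of this).
function totalBackoffMs(retries = 3, baseMs = 1000): number {
  return backoffSchedule(retries, baseMs).reduce((sum, ms) => sum + ms, 0);
}
```

With the defaults this gives delays of 1000, 2000, and 4000 ms, for 7000 ms in total. Adding one more retry would more than double the worst case to 15 seconds, which is why the retry count stays small.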

Why only on 5xx? A 400 (bad request) or 401 (unauthorized) won't succeed on retry — the request itself is wrong. Only server errors (500, 502, 503) and network failures are worth retrying.

I also extended TanStack Query's gcTime from 5 minutes to 15 minutes:

const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      gcTime: 15 * 60 * 1000, // 15 minutes
    },
  },
});

This means navigating back to a previously-loaded page shows cached data instantly while refetching in the background. For a marketplace where listings don't change by the second, 15 minutes of client cache is perfectly acceptable.

💰 Cost: $0. It's under 30 lines of code.


🔌 Layer 6: Graceful Shutdown — The Deploy That Doesn't Kill Connections

Without graceful shutdown, deploying a new version works like this:

  1. App Runner starts new container
  2. App Runner sends SIGTERM to old container
  3. Old container dies immediately 💀
  4. Active HTTP requests get connection reset
  5. Socket.IO connections drop
  6. Database connections leak (pool not drained)
  7. Users see errors for 5-10 seconds

With graceful shutdown:

let shuttingDown = false;

const shutdown = async (signal: string) => {
  if (shuttingDown) return; // ignore repeated signals (pool.end() must run once)
  shuttingDown = true;
  console.log(`${signal} received, starting graceful shutdown...`);

  // 1️⃣ Arm the force-exit timer first, so a hung cleanup step can't block exit
  setTimeout(() => {
    console.error('[shutdown] Forced exit after timeout');
    process.exit(1);
  }, 30_000).unref();

  // 2️⃣ Stop accepting new HTTP connections; resolves once in-flight requests finish
  const httpClosed = new Promise<void>((resolve) => {
    server.close(() => {
      console.log('[shutdown] HTTP server closed');
      resolve();
    });
  });

  // 3️⃣ Close WebSocket connections (clients auto-reconnect)
  const io = getIo();
  if (io) {
    io.close(() => {
      console.log('[shutdown] Socket.IO server closed');
    });
  }

  // 4️⃣ Drain the database pool only after in-flight requests are done with it
  await httpClosed;
  if (pool) {
    await pool.end();
    console.log('[shutdown] DB pool drained');
  }
};

process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));

The .unref() on the timeout is critical — without it, the timeout itself keeps the event loop alive, preventing the process from exiting naturally if cleanup finishes before 30 seconds.

💰 Cost: $0. Zero-downtime deploys shouldn't cost money.


📡 Layer 7: Uptime Monitoring — Better Stack Over AWS CloudWatch

CloudWatch can monitor endpoints, but configuring alarms requires navigating a maze of SNS topics, CloudWatch Alarm configurations, and IAM permissions. Better Stack (formerly Better Uptime) does the same thing in 30 seconds.

I set up three monitors:

  • 🏥 Backend health: GET /health every 3 minutes — catches server crashes, DB disconnects, cache failures
  • 📡 API response: GET /listings/public?limit=1 every 3 minutes — catches application-level issues (broken queries, ORM errors)
  • 🌐 Web homepage: GET https://your-domain.com every 3 minutes — catches DNS, SSL, and Vercel deployment issues

The health endpoint itself is a powerful diagnostic tool:

router.get('/', async (_req, res) => {
  const dbHealth = await getDbHealth();
  const cacheHealth = await getCacheHealth();
  const mem = process.memoryUsage();

  res.json({
    status: 'ok',
    db: dbHealth,          // ✅ or ❌
    cache: cacheHealth,    // ✅ or ⏭️ (skipped if not configured)
    memory: {
      rss: Math.round(mem.rss / 1024 / 1024),
      heapUsed: Math.round(mem.heapUsed / 1024 / 1024),
      heapTotal: Math.round(mem.heapTotal / 1024 / 1024),
    },
    uptime: Math.round(process.uptime()),
    timestamp: new Date().toISOString(),
  });
});

Is the database connected? Is Redis connected? How much memory is the process using? How long has it been running? One curl gives you the full picture. No log diving, no dashboard hopping.
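One caveat with the route as shown: it always answers 200 with status 'ok', and uptime monitors typically alert on the HTTP status code, so a dead database could go unnoticed. A minimal sketch of deriving the status from the dependency checks follows. The article does not show what getDbHealth() or getCacheHealth() return, so the `ok` and `skipped` fields and the `overallHealth` name are assumptions.

```typescript
// Assumed result shape for a dependency check (hypothetical fields).
type DependencyHealth = { ok: boolean; skipped?: boolean };

type Overall = { httpStatus: 200 | 503; status: 'ok' | 'degraded' | 'down' };

// The database is required, so its failure makes the service unavailable.
// The cache is optional, so its failure only marks the service as degraded.
function overallHealth(db: DependencyHealth, cache: DependencyHealth): Overall {
  if (!db.ok) return { httpStatus: 503, status: 'down' };
  const cacheOk = cache.ok || cache.skipped === true;
  if (!cacheOk) return { httpStatus: 200, status: 'degraded' };
  return { httpStatus: 200, status: 'ok' };
}
```

In the route, this would replace the hard-coded status with something like `res.status(httpStatus).json({ status, ... })`. Returning 503 only for the database keeps the monitor from paging anyone over a cache blip the app is designed to survive.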

💰 Cost: $0 (free tier: 10 monitors, 3-minute intervals).


📱 Layer 8: Mobile Performance — The Stuff Users Feel

Mobile is where infrastructure decisions become visceral. A 200ms delay on web is invisible. A 200ms delay on mobile feels laggy.

🖼️ Image Caching with expo-image

Every <Image> component that loads from the network should cache to disk:

<Image
  source={{ uri: listing.photos[0].url }}
  style={styles.image}
  cachePolicy="memory-disk"
/>

memory-disk means: check memory cache first (instant ⚡), then disk cache (fast 💨), then network (slow 🐌). On revisit, images load in <16ms instead of 200-800ms.

I applied this to every expo-image instance across the entire app — explore screen, listing detail, account, lifestyle, wishlist, checkout. It's a one-prop-per-image change that users immediately feel.

🗜️ Response Compression

import compression from 'compression';
app.use(compression());

One line. JSON responses that were 50KB now transmit as 8KB. On 3G connections (common in the target market), this is the difference between a 2-second load and a 0.3-second load. For users in Lagos on MTN, that one line of code is worth more than any fancy architecture diagram.


🌍 Layer 9: CDN — CloudFront for Asset Delivery

This is where AWS genuinely earns its keep. No other CDN integrates with S3 as seamlessly.

Before CloudFront, every listing card in a search result triggered a presigned URL generation on the backend. With 20 listings and 3 photos each, that's 60 S3 getSignedUrl calls per page load — each generating a cryptographic signature. That's CPU time on every request for assets that don't change.

CloudFront replaces all of them with static URLs:

# āŒ Before: backend generates signed URL per request
https://my-bucket.s3.eu-north-1.amazonaws.com/photos/abc123.jpg?X-Amz-Signature=...&X-Amz-Expires=3600

# āœ… After: static CDN URL, cached at edge for 24 hours
https://d1xxxxx.cloudfront.net/photos/abc123.jpg

Setup:

  1. Create a CloudFront distribution with your S3 bucket as the origin
  2. Use Origin Access Control (OAC) — not the legacy OAI. OAC uses IAM-based policies and supports SSE-KMS
  3. Update the S3 bucket policy to allow only CloudFront access (S3 is no longer public)
  4. Set cache TTL to 24 hours for images (they don't change after upload)
  5. Enable CORS headers for cross-origin image loading
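Step 3 is the one that is easiest to get wrong, so here is a sketch of the bucket policy shape AWS documents for OAC: it allows `s3:GetObject` to the CloudFront service principal only when the request comes from your distribution. The helper name `buildOacBucketPolicy` and the example bucket, account ID, and distribution ID are placeholders, not values from this project.

```typescript
interface OacPolicyInput {
  bucketName: string;
  accountId: string;      // 12-digit AWS account ID
  distributionId: string; // CloudFront distribution ID
}

// Builds a bucket policy document that lets only one CloudFront
// distribution read objects; paste the JSON into the S3 bucket policy.
function buildOacBucketPolicy(input: OacPolicyInput) {
  const { bucketName, accountId, distributionId } = input;
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Sid: 'AllowCloudFrontServicePrincipalReadOnly',
        Effect: 'Allow',
        Principal: { Service: 'cloudfront.amazonaws.com' },
        Action: 's3:GetObject',
        Resource: `arn:aws:s3:::${bucketName}/*`,
        Condition: {
          StringEquals: {
            'AWS:SourceArn': `arn:aws:cloudfront::${accountId}:distribution/${distributionId}`,
          },
        },
      },
    ],
  };
}

// Placeholder values for illustration only.
const examplePolicy = buildOacBucketPolicy({
  bucketName: 'my-bucket',
  accountId: '111122223333',
  distributionId: 'EDFDVBD6EXAMPLE',
});
console.log(JSON.stringify(examplePolicy, null, 2));
```

The `AWS:SourceArn` condition is what stops another account's CloudFront distribution from reading your bucket.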

The backend now returns CDN URLs instead of presigned URLs for public assets:

// āŒ Before (AWS SDK v3 presigned URLs — generated per-request)
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';
import { GetObjectCommand } from '@aws-sdk/client-s3';
const url = await getSignedUrl(s3Client, new GetObjectCommand({ Bucket, Key }), { expiresIn: 3600 });

// āœ… After — for public images (listing photos, avatars)
const url = `https://${process.env.CLOUDFRONT_DOMAIN}/${key}`;

Private assets (identity documents, payout details) still use presigned S3 URLs. Those should never be cached at the edge. 🔒
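One way to keep the public/private split from leaking across the codebase is a single helper that decides which kind of URL to return. This is a sketch under my own assumptions, not the shipped code: `resolveAssetUrl`, `Visibility`, and `AssetUrlDeps` are hypothetical names, and `presign` stands in for a wrapper around `getSignedUrl`.

```typescript
type Visibility = 'public' | 'private';

interface AssetUrlDeps {
  cloudfrontDomain: string;                  // e.g. the value of CLOUDFRONT_DOMAIN
  presign: (key: string) => Promise<string>; // e.g. a wrapper around getSignedUrl
}

async function resolveAssetUrl(
  key: string,
  visibility: Visibility,
  deps: AssetUrlDeps,
): Promise<string> {
  // Private assets always get a short-lived presigned URL, never a CDN URL.
  if (visibility === 'private') return deps.presign(key);

  // Encode each path segment but keep the slashes between them.
  const path = key
    .split('/')
    .map((segment) => encodeURIComponent(segment))
    .join('/');
  return `https://${deps.cloudfrontDomain}/${path}`;
}
```

Making visibility a required argument means a caller has to state the intent for every asset, so a private document cannot end up on the CDN by default.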

📊 Result: Listing pages went from 60 backend getSignedUrl calls to zero. Images load from the nearest edge location (Lagos, Cape Town, Johannesburg, Nairobi) instead of making a round trip to eu-north-1. First-byte time for images dropped from ~400ms to ~50ms for users in West Africa.

💰 Cost: First 1TB/month of transfer is free (Always Free tier). After that, ~$0.110/GB for Africa edge locations ($0.085/GB for US/EU). For a marketplace with 10K users serving images from Lagos and Johannesburg, you're well within the free tier.
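A quick way to sanity-check your own numbers is to estimate monthly transfer and subtract the free allowance. The rates below are the ones quoted above; the traffic profile (50 image views per user per month at 200 KB each) is a made-up assumption, and I treat 1 TB as 1,024 GB. `cloudFrontTransferCost` is a hypothetical helper that ignores per-request charges.

```typescript
const FREE_TRANSFER_GB = 1024;     // 1 TB/month free, treated as 1,024 GB
const AFRICA_RATE_PER_GB = 0.11;   // ~$0.110/GB for Africa edge locations

// Data-transfer cost only; per-request charges are not modelled.
function cloudFrontTransferCost(
  monthlyGB: number,
  ratePerGB: number = AFRICA_RATE_PER_GB,
): number {
  const billableGB = Math.max(0, monthlyGB - FREE_TRANSFER_GB);
  return billableGB * ratePerGB;
}

// Hypothetical profile: 10,000 users x 50 image views x 200 KB each.
const estimatedMonthlyGB = (10_000 * 50 * 200) / (1024 * 1024);

console.log({
  estimatedMonthlyGB: Math.round(estimatedMonthlyGB),
  transferCostUSD: cloudFrontTransferCost(estimatedMonthlyGB),
});
```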


💰 The Cost Math

Here's what this entire infrastructure costs at different user scales:

| Service | 100 users | 10K users | 100K users |
|---|---|---|---|
| ⚡ Upstash Redis | $0 | $0 | ~$10/mo |
| 🚨 Sentry | $0 | $0 | $0* |
| 📊 PostHog | $0 | $0 | ~$0 (1M events free)** |
| 📡 Better Stack | $0 | $0 | $0 |
| 🌍 CloudFront | $0 | ~$2/mo | ~$15/mo |
| 🖥️ App Runner | ~$5/mo (idle) | ~$15/mo | ~$50/mo |
| 🌐 Vercel | $20/mo (Pro) | $20/mo (Pro) | $20/mo (Pro) |
| Total | ~$25/mo | ~$37/mo | ~$95/mo |

*May need paid tier at 100K+ if error volume exceeds 5K/month

**PostHog's free tier covers 1M events/month; an active marketplace at 100K users will likely exceed that and need a paid tier (low tens of dollars/month).
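The totals row is just the column sums, which makes it easy to re-run when a price changes. A small sketch using the figures from the table (the `costRows` structure and `columnTotals` helper are my own naming):

```typescript
interface CostRow {
  service: string;
  // Monthly USD at [100 users, 10K users, 100K users]
  monthly: [number, number, number];
}

const costRows: CostRow[] = [
  { service: 'Upstash Redis', monthly: [0, 0, 10] },
  { service: 'Sentry', monthly: [0, 0, 0] },
  { service: 'PostHog', monthly: [0, 0, 0] },
  { service: 'Better Stack', monthly: [0, 0, 0] },
  { service: 'CloudFront', monthly: [0, 2, 15] },
  { service: 'App Runner', monthly: [5, 15, 50] },
  { service: 'Vercel', monthly: [20, 20, 20] },
];

// Sums each user-scale column across all services.
function columnTotals(rows: CostRow[]): [number, number, number] {
  const totals: [number, number, number] = [0, 0, 0];
  for (const row of rows) {
    totals[0] += row.monthly[0];
    totals[1] += row.monthly[1];
    totals[2] += row.monthly[2];
  }
  return totals;
}

console.log(columnTotals(costRows)); // [ 25, 37, 95 ]
```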

Note on Vercel pricing: Vercel's free Hobby tier is for non-commercial, personal projects only. Any client work or revenue-generating app needs the Pro plan ($20/mo). I start on Pro from day one to stay compliant with their ToS — it's worth it for the deployment experience and Next.js optimizations.

Compare this to an all-AWS stack: ElastiCache (~$12/mo), CloudWatch custom metrics + alarms (~$10/mo), X-Ray (~$5/mo at low volume), Amplify Hosting ($15/mo), plus a self-hosted analytics solution (engineering time + hosting). You're looking at ~$60-80/month minimum at 100 users, with significantly more operational overhead. At 100K users, the gap widens further, because provisioned instances are billed whether or not you use them and have to be sized up in fixed steps as load grows. 📊

The key insight: serverless and pay-per-use services cost nothing at low traffic. You don't pay for capacity you don't use. When traffic grows, costs grow linearly — not in $50/month steps of provisioned instances.
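The difference between the two pricing shapes is easy to see in code. Every number here is hypothetical (capacity per instance, free allowance, unit price); only the $50 step size echoes the figure above. `provisionedCost` and `payPerUseCost` are illustrative helpers, not any vendor's pricing formula.

```typescript
// Provisioned: you pay for whole instances, used or not (at least one).
function provisionedCost(
  requestsPerMonth: number,
  capacityPerInstance: number,
  pricePerInstance: number,
): number {
  const instances = Math.max(1, Math.ceil(requestsPerMonth / capacityPerInstance));
  return instances * pricePerInstance;
}

// Pay-per-use: nothing inside the free allowance, then linear in usage.
function payPerUseCost(
  requestsPerMonth: number,
  freeRequests: number,
  pricePerMillion: number,
): number {
  const billable = Math.max(0, requestsPerMonth - freeRequests);
  return (billable / 1_000_000) * pricePerMillion;
}

for (const requests of [100_000, 1_500_000, 5_000_001]) {
  console.log(requests, {
    provisioned: provisionedCost(requests, 5_000_000, 50),
    payPerUse: payPerUseCost(requests, 500_000, 2),
  });
}
```

With these made-up inputs the provisioned bill starts at $50 and jumps to $100 one request past the capacity limit, while the pay-per-use bill starts at $0 and climbs smoothly.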


šŸ¤ Why Multi-Vendor Beats Single-Vendor

The client was an AWS person. Every problem looked like it had an AWS solution. And technically, AWS can do everything I did. But "can" and "should" are different.

| Need | AWS Solution | What I Used | Why |
|---|---|---|---|
| ⚡ Caching | ElastiCache (~$12/mo min) | Upstash Redis ($0) | HTTP-based, no VPC config, free tier |
| 🚨 Error tracking | CloudWatch + X-Ray | Sentry ($0) | 10x better DX, automatic grouping |
| 📊 Analytics | Pinpoint + custom | PostHog ($0) | Heatmaps, replay, flags: all in one |
| 📡 Monitoring | CloudWatch Alarms | Better Stack ($0) | 30 seconds to set up vs. 30 minutes |
| 🌍 CDN | CloudFront | CloudFront | AWS wins here: best CDN for S3 |
| 🖥️ Compute | App Runner | App Runner | AWS wins here: right tool for the job |
| 🌐 Frontend | Amplify | Vercel | Better Next.js support, faster deploys |

AWS won where AWS is genuinely best (CDN, compute). But for developer tooling (error tracking, analytics, monitoring), purpose-built SaaS products are years ahead of AWS's bolted-on solutions.

The multi-vendor approach also reduces blast radius. If Upstash has an outage, your caching degrades but the app works. If Sentry goes down, you stop getting error reports but users are unaffected. No single vendor failure takes down the entire platform. That's not just good architecture; it's good business continuity. 🛡️
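"Caching degrades but the app works" only holds if cache errors are caught rather than propagated. Here is a minimal sketch of that pattern; `CacheClient` and `getWithCacheFallback` are hypothetical names and the interface is a simplification, not the Upstash client's API.

```typescript
// Simplified cache interface (an assumption, not a real client's API).
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

async function getWithCacheFallback<T>(
  cache: CacheClient,
  key: string,
  loadFromDb: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return JSON.parse(hit) as T;
  } catch {
    // Cache outage: treat it as a miss and fall through to the database.
  }

  const fresh = await loadFromDb();

  try {
    await cache.set(key, JSON.stringify(fresh));
  } catch {
    // Failing to populate the cache must not fail the request.
  }
  return fresh;
}
```

During an outage every request pays the database cost, so the app is slower but still correct. In practice you would also report the swallowed errors to your error tracker.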


🎯 This Is Not One-Size-Fits-All

Every decision here was shaped by context:

  • šŸŒ Target market is Nigeria/Africa — 3G connections are common, so compression and CDN matter more than in markets with ubiquitous 5G
  • šŸ’° Startup budget — free tiers are load-bearing infrastructure, not nice-to-have
  • šŸ“±šŸŒšŸ–„ļø Three platforms — any tool that doesn't support web, mobile, and backend was automatically disqualified
  • šŸ‘¤ Solo developer — operational simplicity beats configurability. I chose tools that work with one API key, not tools that need IAM policies

If you're a VC-funded company with $500K ARR, sure, use Datadog and LaunchDarkly and a dedicated DevOps engineer. But if you're bootstrapping a marketplace in an emerging market, this stack gets you from 10 to 100,000 users without re-architecting and without burning runway on infrastructure.

The best infrastructure is the one you forget exists because it just works. 🚀


🔮 What's Next

The infrastructure is production-ready. Three things on the near-term roadmap:

  1. 📨 Upstash QStash: background job processing for emails, push notifications, and payout processing. Currently these run synchronously in the request cycle. QStash calls your backend via HTTP on a schedule, so there is no worker process and no message broker, just another serverless service that costs nothing at low volume.
  2. 👤 PostHog user identification: calling posthog.identify() after login to link anonymous events to real users. This unlocks cohort analysis ("users who booked in Week 1 vs. Week 2") and per-user session replays for support debugging.
  3. 🗄️ Database read replicas: when query volume outgrows a single PostgreSQL instance (the threshold depends on instance size, query complexity, and caching effectiveness, so monitor your connection utilization and query latency), add a read replica and route read-heavy queries to it. The ORM and connection pool are already structured to make this a config change, not a rewrite.

The foundation is set. Everything from here is incremental. 📈


All code samples are from a production codebase serving real users across multiple African cities.

Stack: Node.js + Express 5, Next.js 15, React Native (Expo), PostgreSQL, Drizzle ORM, Socket.IO, AWS S3 + App Runner, Vercel, Upstash Redis, Sentry, PostHog, Better Stack.


šŸ“ Fact-Check Changelog

Corrections applied after publication review (verified via official docs and pricing pages):

| # | Original Claim | Correction | Source |
|---|---|---|---|
| 1 | Upstash free tier: "10,000 commands/day" | 500K commands/month | Upstash pricing |
| 2 | ElastiCache min: "$15/mo for cache.t3.micro" | ~$12/mo for cache.t4g.micro (current smallest) | ElastiCache pricing |
| 3 | Better Stack free: "5 monitors" | 10 monitors | Better Stack uptime |
| 4 | CloudFront Africa egress: "$0.085/GB" | $0.110/GB for Africa ($0.085 is US/EU) | CloudFront pricing |
| 5 | Mixpanel: no heatmaps, no session replay | Mixpanel added both in 2025 (10K replays free/mo) | Mixpanel session replay docs |
| 6 | Amplitude: no feature flags, no session replay | Amplitude has both (flags free up to 50K MTUs) | Amplitude feature flags |
| 7 | cacheDel('public-listings:*') implied DEL supports globs | Redis DEL only accepts exact keys; need SCAN+DEL | Redis DEL docs |
| 8 | N+1 fix: per-conversation COUNT query | Still N+1; rewritten to single GROUP BY query | - |
| 9 | fetchWithRetry retried all methods | Dangerous for POST (payments/bookings); now limited to safe methods only | HTTP idempotency RFC 7231 |
| 10 | ElastiCache "needs VPC peering" | App Runner uses VPC Connector (not VPC peering) | App Runner VPC docs |
| 11 | Vercel Hobby tier for commercial app | Hobby is non-commercial only per ToS; use Pro ($20/mo) | Vercel fair use policy |
| 12 | AWS SDK v2 s3.getSignedUrl('getObject', ...) | v2 is deprecated; v3 uses getSignedUrl from @aws-sdk/s3-request-presigner | AWS SDK v3 docs |
| 13 | "All-AWS stack: $150+/month minimum" | More realistic: ~$60-80/mo at low traffic | Calculated from individual service pricing pages |
| 14 | Sentry note implied errors + transactions share one 5K quota | They're separate buckets: 5K errors plus a distinct 10K performance-units quota/month; tracesSampleRate affects only the performance-units quota | Sentry pricing |
| 15 | "Scale up until ~50K users" | Threshold is workload-dependent, not a fixed user count | - |
| 16 | Indexes section showed 9 naive single-column indexes | Replaced with actual shipped indexes: 1 partial + 2 composite (replacing what would otherwise be 8 single-column indexes). Includes EXPLAIN ANALYZE benchmarks (buffer reads, timing) | Validated on 20K-listing synthetic dataset |
| 17 | N+1 fix showed readAt/senderId filter that doesn't match schema | Rewritten to match actual implementation: DISTINCT ON for last message + JOIN+GROUP BY for unread counts using lastReadAt timestamps | Correctness-tested across user/host/admin roles |
