I recently took a client's full-stack platform from "it works on my laptop" to production-grade infrastructure that can handle 100,000 users, spanning a Node.js backend, Next.js web app, and React Native mobile app.
The client was an AWS-first developer. His instinct was to solve everything with AWS services. But the reality of a startup is that you need to move fast, keep costs near zero at launch, and scale without re-architecting. I convinced him to go multi-vendor: picking the best tool for each job instead of forcing everything into one vendor's ecosystem and paying for features he didn't need.
This isn't a "use these 7 services" listicle. This is the real decision-making process behind each choice, the actual code I shipped, and the cost math that made it work.
The Starting Point
The platform had:
- Backend: Express 5 + TypeScript + PostgreSQL (Drizzle ORM), deployed on AWS App Runner
- Web: Next.js (App Router) + React 19 + Tailwind, deployed on Vercel
- Mobile: React Native (Expo) + TypeScript
What was missing or broken:
- No caching layer (every request hit the database)
- No error tracking (errors disappeared into CloudWatch logs nobody checked)
- No uptime monitoring (I'd learn about downtime from users)
- No analytics (zero insight into what users actually do)
- No CDN for assets (S3 presigned URLs generated per-request)
- No graceful shutdown (deploys killed active connections)
- No API resilience (one server hiccup = failed request)
- N+1 query patterns in critical paths
- No pagination on endpoints returning entire tables
The backend was doing real work: 14 database tables, JWT auth with OTP, real-time messaging via Socket.IO, payment processing, S3 file uploads. But it was held together with hope. One traffic spike and the database pool would exhaust, the server would run out of memory and crash, and nobody would know until a user complained on WhatsApp.
Scale Up vs. Scale Out: I Chose Both
Before touching any code, I had to answer the fundamental question: when traffic grows 10x, do you make the boxes bigger (scale up) or add more boxes (scale out)?
The answer: I architected for scale-out, but the database scales up first.
| Layer | Strategy | Why |
|---|---|---|
| Backend (App Runner) | Scale out | Auto-adds containers based on concurrent requests. No instance to resize. |
| Web (Vercel) | Scale out | Serverless functions: each request gets isolated compute. |
| Cache (Upstash Redis) | Scale out | Serverless, per-request pricing. No nodes to manage. |
| CDN (CloudFront) | Scale out | 400+ edge locations. Distributed by design. |
| Database (PostgreSQL) | Scale up, then out | Bigger instance first, then read replicas when query load demands it. |
| Mobile | N/A | Runs on user devices. I optimize, not scale. |
The principle: make the stateless layers horizontally scalable, and protect the single stateful layer (the database) with caching and query optimization so it doesn't need to scale until much later.
This is why caching was task #1, not #5.
Layer 1: Database - The Foundation Nobody Sees
Before adding any external service, I fixed the database layer. The fastest cache is a query that doesn't run.
Indexes Matched to Actual Query Shapes
The listings table is the most-read table in the system. Every homepage load, every search, every filter hits it. But there were zero indexes beyond the primary key.
Naively, you'd slap a single-column index on every WHERE clause column. I did that first, then looked at the actual queries and realized most of those indexes would never be used. A status index on a column with only 4 possible values? Postgres will seq-scan that anyway. A city index when every query uses LIKE '%lagos%'? B-tree can't help there.
Instead, I matched indexes to the actual query shapes:
-- Partial index: only indexes published listings (80% of queries)
-- Eliminates both the filter AND the sort step for the public feed
CREATE INDEX listings_published_feed_idx
ON listings (published_at DESC)
WHERE status = 'published';
-- Composite indexes: lead column for equality, second for sort
-- "Show me host 7's bookings, newest first": one index scan, no sort
CREATE INDEX bookings_host_date_idx
ON bookings (host_id, start_date DESC);
CREATE INDEX bookings_user_date_idx
ON bookings (user_id, start_date DESC);
The partial index is the big win. The public feed query (WHERE status = 'published' ORDER BY published_at DESC LIMIT 20) went from scanning every row in the table to reading exactly 20 index entries. I benchmarked on a 20K-row synthetic dataset: buffer reads dropped from 649 to 22 (the database touches 97% fewer pages), and the sort step disappears entirely because Postgres walks the index in order instead of sorting in memory. Wall-clock time went from 242ms to 0.14ms, though in production the public feed is cached with a 2-minute TTL so most requests never hit this query at all. The real win isn't a raw timing multiplier; it's that the query stops scaling with table size. The composite booking indexes showed the same pattern: 903 to 22 buffer reads for host bookings, 900 to 22 for guest bookings.
These 3 indexes replace what would otherwise be 8 single-column indexes (one per filtered column). The design avoids redundant indexes rather than piling them on. Every index costs storage and a per-row write penalty, so the skill is matching the index to the query shape, not maximizing index count.
Cost: $0 on the bill, but not free in overhead: every index adds storage and a write penalty per row insert/update. That's exactly why I used 3 targeted indexes instead of 8 redundant single-column ones. For a read-heavy marketplace the tradeoff is worth it, especially since the highest-write table (messages) is insert-only, the most index-friendly write pattern there is.
Connection Pool Tuning
The default pg pool ships with max: 10 connections. For a server handling concurrent API requests and Socket.IO connections, that's a bottleneck waiting to happen.
export const pool = new Pool({
connectionString: config.db.url,
max: 20, // doubled from default
idleTimeoutMillis: 30_000, // release idle connections after 30s
connectionTimeoutMillis: 5_000, // fail fast if pool is exhausted
});
Why 20 and not 100? Because PostgreSQL itself has a max_connections limit (typically 100 on small managed instances, configurable higher on larger ones). If you have 3 App Runner containers each holding 20 connections, you're at 60, safely under the limit with headroom for admin connections. Going higher risks connection exhaustion at the database level, which is much harder to debug than a pool timeout. As you scale past ~5-10 App Runner instances, consider adding a connection pooler like PgBouncer or Supabase's built-in pooler to multiplex connections.
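The connection budget above is plain arithmetic. Here is a minimal sketch of it; the helper names and the default of 10 connections reserved for admin use are illustrative assumptions, not values from the project:

```typescript
// Hypothetical helpers for the connection budget described above.

// Total connections held when every container fills its pool.
function totalConnections(containers: number, poolSize: number): number {
  return containers * poolSize;
}

// How many containers fit under the database's max_connections,
// after reserving some headroom for admin sessions and migrations.
function maxContainers(
  maxConnections: number,
  poolSizePerContainer: number,
  reservedForAdmin = 10, // assumed headroom, tune for your setup
): number {
  const usable = maxConnections - reservedForAdmin;
  if (usable <= 0 || poolSizePerContainer <= 0) return 0;
  return Math.floor(usable / poolSizePerContainer);
}

console.log(totalConnections(3, 20)); // 60
console.log(maxContainers(100, 20)); // 4
```

With a 100-connection limit and pools of 20, the fifth container is already the point where the database can run out of connections, which is why a pooler becomes worth considering around that scale.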
Pagination: Stop Returning Entire Tables
The public listings endpoint was doing SELECT * FROM listings WHERE status = 'published'. For 50 listings, fine. For 5,000? You're serializing megabytes of JSON per request.
router.get('/', async (req, res) => {
const limit = Math.min(Math.max(Number(req.query.limit) || 20, 1), 50);
const offset = Math.max(Number(req.query.offset) || 0, 0);
// Count total for pagination metadata
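// ::int makes Postgres return a number; a bare count(*) is a bigint, which node-postgres returns as a string
const [{ count: total }] = await db
.select({ count: sql<number>`count(*)::int` })
.from(listings)
.where(eq(listings.status, 'published'));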
const rows = await db
.select()
.from(listings)
.where(eq(listings.status, 'published'))
.orderBy(desc(listings.publishedAt))
.limit(limit)
.offset(offset);
res.json({
listings: rows,
total,
limit,
offset,
hasMore: offset + rows.length < total,
});
});
Why limit/offset and not cursor-based? For a marketplace listing page, users jump to arbitrary pages ("show me page 7"). Cursor-based pagination is better for infinite scroll (which the mobile app does), but the web app needs random access. I cap at 50 items per page as the safety valve.
Caveat: OFFSET pagination degrades on deep pages. OFFSET 10000 means the database still scans and discards 10,000 rows. For a marketplace with <10K published listings, this is fine. If listings grow past that, I'd switch the mobile app's infinite scroll to cursor-based (WHERE published_at < :lastSeen ORDER BY published_at DESC LIMIT 20) while keeping offset for the web's page-jump UI. The count(*) also runs on every uncached request. For now the Redis cache absorbs this, but at scale I'd consider a materialized count or remove exact totals in favor of "has more" pagination.
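To make the cursor idea concrete, here is a minimal in-memory sketch of keyset pagination. It is not the project's code: the `Listing` shape, the `pageAfter` helper, and the `timestamp|id` cursor format are assumptions for illustration. It adds `id` as a tie-breaker, since two listings can share a `published_at` value and a timestamp-only cursor would skip one of them.

```typescript
// Hypothetical row shape; timestamps are ISO-8601 UTC strings in one format,
// so plain string comparison matches chronological order.
interface Listing {
  id: number;
  publishedAt: string;
}

interface Cursor {
  publishedAt: string;
  id: number;
}

// Cursor format "timestamp|id" is an assumption made for this sketch.
const encodeCursor = (c: Cursor): string => `${c.publishedAt}|${c.id}`;
const decodeCursor = (token: string): Cursor => {
  const sep = token.lastIndexOf("|");
  return { publishedAt: token.slice(0, sep), id: Number(token.slice(sep + 1)) };
};

// Equivalent in spirit to:
//   WHERE (published_at, id) < (:ts, :id)
//   ORDER BY published_at DESC, id DESC LIMIT :limit
function pageAfter(rows: Listing[], limit: number, token?: string) {
  const sorted = [...rows].sort(
    (a, b) => b.publishedAt.localeCompare(a.publishedAt) || b.id - a.id,
  );
  const cursor = token ? decodeCursor(token) : null;
  const remaining = cursor
    ? sorted.filter(
        (r) =>
          r.publishedAt < cursor.publishedAt ||
          (r.publishedAt === cursor.publishedAt && r.id < cursor.id),
      )
    : sorted;
  const page = remaining.slice(0, limit);
  const last = page[page.length - 1];
  const hasMore = remaining.length > page.length;
  return {
    page,
    nextCursor:
      hasMore && last
        ? encodeCursor({ publishedAt: last.publishedAt, id: last.id })
        : null,
  };
}
```

The client passes `nextCursor` back on the next request; the cost of fetching a page then no longer depends on how deep the user has scrolled.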
Killing the N+1 in Conversations
The conversations endpoint had two problems. First, it loaded every message for every conversation into memory just to find the last message and compute unread counts. For a user with 30 conversations averaging 50 messages each, that's 1,500 rows loaded to produce 30 numbers. Second, after that was partially fixed, the unread count still ran one query per conversation in a loop (N+1).
Fix #1: Last message per conversation. I replaced "load all messages, deduplicate in JS" with a single DISTINCT ON query:
-- One query returns exactly one row per conversation: the latest non-null message
SELECT DISTINCT ON (conversation_id)
conversation_id, body
FROM messages
WHERE conversation_id IN (6, 7, 8, 9, ...)
AND body IS NOT NULL
ORDER BY conversation_id, created_at DESC
Fix #2: Unread counts. I replaced the per-conversation loop with a single JOIN + GROUP BY:
// readAtCol resolves to the correct column based on caller's role
const readAtCol = auth.entityType === 'user'
? conversations.userLastReadAt
: auth.entityType === 'host'
? conversations.hostLastReadAt
: conversations.adminLastReadAt;
const unreadResults = await db
.select({
conversationId: messages.conversationId,
count: sql<number>`count(*)::int`,
})
.from(messages)
.innerJoin(conversations, eq(conversations.id, messages.conversationId))
.where(and(
inArray(messages.conversationId, conversationIds),
sql`${messages.createdAt} > coalesce(${readAtCol}, '1970-01-01'::timestamp)`
))
.groupBy(messages.conversationId);
The key design: unread tracking uses a lastReadAt timestamp per role on the conversation row (not a per-message readAt flag). The COALESCE to epoch handles the "never read" case: if lastReadAt is NULL, every message counts as unread.
Both fixes together: from N queries to 2 queries, regardless of conversation count. On 2K conversations with 100K messages, each query runs in under 1ms.
Cost: $0. Better queries are free.
Layer 2: Caching - Upstash Redis Over ElastiCache
This was the first architecture decision where I diverged from the client's AWS instinct.
The AWS answer: ElastiCache (managed Redis). Minimum cost: ~$12/month for a cache.t4g.micro running 24/7, even if you serve 10 requests a day.
What I chose: Upstash Redis. Cost: free for the first 500K commands/month, then $0.2 per 100K commands. For a startup doing 500 requests/day (~15K commands/month), that's well within the free tier indefinitely.
Why Upstash over ElastiCache:
| Factor | ElastiCache | Upstash |
|---|---|---|
| Cost at 0 traffic | ~$12/mo | $0 |
| Cost at 100K req/day | ~$12/mo | ~$6/mo |
| Provisioning | VPC, security groups, subnet groups | One API key |
| Connection model | Persistent TCP (needs VPC Connector) | HTTP REST (works from anywhere) |
| Scaling | Manual instance resize | Automatic |
The HTTP-based model is the key insight. App Runner runs in AWS's managed VPC, so connecting it to ElastiCache requires setting up a VPC Connector (additional config, and your ElastiCache must be in a VPC with compatible subnets). Upstash uses REST over HTTPS, so it works from anywhere with zero networking config.
The Cache Implementation
The cache utility is 53 lines and handles its own failures gracefully:
import { Redis } from '@upstash/redis';
let redis: Redis | null = null;
const getRedis = () => {
if (redis) return redis;
if (!config.cache.redisUrl || !config.cache.redisToken) return null;
redis = new Redis({ url: config.cache.redisUrl, token: config.cache.redisToken });
return redis;
};
export const cacheGet = async <T>(key: string): Promise<T | null> => {
const client = getRedis();
if (!client) return null;
try {
return await client.get<T>(key);
} catch {
return null; // treat errors as a cache miss; the DB still serves the request
}
};
export const cacheSet = async (
key: string, value: unknown, ttlSeconds: number
): Promise<void> => {
const client = getRedis();
if (!client) return;
try {
await client.set(key, value, { ex: ttlSeconds });
} catch {
// cache write failed; not critical, the next miss repopulates it
}
};
export const cacheDel = async (...keys: string[]): Promise<void> => {
const client = getRedis();
if (!client || keys.length === 0) return;
try {
await client.del(...keys);
} catch {
// invalidation failed; stale data expires via TTL
}
};
Three design decisions worth calling out:
- Lazy initialization: Redis client is created on first use, not at import time. If credentials aren't set, the app works normally, just without caching.
- Silent failure: Every cache operation is wrapped in try/catch that returns null or no-ops. The cache is an optimization, not a dependency. If Upstash has an outage, your app slows down but doesn't crash.
- TTL-based expiry as a safety net: Even if cacheDel fails (network blip during a write), stale data expires in 2-3 minutes via TTL. This is the cache-aside pattern: you don't need perfect invalidation if you have TTLs.
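The three decisions above can be folded into one small wrapper. This is a sketch under assumptions, not the project's utility: `CacheStore`, `withCache`, and the in-memory `MemoryStore` are hypothetical names, with the store standing in for the Redis-backed functions so the example runs on its own.

```typescript
// Hypothetical minimal store interface (stands in for the Redis client).
interface CacheStore {
  get<T>(key: string): Promise<T | null>;
  set(key: string, value: unknown, ttlSeconds: number): Promise<void>;
}

// Cache-aside: try the cache, fall back to the loader, then populate.
// Cache failures are swallowed so the loader always gets a chance to run.
async function withCache<T>(
  store: CacheStore,
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await store.get<T>(key);
    if (hit !== null) return hit;
  } catch {
    // read failure counts as a miss
  }
  const fresh = await load();
  try {
    await store.set(key, fresh, ttlSeconds);
  } catch {
    // write failure is not critical
  }
  return fresh;
}

// In-memory store with TTL, for local demos and tests only.
class MemoryStore implements CacheStore {
  private entries = new Map<string, { value: unknown; expiresAt: number }>();
  private now: () => number;
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }
  async get<T>(key: string): Promise<T | null> {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= this.now()) return null;
    return entry.value as T;
  }
  async set(key: string, value: unknown, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}
```

A route handler would then call something like `withCache(store, cacheKey, 120, () => runQuery())`, where `runQuery` is whatever produces the response body.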
Cache-Aside in Practice
// In public-listings.router.ts
const cacheKey = `public-listings:${JSON.stringify({ limit, offset, city, minPrice, maxPrice })}`;
// 1. Try cache first
const cached = await cacheGet(cacheKey);
if (cached) return res.json(cached);
// 2. Cache miss: query database
const rows = await db.select()...;
const response = { listings: rows, total, limit, offset, hasMore };
// 3. Populate cache for next request (2 min TTL)
await cacheSet(cacheKey, response, 120);
return res.json(response);
And on the write side, I invalidate all cached listing pages when data changes. Since Redis DEL doesn't support glob patterns, I use Upstash's SCAN + DEL approach:
// In listing.router.ts: PATCH, DELETE, publish, draft
// Invalidate all public-listing and home-feed cache entries
const keys = await scanKeys('public-listings:*', 'home-feed:*');
if (keys.length > 0) await cacheDel(...keys);
The scanKeys helper uses Upstash's scan command to find keys matching a pattern, then cacheDel removes them in one batch. This is important because Redis's DEL command only accepts exact key names. You can't pass DEL public-listings:* and expect it to work like a glob. The TTL-based expiry acts as a safety net if SCAN misses anything.
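The helper itself isn't shown, so here is one way it could look. This is a sketch, not the shipped code: it takes the client as an explicit argument (the version above takes only patterns), and the `ScanClient` interface is an assumed shape modeled on Redis SCAN semantics (cursor in, `[nextCursor, keys]` out). Check your client version's actual `scan` signature before relying on it.

```typescript
// Assumed structural type covering only the call this sketch needs.
interface ScanClient {
  scan(
    cursor: string,
    opts: { match: string; count: number },
  ): Promise<[string, string[]]>;
}

// Walk the keyspace once per pattern until the cursor wraps back to "0".
async function scanKeys(
  client: ScanClient,
  ...patterns: string[]
): Promise<string[]> {
  const found = new Set<string>(); // SCAN may return the same key twice
  for (const match of patterns) {
    let cursor = "0";
    do {
      const [next, keys] = await client.scan(cursor, { match, count: 100 });
      keys.forEach((k) => found.add(k));
      cursor = String(next);
    } while (cursor !== "0");
  }
  return Array.from(found);
}
```

SCAN is incremental, so each call does a bounded amount of work instead of blocking the server the way KEYS would on a large keyspace.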
Result: Public listing pages that took 200-400ms on first load now serve in <50ms on cache hit. The home feed (the most-hit endpoint) went from 350ms to 30ms. That's roughly a 12x improvement on the hottest endpoint for zero monthly cost.
Layer 3: Error Tracking - Sentry Across Three Platforms
Before this, errors went to console.error and CloudWatch. CloudWatch is where errors go to die: the interface is hostile, there's no grouping, no stack traces from minified code, and you can't see which user was affected.
Why Sentry over AWS CloudWatch/X-Ray:
| Factor | CloudWatch/X-Ray | Sentry |
|---|---|---|
| Error grouping | Manual log filtering | Automatic fingerprinting |
| Stack traces (minified) | Raw text | Source map support |
| User context | DIY | Built-in |
| Alert rules | CloudWatch Alarms (complex) | 2 clicks |
| Cost | Pay per log GB ingested | Free (5K errors/mo) |
| Setup time | Hours (IAM, log groups, alarms) | Minutes |
I set up three Sentry projects, one per platform, all under one organization.
Backend: Express Error Handler
// instrument.ts: must be the FIRST import in server.ts
import * as Sentry from '@sentry/node';
Sentry.init({
dsn: process.env.SENTRY_DSN ?? '',
environment: process.env.NODE_ENV ?? 'development',
tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.2 : 1.0,
sendDefaultPii: false,
});
// app.ts: after all routes, before 404 handler
Sentry.setupExpressErrorHandler(app);
That's it. Two lines of config, one line in the Express pipeline. Every unhandled exception now shows up in Sentry with the request URL, headers, user context, and a full stack trace.
Web: Next.js with Source Maps
// sentry.client.config.ts
import * as Sentry from "@sentry/nextjs";
Sentry.init({
dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
tracesSampleRate: process.env.NODE_ENV === "production" ? 0.2 : 1.0,
});
// next.config.ts
import { withSentryConfig } from "@sentry/nextjs";
export default withSentryConfig(nextConfig, {
silent: true,
org: "your-sentry-org",
project: "your-web-project",
});
The withSentryConfig wrapper uploads source maps during build so that production errors show original TypeScript line numbers, not minified garbage.
The global-error.tsx boundary catches React rendering errors:
"use client";
import * as Sentry from "@sentry/nextjs";
import { useEffect } from "react";
export default function GlobalError({ error, reset }: {
error: Error & { digest?: string };
reset: () => void;
}) {
useEffect(() => { Sentry.captureException(error); }, [error]);
return (
<html><body>
<h2>Something went wrong</h2>
<button onClick={() => reset()}>Try again</button>
</body></html>
);
}
Mobile: React Native
// _layout.tsx
import * as Sentry from '@sentry/react-native';
Sentry.init({
dsn: process.env.EXPO_PUBLIC_SENTRY_DSN,
tracesSampleRate: __DEV__ ? 1.0 : 0.2,
});
// Wrap root component
export default Sentry.wrap(RootLayout);
The tracesSampleRate tradeoff: I set it to 0.2 in production (sample 20% of transactions for performance monitoring) and 1.0 in development. At 100K users, 20% sampling still gives you statistically significant performance data without burning through Sentry's free tier quota. If you're debugging a specific performance issue, temporarily bump it to 1.0, fix it, and drop it back.
Cost: $0 (free tier: 5K errors/month shared across all projects, 30-day retention). Note: errors and performance transactions are billed against separate quotas: 5K errors plus a distinct 10K performance-units allowance. The tracesSampleRate setting governs the performance-units quota, so sampling traces protects your transaction allowance without ever eating into your 5K error capacity.
Layer 4: Analytics - PostHog Over Mixpanel/Amplitude
The client wanted heatmaps and session replays to understand user behavior. He also needed feature flags to push promo badges to mobile without app store updates.
Why PostHog:
| Feature | Mixpanel | Amplitude | PostHog |
|---|---|---|---|
| Heatmaps | Yes (added 2025) | Yes | Yes |
| Session replay | Yes (10K free/mo) | Yes (limited) | Yes (5K free/mo) |
| Feature flags | No | Yes (free tier) | Yes |
| Mobile SDK | Yes | Yes | Yes |
| Free tier | 20M events | 50K MTUs | 1M events |
| Self-host option | No | No | Yes |
| Single tool for everything | No | No | Yes |
The landscape has shifted: Mixpanel and Amplitude both added session replay recently. But PostHog still wins here because it bundles all of this (analytics, heatmaps, replay, feature flags, A/B testing) under one SDK and one dashboard with no separate billing per product. Mixpanel still doesn't do feature flags, Amplitude's replay is limited, and neither offers self-hosting. For a startup that needs all these capabilities without managing multiple vendor contracts, PostHog is the only single-vendor option of the three.
Web: Next.js App Router Integration
Next.js App Router uses client-side navigation: the browser doesn't do full page loads when you click links. This means the PostHog SDK's automatic pageview capture misses most navigation. You need a custom PageViewTracker:
"use client";
import posthog from "posthog-js";
import { PostHogProvider as PHProvider, usePostHog } from "posthog-js/react";
import { Suspense, useEffect } from "react";
import { usePathname, useSearchParams } from "next/navigation";
if (typeof window !== "undefined" && process.env.NEXT_PUBLIC_POSTHOG_KEY) {
  posthog.init(process.env.NEXT_PUBLIC_POSTHOG_KEY, {
    api_host: process.env.NEXT_PUBLIC_POSTHOG_HOST,
    person_profiles: "identified_only",
    capture_pageview: false, // disable auto-capture; pageviews are sent manually below
    capture_pageleave: true,
  });
}
function PageViewTracker() {
  const pathname = usePathname();
  const searchParams = useSearchParams();
  const ph = usePostHog();
  useEffect(() => {
    if (pathname && ph) {
      let url = window.origin + pathname;
      if (searchParams.toString()) url += `?${searchParams.toString()}`;
      ph.capture("$pageview", { $current_url: url });
    }
  }, [pathname, searchParams, ph]);
  return null;
}
export function PostHogProvider({ children }: { children: React.ReactNode }) {
  if (!process.env.NEXT_PUBLIC_POSTHOG_KEY) return <>{children}</>;
  return (
    <PHProvider client={posthog}>
      {/* useSearchParams() needs a Suspense boundary in the App Router */}
      <Suspense fallback={null}>
        <PageViewTracker />
      </Suspense>
      {children}
    </PHProvider>
  );
}
Key decisions:
- person_profiles: "identified_only" means PostHog doesn't create person profiles for anonymous visitors. Anonymous events are still captured, just without a profile attached, which keeps event costs down. Profiles are only created for users who've logged in and been identified.
- capture_pageleave: true lets you know when users bounce. Combined with pageviews, this gives you session duration without session replay.
- Graceful degradation: if the PostHog key isn't set, the provider renders children without the SDK. No errors, no broken UI.
Mobile: React Native with Autocapture
import { PostHogProvider } from 'posthog-react-native';
export function AppProviders({ children }: { children: React.ReactNode }) {
return (
<GestureHandlerRootView style={styles.root}>
<PostHogProvider
apiKey={process.env.EXPO_PUBLIC_POSTHOG_KEY}
options={{ host: process.env.EXPO_PUBLIC_POSTHOG_HOST }}
autocapture
>
<QueryProvider>
<SafeAreaProvider>
{children}
</SafeAreaProvider>
</QueryProvider>
</PostHogProvider>
</GestureHandlerRootView>
);
}
The autocapture prop is the magic: it automatically captures screen views, button taps, and navigation events without any manual instrumentation. For the 80% of analytics questions ("which screen do users visit most?", "where do they drop off in onboarding?"), autocapture is enough.
Feature Flags for Mobile Promo Badges
This was a specific client requirement: push promotional badges to the mobile app without going through App Store review.
In PostHog, create a feature flag promo-summer-sale with a JSON payload:
{ "badge": "20% OFF", "color": "#FF6B35", "expires": "2026-08-31" }
In the mobile app:
import { useFeatureFlag, useFeatureFlagPayload } from 'posthog-react-native';
function ListingCard({ listing }) {
const showPromo = useFeatureFlag('promo-summer-sale');
const promoData = useFeatureFlagPayload('promo-summer-sale');
return (
<View>
{showPromo && promoData && (
<Badge color={promoData.color}>{promoData.badge}</Badge>
)}
{/* rest of card */}
</View>
);
}
Toggle the flag in PostHog's dashboard and the badge appears or disappears for all users within minutes. No app update, no App Store review, no deployment. That's the kind of operational leverage that separates "I shipped a feature" from "I can run a business."
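The payload above carries an `expires` date, but nothing in the component reads it. Below is a minimal sketch of a guard that validates the payload shape and honors the expiry. The `activePromo` helper is hypothetical, and treating the promo as valid through the end of the expiry day in UTC is an assumption; adjust for your own timezone rules.

```typescript
interface Promo {
  badge: string;
  color: string;
  expires: string; // calendar date, e.g. "2026-08-31"
}

// Returns the promo only when the payload has the expected shape
// and the expiry date has not passed. Anything else yields null.
function activePromo(payload: unknown, now: Date = new Date()): Promo | null {
  if (typeof payload !== "object" || payload === null) return null;
  const { badge, color, expires } = payload as Record<string, unknown>;
  if (
    typeof badge !== "string" ||
    typeof color !== "string" ||
    typeof expires !== "string"
  ) {
    return null;
  }
  // Assumption: valid through the last millisecond of the expiry day, UTC.
  const end = Date.parse(`${expires}T23:59:59.999Z`);
  if (Number.isNaN(end) || now.getTime() > end) return null;
  return { badge, color, expires };
}
```

In the card component, the flag payload would be passed through this helper and the badge rendered only when it returns a value, so a forgotten flag can't keep showing an expired sale.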
Cost: Free (1M events/month, 5K session recordings/month).
Layer 5: API Resilience - Client-Side Retry
Server errors happen. Deployments cause brief 503s. Network blips happen on mobile (especially on African 3G networks). The question is: does the user see an error, or does the client silently retry?
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
const SAFE_METHODS = new Set(['GET', 'HEAD', 'OPTIONS']);
const fetchWithRetry = async (
url: string,
options: RequestInit,
retries = 3,
): Promise<Response> => {
const method = (options.method ?? 'GET').toUpperCase();
for (let attempt = 0; attempt <= retries; attempt++) {
try {
const res = await fetch(url, options);
// Only retry on server errors AND only for safe (read-only) methods
if (res.status >= 500 && attempt < retries && SAFE_METHODS.has(method)) {
await sleep(Math.pow(2, attempt) * 1000); // 1s, 2s, 4s
continue;
}
return res;
} catch (error) {
// Network errors: retry safe methods only
if (attempt === retries || !SAFE_METHODS.has(method)) throw error;
await sleep(Math.pow(2, attempt) * 1000);
}
}
throw new Error('fetch failed after retries');
};
This wraps every API call in the web app. The retry logic is invisible to the rest of the codebase: apiFetch calls fetchWithRetry instead of fetch, and everything else stays the same.
Why only safe methods? Retrying a POST /bookings or POST /payments on a 503 could create duplicate charges. GET, HEAD, and OPTIONS are defined by HTTP as safe (read-only) methods, so retrying them should not cause side effects. For mutations, the client surfaces the error immediately and lets the user decide.
Why exponential backoff? If the server is overloaded, retrying immediately makes it worse. Waiting 1s, then 2s, then 4s gives the server time to recover. If it doesn't recover in ~7 seconds total, the error surfaces to the user.
Why only on 5xx? A 400 (bad request) or 401 (unauthorized) won't succeed on retry because the request itself is wrong. Only server errors (500, 502, 503) and network failures are worth retrying.
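The backoff schedule is easy to check in isolation. The sketch below reproduces the 1s, 2s, 4s delays and the roughly 7-second worst case, and also shows a jittered variant. The jitter is not part of the code above; it is a common addition so that many clients don't all retry at the same instant. The helper names and the 30-second cap are assumptions.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(attempt: number, baseMs = 1000, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// "Full jitter" variant (an addition, not in the original helper):
// pick a random delay between 0 and the exponential delay.
function jitteredDelay(
  attempt: number,
  random: () => number = Math.random,
  baseMs = 1000,
  capMs = 30_000,
): number {
  return Math.floor(random() * backoffDelay(attempt, baseMs, capMs));
}

// Three retries, as in fetchWithRetry's default.
const schedule = [0, 1, 2].map((attempt) => backoffDelay(attempt));
const worstCaseWaitMs = schedule.reduce((sum, ms) => sum + ms, 0);

console.log(schedule); // [ 1000, 2000, 4000 ]
console.log(worstCaseWaitMs); // 7000
```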
I also extended TanStack Query's gcTime from 5 minutes to 15 minutes:
const queryClient = new QueryClient({
defaultOptions: {
queries: {
gcTime: 15 * 60 * 1000, // 15 minutes
},
},
});
This means navigating back to a previously-loaded page shows cached data instantly while refetching in the background. For a marketplace where listings don't change by the second, 15 minutes of client cache is perfectly acceptable.
Cost: $0. It's 20 lines of code.
Layer 6: Graceful Shutdown - The Deploy That Doesn't Kill Connections
Without graceful shutdown, deploying a new version works like this:
- App Runner starts new container
- App Runner sends SIGTERM to old container
- Old container dies immediately
- Active HTTP requests get connection reset
- Socket.IO connections drop
- Database connections leak (pool not drained)
- Users see errors for 5-10 seconds
With graceful shutdown:
let shuttingDown = false;
const shutdown = async (signal: string) => {
  if (shuttingDown) return; // ignore repeated signals
  shuttingDown = true;
  console.log(`${signal} received, starting graceful shutdown...`);
  // 1. Arm the force-exit timer first, so a hung cleanup step can't block exit
  setTimeout(() => {
    console.error('[shutdown] Forced exit after timeout');
    process.exit(1);
  }, 30_000).unref();
  // 2. Stop accepting new HTTP connections; resolves once in-flight requests finish
  const httpClosed = new Promise<void>((resolve) => {
    server.close(() => {
      console.log('[shutdown] HTTP server closed');
      resolve();
    });
  });
  // 3. Close WebSocket connections (clients auto-reconnect)
  const io = getIo();
  if (io) {
    io.close(() => {
      console.log('[shutdown] Socket.IO server closed');
    });
  }
  // 4. Drain the database pool only after in-flight requests are done with it
  await httpClosed;
  if (pool) {
    await pool.end();
    console.log('[shutdown] DB pool drained');
  }
};
process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));
The .unref() on the timeout is critical: without it, the timeout itself keeps the event loop alive, preventing the process from exiting naturally if cleanup finishes before 30 seconds.
Cost: $0. Zero-downtime deploys shouldn't cost money.
Layer 7: Uptime Monitoring - Better Stack Over AWS CloudWatch
CloudWatch can monitor endpoints, but configuring alarms requires navigating a maze of SNS topics, CloudWatch Alarm configurations, and IAM permissions. Better Stack (formerly BetterUptime) does the same thing in 30 seconds.
I set up three monitors:
- Backend health: GET /health every 3 minutes. Catches server crashes, DB disconnects, cache failures.
- API response: GET /listings/public?limit=1 every 3 minutes. Catches application-level issues (broken queries, ORM errors).
- Web homepage: GET https://your-domain.com every 3 minutes. Catches DNS, SSL, and Vercel deployment issues.
The health endpoint itself is a powerful diagnostic tool:
router.get('/', async (_req, res) => {
const dbHealth = await getDbHealth();
const cacheHealth = await getCacheHealth();
const mem = process.memoryUsage();
res.json({
status: 'ok',
db: dbHealth, // healthy or failing
cache: cacheHealth, // healthy, or skipped if not configured
memory: {
rss: Math.round(mem.rss / 1024 / 1024),
heapUsed: Math.round(mem.heapUsed / 1024 / 1024),
heapTotal: Math.round(mem.heapTotal / 1024 / 1024),
},
uptime: Math.round(process.uptime()),
timestamp: new Date().toISOString(),
});
});
Is the database connected? Is Redis connected? How much memory is the process using? How long has it been running? One curl gives you the full picture. No log diving, no dashboard hopping.
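One thing to watch: the handler as shown always responds with `status: 'ok'`, and uptime monitors usually key off the HTTP status code. A minimal sketch of mapping check results to a status code follows. The `CheckResult` values and the `summarizeHealth` helper are assumptions, since the real return shape of getDbHealth and getCacheHealth isn't shown.

```typescript
// Assumed result shape for each dependency check.
type CheckResult = "ok" | "down" | "skipped";

interface HealthSummary {
  httpStatus: number;
  status: "ok" | "degraded" | "down";
}

// The database is a hard dependency; the cache is only an optimization,
// so a cache outage degrades the service without failing the check.
function summarizeHealth(db: CheckResult, cache: CheckResult): HealthSummary {
  if (db !== "ok") return { httpStatus: 503, status: "down" };
  if (cache === "down") return { httpStatus: 200, status: "degraded" };
  return { httpStatus: 200, status: "ok" };
}
```

The route would then respond with `res.status(summary.httpStatus)` and include `summary.status` in the JSON body, so a lost database connection turns into a failed monitor check instead of a 200.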
Cost: $0 (free tier: 10 monitors, 3-minute intervals).
Layer 8: Mobile Performance - The Stuff Users Feel
Mobile is where infrastructure decisions become visceral. A 200ms delay on web is invisible. A 200ms delay on mobile feels laggy.
Image Caching with expo-image
Every <Image> component that loads from the network should cache to disk:
<Image
source={{ uri: listing.photos[0].url }}
style={styles.image}
cachePolicy="memory-disk"
/>
memory-disk means: check memory cache first (instant), then disk cache (fast), then network (slow). On revisit, images load in <16ms instead of 200-800ms.
I applied this to every expo-image instance across the entire app: explore screen, listing detail, account, lifestyle, wishlist, checkout. It's a one-prop-per-image change that users immediately feel.
Response Compression
import compression from 'compression';
app.use(compression());
One line. JSON responses that were 50KB now transmit as 8KB. On 3G connections (common in the target market), this is the difference between a 2-second load and a 0.3-second load. For users in Lagos on MTN, that one line of code is worth more than any fancy architecture diagram.
Layer 9: CDN - CloudFront for Asset Delivery
This is where AWS genuinely earns its keep. No other CDN integrates with S3 as seamlessly.
Before CloudFront, every listing card in a search result triggered a presigned URL generation on the backend. With 20 listings and 3 photos each, that's 60 S3 getSignedUrl calls per page load, each generating a cryptographic signature. That's CPU time on every request for assets that don't change.
CloudFront replaces all of them with static URLs:
# Before: backend generates signed URL per request
https://my-bucket.s3.eu-north-1.amazonaws.com/photos/abc123.jpg?X-Amz-Signature=...&X-Amz-Expires=3600
# After: static CDN URL, cached at edge for 24 hours
https://d1xxxxx.cloudfront.net/photos/abc123.jpg
Setup:
- Create a CloudFront distribution with your S3 bucket as the origin
- Use Origin Access Control (OAC), not the legacy OAI. OAC uses IAM-based policies and supports SSE-KMS
- Update the S3 bucket policy to allow only CloudFront access (S3 is no longer public)
- Set cache TTL to 24 hours for images (they don't change after upload)
- Enable CORS headers for cross-origin image loading
The backend now returns CDN URLs instead of presigned URLs for public assets:
// Before (AWS SDK v3 presigned URLs, generated per-request)
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';
import { GetObjectCommand } from '@aws-sdk/client-s3';
const url = await getSignedUrl(s3Client, new GetObjectCommand({ Bucket, Key }), { expiresIn: 3600 });
// After, for public images (listing photos, avatars)
const url = `https://${process.env.CLOUDFRONT_DOMAIN}/${key}`;
Private assets (identity documents, payout details) still use presigned S3 URLs. Those should never be cached at the edge.
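If object keys can contain spaces or other characters that need escaping, plain string interpolation produces broken URLs. Here is a small sketch of a URL builder that encodes each path segment; `publicAssetUrl` is a hypothetical helper, and the domain is passed in explicitly so the function stays testable.

```typescript
// Build a CDN URL from an object key, encoding each path segment
// but keeping the "/" separators intact.
function publicAssetUrl(cdnDomain: string, key: string): string {
  const path = key
    .replace(/^\/+/, "") // tolerate keys stored with a leading slash
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/");
  return `https://${cdnDomain}/${path}`;
}

console.log(publicAssetUrl("d1xxxxx.cloudfront.net", "photos/abc123.jpg"));
// https://d1xxxxx.cloudfront.net/photos/abc123.jpg
```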
Result: Listing pages went from 60 backend getSignedUrl calls to zero. Images load from the nearest edge location (Lagos, Cape Town, Johannesburg, Nairobi) instead of making a round trip to eu-north-1. First-byte time for images dropped from ~400ms to ~50ms for users in West Africa.
Cost: First 1TB/month of transfer is free (Always Free tier). After that, ~$0.110/GB for Africa edge locations ($0.085/GB for US/EU). For a marketplace with 10K users serving images from Lagos and Johannesburg, you're well within the free tier.
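The transfer pricing above reduces to a short calculation. This sketch covers data transfer only and ignores per-request fees; it assumes 1 TB means 1024 GB and uses the per-GB rates quoted above as defaults. The function name is hypothetical.

```typescript
const FREE_GB_PER_MONTH = 1024; // 1 TB free allowance, assuming 1 TB = 1024 GB

// Monthly data-transfer cost in dollars, rounded to cents.
// ratePerGb defaults to the Africa edge rate quoted above.
function monthlyTransferCost(gbTransferred: number, ratePerGb = 0.11): number {
  const billableGb = Math.max(0, gbTransferred - FREE_GB_PER_MONTH);
  return Math.round(billableGb * ratePerGb * 100) / 100;
}

console.log(monthlyTransferCost(500)); // 0
console.log(monthlyTransferCost(1124)); // 11
console.log(monthlyTransferCost(1124, 0.085)); // 8.5
```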
The Cost Math
Here's what this entire infrastructure costs at different user scales:
| Service | 100 users | 10K users | 100K users |
|---|---|---|---|
| Upstash Redis | $0 | $0 | ~$10/mo |
| Sentry | $0 | $0 | $0* |
| PostHog | $0 | $0 | ~$0 (1M events free)** |
| Better Stack | $0 | $0 | $0 |
| CloudFront | $0 | ~$2/mo | ~$15/mo |
| App Runner | ~$5/mo (idle) | ~$15/mo | ~$50/mo |
| Vercel | $20/mo (Pro) | $20/mo (Pro) | $20/mo (Pro) |
| Total | ~$25/mo | ~$37/mo | ~$95/mo |
*May need paid tier at 100K+ if error volume exceeds 5K/month
**PostHog's free tier covers 1M events/month; an active marketplace at 100K users will likely exceed that and need a paid tier (low tens of dollars/month).
Note on Vercel pricing: Vercel's free Hobby tier is for non-commercial, personal projects only. Any client work or revenue-generating app needs the Pro plan ($20/mo). I start on Pro from day one to stay compliant with their ToS, and it's worth it for the deployment experience and Next.js optimizations.
Compare this to an all-AWS stack: ElastiCache (~$12/mo), CloudWatch custom metrics + alarms (~$10/mo), X-Ray (~$5/mo at low volume), Amplify Hosting ($15/mo), plus a self-hosted analytics solution (engineering time + hosting). You're looking at ~$60-80/month minimum at 100 users, with significantly more operational overhead. At 100K users, the gap widens further because managed instance pricing stays flat while your needs grow.
The key insight: serverless and pay-per-use services cost nothing at low traffic. You don't pay for capacity you don't use. When traffic grows, costs grow linearly, not in $50/month steps of provisioned instances.
Why Multi-Vendor Beats Single-Vendor
The client was an AWS person. Every problem looked like it had an AWS solution. And technically, AWS can do everything I did. But "can" and "should" are different.
| Need | AWS Solution | What I Used | Why |
|---|---|---|---|
| Caching | ElastiCache (~$12/mo min) | Upstash Redis ($0) | HTTP-based, no VPC config, free tier |
| Error tracking | CloudWatch + X-Ray | Sentry ($0) | 10x better DX, automatic grouping |
| Analytics | Pinpoint + custom | PostHog ($0) | Heatmaps, replay, flags, all in one |
| Monitoring | CloudWatch Alarms | Better Stack ($0) | 30 seconds to set up vs. 30 minutes |
| CDN | CloudFront | CloudFront | AWS wins here: best CDN for S3 |
| Compute | App Runner | App Runner | AWS wins here: right tool for the job |
| Frontend | Amplify | Vercel | Better Next.js support, faster deploys |
AWS won where AWS is genuinely best (CDN, compute). But for developer tooling (error tracking, analytics, monitoring), purpose-built SaaS products are years ahead of AWS's bolted-on solutions.
The multi-vendor approach also reduces blast radius. If Upstash has an outage, your caching degrades but the app works. If Sentry goes down, you stop getting error reports but users are unaffected. No single vendor failure takes down the entire platform. That's not just good architecture; it's good business continuity.
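"Caching degrades but the app works" only holds if the cache call sites treat the cache as optional. Here is a minimal sketch of that pattern, assuming a generic `Cache` interface and a hypothetical `getOrLoad` helper; it is not the wrapper from the production codebase, and no specific Redis client is implied.

```typescript
// Minimal cache contract; a Redis client adapter could sit behind it.
interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// Read-through helper that never lets a cache failure fail the request.
async function getOrLoad<T>(
  cache: Cache,
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return JSON.parse(hit) as T;
  } catch {
    // Cache outage or corrupt entry: fall through to the source of truth.
  }
  const fresh = await load();
  try {
    await cache.set(key, JSON.stringify(fresh), ttlSeconds);
  } catch {
    // Best effort only: a failed write just means the next read misses.
  }
  return fresh;
}

// Usage with a cache that is always down; the loader result still comes back.
const flakyCache: Cache = {
  get: async () => { throw new Error("cache unreachable"); },
  set: async () => { throw new Error("cache unreachable"); },
};
getOrLoad(flakyCache, "public-listings:page:1", 60, async () => ["listing-a", "listing-b"])
  .then((listings) => console.log(listings.length));
```

The trade-off is that during a cache outage every request hits the database, so the database pool still needs enough headroom to survive that window.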
This Is Not One-Size-Fits-All
Every decision here was shaped by context:
- Target market is Nigeria/Africa: 3G connections are common, so compression and CDN matter more than in markets with ubiquitous 5G
- Startup budget: free tiers are load-bearing infrastructure, not nice-to-have
- Three platforms: any tool that doesn't support web, mobile, and backend was automatically disqualified
- Solo developer: operational simplicity beats configurability. I chose tools that work with one API key, not tools that need IAM policies
If you're a VC-funded company with $500K ARR, sure, use Datadog and LaunchDarkly and a dedicated DevOps engineer. But if you're bootstrapping a marketplace in an emerging market, this stack gets you from 10 to 100,000 users without re-architecting and without burning runway on infrastructure.
The best infrastructure is the one you forget exists because it just works.
What's Next
The infrastructure is production-ready. Three things on the near-term roadmap:
- Upstash QStash: background job processing for emails, push notifications, and payout processing. Currently these run synchronously in the request cycle. QStash calls your backend via HTTP on a schedule: no worker process, no message broker, just another serverless service that costs nothing at low volume.
- PostHog user identification: calling `posthog.identify()` after login to link anonymous events to real users. This unlocks cohort analysis ("users who booked in Week 1 vs. Week 2") and per-user session replays for support debugging.
- Database read replicas: when query volume outgrows a single PostgreSQL instance (the threshold depends on instance size, query complexity, and caching effectiveness; monitor your connection utilization and query latency), add a read replica and route read-heavy queries to it. The ORM and connection pool are already structured to make this a config change, not a rewrite.
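For the read-replica item, the routing decision itself is small. The sketch below shows one possible shape, a round-robin router that is generic over whatever pool or ORM handle you use. `createPoolRouter` and its method names are hypothetical, and the strings stand in for real connection pools.

```typescript
// Generic over the pool type so it works with any driver or ORM handle.
interface Pools<P> {
  primary: P;
  replicas: P[];
}

function createPoolRouter<P>(pools: Pools<P>) {
  let next = 0;
  return {
    // Writes (and reads that must see their own writes) go to the primary.
    forWrite: (): P => pools.primary,
    // Reads rotate across replicas; with none configured, use the primary.
    forRead: (): P => {
      if (pools.replicas.length === 0) return pools.primary;
      const pool = pools.replicas[next % pools.replicas.length];
      next += 1;
      return pool;
    },
  };
}

// Strings stand in for real connection pools in this illustration.
const router = createPoolRouter({
  primary: "primary-db",
  replicas: ["replica-1", "replica-2"],
});
console.log(router.forRead(), router.forRead(), router.forRead(), router.forWrite());
```

Replicas lag the primary slightly, so flows like "create booking, then show it" are usually safer reading from the primary right after the write.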
The foundation is set. Everything from here is incremental.
All code samples are from a production codebase serving real users across multiple African cities.
Stack: Node.js + Express 5, Next.js 15, React Native (Expo), PostgreSQL, Drizzle ORM, Socket.IO, AWS S3 + App Runner, Vercel, Upstash Redis, Sentry, PostHog, Better Stack.
Fact-Check Changelog
Corrections applied after publication review (verified via official docs and pricing pages):
| # | Original Claim | Correction | Source |
|---|---|---|---|
| 1 | Upstash free tier: "10,000 commands/day" | 500K commands/month | Upstash pricing |
| 2 | ElastiCache min: "$15/mo for cache.t3.micro" | ~$12/mo for cache.t4g.micro (current smallest) | ElastiCache pricing |
| 3 | Better Stack free: "5 monitors" | 10 monitors | Better Stack uptime |
| 4 | CloudFront Africa egress: "$0.085/GB" | $0.110/GB for Africa ($0.085 is US/EU) | CloudFront pricing |
| 5 | Mixpanel: no heatmaps, no session replay | Mixpanel added both in 2025 (10K replays free/mo) | Mixpanel session replay docs |
| 6 | Amplitude: no feature flags, no session replay | Amplitude has both (flags free up to 50K MTUs) | Amplitude feature flags |
| 7 | `cacheDel('public-listings:*')` implied DEL supports globs | Redis DEL only accepts exact keys; need SCAN+DEL | Redis DEL docs |
| 8 | N+1 fix: per-conversation COUNT query | Still N+1; rewritten to a single GROUP BY query | n/a |
| 9 | fetchWithRetry retried all methods | Dangerous for POST (payments/bookings); now limited to safe methods only | HTTP idempotency RFC 7231 |
| 10 | ElastiCache "needs VPC peering" | App Runner uses VPC Connector (not VPC peering) | App Runner VPC docs |
| 11 | Vercel Hobby tier for commercial app | Hobby is non-commercial only per ToS; use Pro ($20/mo) | Vercel fair use policy |
| 12 | AWS SDK v2 `s3.getSignedUrl('getObject', ...)` | v2 is deprecated; v3 uses `getSignedUrl` from `@aws-sdk/s3-request-presigner` | AWS SDK v3 docs |
| 13 | "All-AWS stack: $150+/month minimum" | More realistic: ~$60-80/mo at low traffic | Calculated from individual service pricing pages |
| 14 | Sentry note implied errors + transactions share one 5K quota | They're separate buckets: 5K errors plus a distinct 10K performance-units quota/month; tracesSampleRate affects only the performance-units quota | Sentry pricing |
| 15 | "Scale up until ~50K users" | Threshold is workload-dependent, not a fixed user count | n/a |
| 16 | Indexes section showed 9 naive single-column indexes | Replaced with actual shipped indexes: 1 partial + 2 composite (replacing what would otherwise be 8 single-column indexes). Includes EXPLAIN ANALYZE benchmarks (buffer reads, timing) | Validated on 20K-listing synthetic dataset |
| 17 | N+1 fix showed readAt/senderId filter that doesn't match schema | Rewritten to match actual implementation: DISTINCT ON for last message + JOIN+GROUP BY for unread counts using lastReadAt timestamps | Correctness-tested across user/host/admin roles |