DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

The Future of SRE: What the Next 5 Years Look Like

3 min read
Why Setting Up Observability Takes Forever (And What To Do About It)

4 min read
Great Stack to Doesn't Work #10 — Season Finale: "When PagerDuty Calls at 3 AM"

12 min read
Stop breaking production: a migration path to unified platforms 🛠️

1 min read
Building a Career in SRE: From Junior to Staff

2 min read
The Human-in-the-Loop SRE: Designing Automation Escalation Policies for AI-Assisted Operations

15 min read
CPU and DB were bored, yet every site timed out: a slow-read bot that starved Apache's workers

5 min read
I'm building a read-only context engine for Kubernetes and AI agents

6 min read
The Post-Mortem That Taught My System How to Fix Itself Using Hindsight

7 min read
I Asked Claude to Map My Infrastructure. Then I Asked a Purpose-Built Tool.

4 min read
Surviving the region you run in: failover on Aurora DSQL, and what the demo proves

5 min read
What is SRE? A Beginner's Guide to Site Reliability Engineering

5 min read
Ongrid : open-source ops AI agent for RCA and remediation from chat

1 min read
Incident Automation: What to Automate, What to Leave to Humans

2 min read
I built a small tool to answer a question I’ve asked too many times: is this production ready?

2 min read