Originally published on kuryzhev.cloud
The Scenario
Your runbook was written six months ago by someone who no longer works here — and it's 3am, three payments-api pods are OOMKilling in a cascade, and the on-call engineer is staring at a Confluence page that references a Datadog dashboard that got renamed in February. This is the real failure mode nobody talks about in postmortems: the documentation debt that compounds silently until it explodes during a P1.
I've been in that seat. The thing that changed how I handle incident documentation wasn't a better wiki tool or a stricter postmortem template. It was treating prompt engineering for SRE workflows as reusable infrastructure — not one-off ChatGPT queries. Generic prompts fail in SRE contexts for a specific reason: they have no system topology, no severity framing, and no structured output contract. You ask "how do I fix an OOMKill?" and you get a textbook answer that has nothing to do with your 512Mi memory limit, your Redis connection pool, or your GKE 1.29 cluster. What you actually need is a prompt pattern that injects your environment's context and enforces a structured response you can act on immediately.
This post walks through three production-tested prompt engineering patterns for SRE playbooks: structured context injection for runbook generation, two-step postmortem synthesis, and LLM-as-reviewer for runbook auditing. These patterns are most powerful when maintained like code — versioned, reviewed, and updated after every incident.
Prerequisites
Before you run any of this, get your environment straight. You'll need Python 3.11+ with two key libraries: openai==1.30.0 and tiktoken==0.7.0. If you're on an older openai SDK, watch out — the v1.x release broke every legacy openai.ChatCompletion.create() call from v0.28.x. I spent an embarrassing hour debugging that the first time I upgraded. Install the correct versions explicitly:
pip install openai==1.30.0 tiktoken==0.7.0
For the model itself, you need GPT-4o access on your OpenAI API key. GPT-4o supports the response_format={"type": "json_object"} parameter we rely on for structured output, and its 128k-token context window handles roughly 90 pages of logs, which is enough for most incidents. Input cost is approximately $5 per million tokens as of mid-2024, so a two-step postmortem chain on a 50k-token log file runs about $0.35 per incident once output tokens are included. At 100 incidents per month that's $35, which is negligible. Set max_tokens on every call anyway to cap completion length, and truncate inputs before sending, because malformed or oversized inputs can balloon costs fast.
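The per-incident figure can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not a billing tool: the $15 per million output tokens price is an assumption (only the input price is quoted above), the output sizes are simply the max_tokens caps used later in this post, and estimate_cost is a hypothetical helper, not part of the openai SDK.

```python
# Rough cost model for the two-step postmortem chain.
# Prices are USD per million tokens, assumed mid-2024 GPT-4o list prices.
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 15.00  # assumption: not quoted in this post


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single API call."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# Step 1: 50k tokens of logs in, timeline capped at 2,000 tokens out.
step1 = estimate_cost(50_000, 2_000)
# Step 2: the ~2,000-token timeline in, postmortem capped at 3,000 tokens out.
step2 = estimate_cost(2_000, 3_000)

per_incident = step1 + step2
print(f"per incident: ${per_incident:.3f}")
print(f"100 incidents/month: ${per_incident * 100:.2f}")
```

Under those assumptions the worst case lands a little below the rounded $0.35 figure, so the monthly estimate holds.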
If you're in an air-gapped environment and can't send data to OpenAI, Ollama 0.1.32+ with llama3:70b is a viable alternative. Run ollama pull llama3:70b — just know it requires roughly 40GB of disk and a minimum of 64GB RAM. Without that RAM headroom you'll hit swap thrashing and inference becomes unusable under load.
You'll also want at least one real or synthetic postmortem in Markdown or JSON, a sample runbook in plain text, and optionally a PagerDuty API token for pulling live incident metadata. Store your prompt templates under a versioned path — I use the convention ./prompts/sre/postmortem_synthesis_v2.txt. Version your prompts like you version your Terraform modules. They are infrastructure.
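If prompts are versioned by filename, something has to pick the right file at runtime. Here is a minimal sketch of a loader, assuming the `<name>_v<N>.txt` naming convention shown above; load_latest_prompt is a hypothetical helper, and in practice you may prefer to pin an explicit version rather than always taking the newest.

```python
import re
from pathlib import Path

# Assumed naming convention: <name>_v<N>.txt, e.g. postmortem_synthesis_v2.txt
VERSION_RE = re.compile(r"^(?P<name>.+)_v(?P<version>\d+)\.txt$")


def load_latest_prompt(prompt_dir: Path, name: str) -> tuple[int, str]:
    """Return (version, text) of the highest-numbered template for `name`."""
    candidates = []
    for path in prompt_dir.glob(f"{name}_v*.txt"):
        match = VERSION_RE.match(path.name)
        if match and match.group("name") == name:
            # Compare versions numerically so v10 sorts after v2.
            candidates.append((int(match.group("version")), path))
    if not candidates:
        raise FileNotFoundError(f"no prompt template {name!r} in {prompt_dir}")
    version, path = max(candidates, key=lambda c: c[0])
    return version, path.read_text(encoding="utf-8")
```

Usage would look like `version, template = load_latest_prompt(Path("./prompts/sre"), "postmortem_synthesis")`. Logging the version next to each generated artifact makes it possible to trace a bad runbook back to the prompt that produced it.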
Pattern 1 — Structured Context Injection for Playbook Generation
The core problem with generic LLM runbooks is that the model has no idea what your service actually looks like. This pattern solves that by building a structured context block — service name, dependencies, SLO thresholds, known failure modes, namespace — and injecting it as a YAML-style front-matter block before the user query. Pair this with role-priming in the system message and JSON schema enforcement in the prompt, and you get output that's immediately actionable rather than generically educational.
The function below sets up the role-primed system message, builds the context block from a service metadata dictionary, and enforces a strict JSON schema for the response. We use response_format={"type": "json_object"} on the API call so the model returns syntactically valid JSON (as long as the reply is not cut off by max_tokens). We also run a token count before every API call using tiktoken. GPT-4o tokenizes with the o200k_base encoding; cl100k_base belongs to GPT-4 and GPT-3.5 and gives only an approximate count for GPT-4o. Counting first helps you catch oversized prompts before they trigger the openai.BadRequestError: 400 - context_length_exceeded error that will otherwise ambush you when someone injects an entire application log as context.
Watch out for this: Never inject raw 10,000-line log files as context. The model attends poorly to tokens in the middle of a massive context window. Pre-filter to error-level events and key timestamps before injection. I learned this the hard way after getting a perfectly formatted runbook that addressed the wrong error entirely — the actual OOMKill signal was buried at line 4,300 and the model focused on a harmless warning at line 200.
# sre_prompt_engine.py
# Prompt engineering patterns for SRE playbooks and postmortems
# Requires: openai==1.30.0, tiktoken==0.7.0, python 3.11+
import json
from pathlib import Path

import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment


# --- Token safety: keep injected context well under GPT-4o's 128k window ---
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    # encoding_for_model resolves gpt-4o to o200k_base (not cl100k_base)
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))


def truncate_to_token_limit(text: str, limit: int = 80000, model: str = "gpt-4o") -> str:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    if len(tokens) > limit:
        print(f"[WARN] Truncating context from {len(tokens)} to {limit} tokens")
        tokens = tokens[:limit]
    return enc.decode(tokens)
# --- Pattern 1: Structured Context Injection for Playbook Generation ---
def generate_playbook(service_context: dict, alert_description: str) -> dict:
    """
    Injects structured service metadata + alert into a role-primed prompt.
    Returns a JSON-structured runbook with immediate_actions, rollback_command, etc.
    """
    system_prompt = """You are a senior SRE at a company running Kubernetes 1.29 on GKE
with Datadog monitoring and PagerDuty alerting. You write precise, actionable runbooks.
Never suggest commands you cannot verify. Flag uncertain steps with [UNVERIFIED]."""

    # Build context block from service metadata
    context_block = f"""
## Service Context
- Service: {service_context['name']}
- Dependencies: {', '.join(service_context['dependencies'])}
- SLO: {service_context['slo_target']}% availability over 30 days
- Known failure modes: {json.dumps(service_context['known_failures'], indent=2)}
- Namespace: {service_context['namespace']}
"""

    user_prompt = f"""
{context_block}
## Active Alert
{alert_description}

Generate a runbook for this alert. Respond ONLY with valid JSON matching this schema:
{{
  "title": "string",
  "severity": "P1|P2|P3",
  "immediate_actions": ["string"],
  "diagnostic_commands": ["string"],
  "escalation_path": ["string"],
  "rollback_command": "string",
  "blast_radius": "string",
  "estimated_resolution_time": "string"
}}
"""

    token_count = count_tokens(system_prompt + user_prompt)
    print(f"[INFO] Sending {token_count} tokens to GPT-4o")

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,  # low temp = consistent commands, less hallucination
        max_tokens=1500,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return json.loads(response.choices[0].message.content)
# --- Example usage ---
if __name__ == "__main__":
    service_ctx = {
        "name": "payments-api",
        "dependencies": ["postgres-primary", "redis-cache", "stripe-gateway"],
        "slo_target": 99.9,
        "namespace": "production",
        "known_failures": [
            "OOMKill under >500 concurrent requests",
            "Redis connection pool exhaustion during deploy",
        ],
    }
    alert = "OOMKilling detected on payments-api pods (3/5 pods restarted in last 10 minutes). Memory limit: 512Mi."

    playbook = generate_playbook(service_ctx, alert)
    print(json.dumps(playbook, indent=2))

    # Save versioned output: treat generated runbooks as artifacts
    output_path = Path("./runbooks/generated/payments-api-oomkill.json")
    output_path.parent.mkdir(parents=True, exist_ok=True)  # write_text fails if the directory is missing
    output_path.write_text(json.dumps(playbook, indent=2))
The temperature=0.2 setting is non-negotiable for SRE use cases. I tested this extensively: at temperature=0.7 the model starts inventing kubectl flags that don't exist. At 0.2 the output is far more consistent across runs, though not strictly deterministic. More on that in the verification section.
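Going back to the warning about raw log files earlier in this section, the pre-filtering step could look something like this. It is a sketch under assumptions: it expects each log line to carry a level token such as ERROR or FATAL, the keyword list is illustrative, and prefilter_logs is a hypothetical helper you would adapt to your own log format.

```python
import re

# Assumption: log lines contain a severity token like ERROR, FATAL, or CRITICAL.
LEVEL_RE = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b")
# Illustrative keywords worth keeping even on lower-severity lines.
KEYWORD_RE = re.compile(r"OOMKill|out of memory", re.IGNORECASE)


def prefilter_logs(raw_logs: str, max_lines: int = 500) -> str:
    """Keep only error-level or keyword-matching lines, in original order."""
    kept = [
        line
        for line in raw_logs.splitlines()
        if LEVEL_RE.search(line) or KEYWORD_RE.search(line)
    ]
    # If the result is still long, keep the earliest matches.
    return "\n".join(kept[:max_lines])
```

Keeping the earliest matches rather than the latest is a design choice, not a rule: it biases toward the first signs of trouble, so tune it (or keep both ends) depending on how your incidents usually unfold.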
Pattern 2 — Two-Step Postmortem Synthesis
The single biggest mistake I see teams make with LLM postmortem generation is using one monolithic prompt for both timeline extraction and root cause analysis. One prompt, one job. In my experience, splitting the work into a two-step chain noticeably improves RCA accuracy: the first pass extracts a clean chronological timeline from raw logs and Slack threads, and the second pass uses that structured timeline to synthesize the full postmortem. You're giving the model cleaner, denser signal at each step rather than asking it to do two cognitively distinct tasks simultaneously.
The two-step chain below enforces Google SRE postmortem format via explicit section headers in the prompt instruction. The first call extracts timeline events as a JSON array with timestamp, event, and source fields. The second call uses that output to write the full document with ## Summary, ## Timeline, ## Root Cause, ## Contributing Factors, ## Impact, and ## Action Items sections.
Critical security note: Never inject raw PagerDuty API responses or Datadog alert payloads directly into prompts sent to OpenAI. Strip PII, internal hostnames, and IP addresses first. Build a sanitization function and make it mandatory in your pipeline — not optional. I treat this the same way I treat secrets management: if it's not enforced in code, it will eventually be violated under incident pressure at 3am.
# --- Pattern 2: Two-Step Postmortem Synthesis ---
def synthesize_postmortem(raw_logs: str, slack_thread: str) -> str:
    """
    Step 1: Extract timeline from raw data.
    Step 2: Synthesize full postmortem from timeline.
    Splitting into two calls improves RCA accuracy significantly.
    """
    # NOTE: the line below only truncates, it does not sanitize.
    # In production, run your sanitization function on raw_logs and
    # slack_thread first to strip PII, internal IPs, and hostnames.
    safe_logs = truncate_to_token_limit(raw_logs, limit=60000)

    # Step 1: Timeline extraction (one job, clean output)
    step1_response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,  # even lower temp for factual extraction
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""Extract a chronological incident timeline from the following logs and Slack thread.
Output as a JSON array of objects with fields: timestamp, event, source (logs|slack|alert).

Logs:
{safe_logs}

Slack thread:
{slack_thread}""",
        }],
    )
    timeline = step1_response.choices[0].message.content

    # Step 2: Full postmortem synthesis using the extracted timeline
    step2_response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,
        max_tokens=3000,
        messages=[{
            "role": "user",
            "content": f"""Using this incident timeline, write a Google SRE-format postmortem.
Use these exact section headers:
## Summary
## Timeline
## Root Cause
## Contributing Factors
## Impact
## Action Items

Timeline:
{timeline}""",
        }],
    )
    return step2_response.choices[0].message.content
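The code above leaves sanitization as a placeholder, so here is one way it might be filled in. Treat it as a starting point under stated assumptions: the internal hostname suffixes (.internal, .corp, .local) are guesses you must replace with your own DNS zones, only IPv4 addresses are handled, and sanitize is a hypothetical helper rather than anything from a library.

```python
import re

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Assumption: internal hosts end in one of these suffixes; adjust for your zones.
INTERNAL_HOST_RE = re.compile(r"\b[\w-]+(?:\.[\w-]+)*\.(?:internal|corp|local)\b")


def sanitize(text: str) -> str:
    """Replace emails, IPv4 addresses, and internal hostnames with placeholders."""
    # Emails go first because their domain part could otherwise match as a host.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = IPV4_RE.sub("[IP]", text)
    text = INTERNAL_HOST_RE.sub("[HOST]", text)
    return text
```

Calling `sanitize(raw_logs)` and `sanitize(slack_thread)` at the top of synthesize_postmortem, before truncation, keeps the step from being skipped under pressure. Regexes like these will miss things (IPv6, names, account IDs), so a review of what actually leaves your network is still worth doing.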
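The function above returns Markdown, so validate it by checking that every required section header is present before writing to Confluence or Notion. If you have step 2 emit JSON instead, pipe it through jq: something as simple as jq -e 'has("root_cause") and has("action_items")' will catch incomplete outputs before they become official documentation (the -e flag makes jq exit non-zero when the expression is false). The OOMKill diagnostic command I use to pull initial incident context before feeding to the synthesizer: kubectl get events --field-selector reason=OOMKilling -n production --sort-by='.lastTimestamp'. If that comes back empty on your cluster, the kill may only be recorded in the pod's last terminated state (reason OOMKilled), which kubectl describe pod will show.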
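Since synthesize_postmortem returns Markdown with fixed section headers, a completeness check can be sketched directly in Python. The header list is taken from the step 2 prompt; missing_sections is a hypothetical helper name.

```python
# Headers the step 2 prompt asks the model to use, in order.
REQUIRED_SECTIONS = [
    "## Summary",
    "## Timeline",
    "## Root Cause",
    "## Contributing Factors",
    "## Impact",
    "## Action Items",
]


def missing_sections(postmortem_md: str) -> list[str]:
    """Return required headers that never appear on a line of their own."""
    present = {line.strip() for line in postmortem_md.splitlines()}
    return [header for header in REQUIRED_SECTIONS if header not in present]
```

A pipeline step can then refuse to publish when `missing_sections(doc)` is non-empty. This only proves the sections exist, not that their contents are right, so it complements human review rather than replacing it.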
Pattern 3 — Runbook Validation with LLM-as-Reviewer
This is the pattern I use on every PR that touches a runbook. The idea is simple: put the LLM in a critic role, give it the existing runbook as context, and ask it to output a gap analysis as a numbered list with severity tags. The prompt explicitly includes a "devil's advocate" instruction — ask the model to simulate what breaks if each step is followed during a partial network outage. That single addition has caught more gaps than any human review I've seen.
# validate_runbook.py
# Pattern 3: LLM-as-Reviewer, audits existing runbooks for gaps
# Outputs gap analysis suitable for GitHub PR comment or Confluence
import sys

from openai import OpenAI

client = OpenAI()

CRITIC_SYSTEM_PROMPT = """You are a senior SRE performing a runbook audit.
Your job is to find gaps, outdated commands, missing rollback steps, and
unclear escalation paths. Be specific. Reference line numbers where possible.
Do not suggest commands you cannot verify. If uncertain, flag with [UNVERIFIED]."""


def audit_runbook(runbook_text: str, service_name: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,
        max_tokens=2000,
        messages=[
            {"role": "system", "content": CRITIC_SYSTEM_PROMPT},
            {"role": "user", "content": f"""
Audit this runbook for service: {service_name}

For each issue found, output a line in this format:
[SEVERITY] Line ~N: Description of issue. Suggested fix.
Severity levels: [CRITICAL] [WARN] [INFO]

Also answer: What breaks if this runbook is followed during a partial network outage?

Runbook:
---
{runbook_text}
---
"""},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    runbook_path = sys.argv[1] if len(sys.argv) > 1 else "./runbooks/payments-api-oomkill.md"
    with open(runbook_path) as f:
        runbook = f.read()

    result = audit_runbook(runbook, service_name="payments-api")
    print(result)

    # Write to file for GitHub Actions PR comment injection
    with open("./runbook_audit_output.txt", "w") as out:
        out.write(result)
A real audit output from this pattern looks like this:
[CRITICAL] Line ~12: `kubectl delete pod` used without --grace-period=0 flag.
During OOMKill cascade this will hang. Use: kubectl delete pod <name> -n production --grace-period=0 --force
[WARN] Line ~24: Rollback step references image tag 'latest'.
This is non-deterministic. Pin to a specific SHA or semver tag.
[WARN] Line ~31: Escalation path lists @oncall-infra but no fallback if unresponsive after 15 min.
Add secondary escalation to #incident-bridge Slack channel.
[INFO] Line ~8: Datadog dashboard link is hardcoded to staging environment URL.
Update to production dashboard ID.
Partial network outage scenario:
Step 3 calls `kubectl rollout restart` which requires API server connectivity.
If the network partition isolates the ops host from the control plane,
this step will hang indefinitely with no timeout. Add: --timeout=60s
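Because the audit output follows a fixed line format, it can be turned into a merge gate. The sketch below parses the `[SEVERITY] Line ~N:` lines shown above and blocks when any CRITICAL finding is present; parse_findings and should_block_merge are hypothetical helpers, and the blocking policy is an assumption you would set for your own team.

```python
import re

# Matches lines such as "[CRITICAL] Line ~12: description" from the audit format.
FINDING_RE = re.compile(r"^\[(CRITICAL|WARN|INFO)\]\s+Line\s+~?(\d+):\s*(.*)$")


def parse_findings(audit_text: str) -> list[dict]:
    """Extract severity, approximate line number, and summary from each finding."""
    findings = []
    for line in audit_text.splitlines():
        match = FINDING_RE.match(line.strip())
        if match:
            findings.append({
                "severity": match.group(1),
                "line": int(match.group(2)),
                "summary": match.group(3),
            })
    return findings


def should_block_merge(findings: list[dict]) -> bool:
    """Assumed policy: any CRITICAL finding blocks the runbook PR."""
    return any(f["severity"] == "CRITICAL" for f in findings)
```

In a CI job, reading ./runbook_audit_output.txt and exiting non-zero when `should_block_merge(parse_findings(text))` is true would be enough to fail the check. Lines the model formats differently are silently skipped, so it is worth logging the raw output as well.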
Include the instruction "Do not suggest commands you cannot verify. If uncertain, flag with [UNVERIFIED]" explicitly in your prompt. This single line meaningfully reduces hallucinated kubectl flags. I stopped trusting runbook audits that didn't include this guard after a model confidently suggested a --cascade=orphan flag combination that doesn't exist in Kubernetes 1.29.
Verify and Test
Prompt engineering for SRE playbooks is only trustworthy if you validate consistency. Run the same playbook generation prompt five times with temperature: 0.2 and diff the outputs. Acceptable variance is formatting only — different whitespace, slightly reworded descriptions. If your rollback_command or escalation_path changes between runs, your prompt context is underspecified. Add more constraints until those fields stabilize across five runs.
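The five-run comparison is easy to automate. A minimal sketch, assuming the playbook is a dict with the rollback_command and escalation_path fields from the Pattern 1 schema; unstable_fields is a hypothetical helper that takes any zero-argument callable, so it can wrap the real generator or a stub.

```python
import json
from typing import Callable

# Fields that should not change between runs of the same prompt.
STABLE_FIELDS = ("rollback_command", "escalation_path")


def unstable_fields(generate: Callable[[], dict], runs: int = 5) -> dict[str, int]:
    """Call `generate` `runs` times; return fields with more than one distinct value."""
    seen: dict[str, set[str]] = {field: set() for field in STABLE_FIELDS}
    for _ in range(runs):
        playbook = generate()
        for field in STABLE_FIELDS:
            # Serialize so strings and lists can be compared the same way.
            seen[field].add(json.dumps(playbook.get(field), sort_keys=True))
    return {field: len(values) for field, values in seen.items() if len(values) > 1}
```

With the code from Pattern 1 in scope, `unstable_fields(lambda: generate_playbook(service_ctx, alert))` returns an empty dict when both fields are stable. Anything else names the field that drifted and how many variants appeared, which tells you where the context needs more constraints.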
Validate every generated kubectl and gcloud command against a dry-run or staging cluster before embedding in production runbooks. This is non-negotiable. The model can generate syntactically valid commands that are semantically wrong for your cluster version or RBAC configuration. A dry-run costs nothing and catches these before they cause damage during an actual incident.
For a more rigorous quality gate, use an LLM-as-judge pattern: a second prompt evaluates the first output against a rubric covering completeness, accuracy, and actionability, returning a score from 1 to 5. Any output scoring below 4 gets flagged for human review before it's committed to your runbook repository. This is the pattern I'd recommend integrating into your CI pipeline alongside the runbook auditor — check out the DevOps automation patterns on kuryzhev.cloud for complementary CI pipeline setups that fit this workflow.
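The gating half of the judge pattern can be sketched without touching the API. Two assumptions here: the judge is asked for one score per rubric criterion (rather than a single overall score) and the lowest one decides, and an unparseable reply is treated as a failure. build_judge_prompt and needs_human_review are hypothetical names.

```python
import json

RUBRIC = ("completeness", "accuracy", "actionability")
PASS_THRESHOLD = 4  # outputs scoring below this go to human review


def build_judge_prompt(playbook: dict) -> str:
    """Build the second-pass prompt that asks the model to grade a runbook."""
    criteria = ", ".join(RUBRIC)
    return (
        "Score the following runbook from 1 to 5 on each of: "
        f"{criteria}. Respond ONLY with JSON like "
        '{"completeness": 1, "accuracy": 1, "actionability": 1}.\n\n'
        f"Runbook:\n{json.dumps(playbook, indent=2)}"
    )


def needs_human_review(judge_reply: str) -> bool:
    """True if the reply is unparseable or any rubric score is below threshold."""
    try:
        scores = json.loads(judge_reply)
        return any(int(scores[criterion]) < PASS_THRESHOLD for criterion in RUBRIC)
    except (ValueError, KeyError, TypeError):
        # Fail closed: a judge we cannot read is not a pass.
        return True
```

The prompt from build_judge_prompt would be sent with the same client.chat.completions.create call used elsewhere in this post, and the reply passed to needs_human_review. Failing closed on bad JSON is deliberate, since a gate that passes on parse errors is not much of a gate.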
Finally, if your pipeline stores postmortems as JSON, pipe the output through jq to assert required fields exist: jq -e 'has("root_cause") and has("action_items")' postmortem.json. The -e flag matters: without it jq prints false but still exits 0, so the pipeline would not fail. With it, a failed assertion fails the pipeline. Treat LLM output validation the same way you treat infrastructure state validation: trust but verify, every time. The OpenAI structured outputs documentation covers the response_format parameter in detail, and the kubectl cheat sheet is worth keeping open when validating generated diagnostic commands.
These three prompt engineering patterns for SRE playbooks — context injection, two-step postmortem synthesis, and LLM-as-reviewer — are most powerful when your context blocks are maintained like code: versioned in Git, reviewed in PRs, and updated after every incident that reveals a gap. The patterns themselves are stable. The context they inject is what degrades over time if you let it. Treat your ./prompts/sre/ directory with the same discipline you apply to your Terraform modules, and your on-call engineers will have documentation that's actually useful at 3am — not a monument to someone who left in February.