What is an Error-Budget Policy?

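In site reliability engineering (SRE), an error budget is the amount of unreliability an SLO permits (100% minus the SLO target). An error-budget policy is the rulebook that states what the team does as that budget is consumed: it governs releases, risk-taking, and incident response.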

When to Use

An error-budget policy is most useful for high-availability systems, rapid-release environments, product migrations, and feature rollouts.

It enforces guardrails like “pause releases when budget is spent” or “require canary tests if burn rate is high.”
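The two guardrails quoted above can be sketched as a small decision function. This is only an illustration: the names `BudgetStatus` and `release_decision`, the returned labels, and the burn-rate threshold of 2.0 are assumptions made for the example, not part of any standard policy.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    # Hypothetical container for the two inputs the guardrails need.
    budget_remaining_fraction: float  # 1.0 = untouched, 0.0 = fully spent
    burn_rate: float  # 1.0 = spending the budget exactly on pace for the window


def release_decision(status: BudgetStatus, high_burn_threshold: float = 2.0) -> str:
    """Map budget status to a release rule (threshold is an assumed value)."""
    if status.budget_remaining_fraction <= 0.0:
        # "Pause releases when budget is spent."
        return "freeze"
    if status.burn_rate >= high_burn_threshold:
        # "Require canary tests if burn rate is high."
        return "canary_required"
    return "normal"


print(release_decision(BudgetStatus(budget_remaining_fraction=0.5, burn_rate=3.0)))
```

A real policy would likely define more tiers and name who can grant exceptions; the point here is only that each guardrail becomes an explicit, testable condition.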

Example

If your API has a 99.9% availability SLO, the error budget is 43.2 minutes of downtime per 30-day month. After a 20-minute outage, 23.2 minutes remain, so the policy might slow future deploys until reliability stabilizes.
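The arithmetic in this example can be checked with a short sketch. It assumes a 30-day month (43,200 minutes) and a time-based availability SLO; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1.0 - slo) * window_minutes


def remaining_budget_minutes(slo: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after subtracting downtime already incurred in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)
remaining = remaining_budget_minutes(0.999, downtime_minutes=20.0)
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

With a different window (for example 28 or 31 days) the budget changes proportionally, which is why a policy should state its measurement window explicitly.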

Why Is It Important?

It aligns engineering speed with customer experience, prevents over-reliance on gut feelings, and creates objective criteria for risk-taking.

Interview Tips

Explain the chain SLI → SLO → error budget clearly, give a quick STAR (Situation, Task, Action, Result) example, and show that you understand burn-rate alerts, release freezes, and exception handling.
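One way to walk through the SLI → SLO → error budget chain is with a request-based sketch like the one below. It assumes the SLI is the fraction of good requests and defines burn rate as the observed error rate divided by the error rate the SLO allows; the function names and sample numbers are illustrative only.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: fraction of events that were good."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as fully reliable
    return good_events / total_events


def burn_rate(observed_sli: float, slo: float) -> float:
    """Observed error rate relative to the error rate the SLO allows.

    1.0 means the budget is being spent exactly on pace for the window;
    values above 1.0 mean it will run out before the window ends.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_error_rate = 1.0 - slo  # the error budget as a fraction
    return (1.0 - observed_sli) / allowed_error_rate


observed = sli(good_events=995_000, total_events=1_000_000)
rate = burn_rate(observed, slo=0.999)
print(f"SLI: {observed:.4f}, burn rate: {rate:.1f}x")
```

A burn-rate alert is then a threshold on this value over some lookback window; the threshold and window lengths are policy choices and are not specified here.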

Trade-offs

Pros: Data-driven risk management, predictable reliability, faster shipping when budget is healthy.

Cons: Strict enforcement may delay features; loose goals risk customer dissatisfaction.