How do you define SLIs/SLOs/SLAs and error budgets for a new service?

A reliable service starts with clear promises and measurable signals. SLIs are the exact measurements of user experience. SLOs are the targets you aim to meet for those measurements. SLAs are legal or commercial commitments. Error budgets are the small allowance of failure that keeps teams shipping while staying reliable. Set them well and you get a feedback loop that aligns product velocity with dependable user outcomes.

Why it Matters

Every system design interview expects you to balance reliability with speed. Hiring managers listen for crisp definitions, sensible targets, and a practical plan to measure and act. In real platforms this framework improves on-call focus, reduces alert noise, and gives leadership a shared language to trade reliability for features when it makes sense. In distributed systems the SLI and SLO vocabulary also connects directly to capacity planning, traffic shaping, and incident response.

How it Works Step by Step

  1. Map the critical user journeys. List the tasks that define success for the user. Examples include opening the app, searching, adding to cart, checkout, video start, and posting a comment. Keep the list short so that it represents the business value path.

  2. Choose an SLI per journey. Pick one or two user-centric signals per journey. Good defaults:

    • Availability SLI: share of requests that succeed
    • Latency SLI: P95 or P99 request latency under a threshold
    • Quality SLI: percent of video starts without rebuffering, freshness of feed items, correctness of price totals
  3. Define precise measurement. For each SLI specify the unit, the path, and the window. Example:

    • Scope: production traffic only
    • Metric: HTTP 2xx and 3xx count as success; 4xx and 5xx count as failure (many teams exclude 4xx client errors, so decide this explicitly)
    • Latency: measured at the server boundary or at a client beacon; choose one and stick to it
    • Time window: rolling 30 days for SLOs, shorter windows for alerting
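A minimal sketch of these measurement rules in Python. The `Request` record, the 2xx/3xx success rule, and the 300 ms threshold mirror the defaults above; this is an illustration, not a production pipeline:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # measured at the server boundary

def availability_sli(requests):
    """Share of requests that succeed: 2xx and 3xx count as success."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.status < 400)
    return good / len(requests)

def latency_sli(requests, threshold_ms=300):
    """Share of requests answered under the latency threshold."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)
```

Measuring at the server boundary, as chosen here, means client network time is invisible; a client beacon would flip that trade-off.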
  4. Establish the baseline. If you have an existing version, measure current SLIs for at least one week. If the service is new, estimate from similar services. The baseline informs targets that are ambitious but reachable.

  5. Set SLO targets. Attach a numeric target and a window to each SLI. Examples:

    • Availability SLO: 99.9 percent over 30 days
    • Latency SLO: P99 under 300 milliseconds over 30 days

    Use fewer targets rather than many. Each target must be meaningful to the user, not just to infrastructure.
  6. Compute the error budget. The error budget is one minus the SLO target over the window. For 99.9 percent availability over 30 days, the budget is 0.1 percent of bad minutes or bad requests in that period. Treat the budget as a shared account for the team.
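The budget arithmetic is simple enough to show directly; a 99.9 percent target over 30 days leaves roughly 43 minutes of full downtime:

```python
def error_budget(slo_target, total_units):
    """Error budget = (1 - SLO target) x total units (requests or minutes)."""
    return (1.0 - slo_target) * total_units

minutes_in_window = 30 * 24 * 60                 # 43,200 minutes in 30 days
budget = error_budget(0.999, minutes_in_window)  # ~43.2 minutes of full downtime
```

The same function works for request-based budgets: pass total request volume instead of minutes.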

  7. Create budget policies. Decide actions based on budget health. Examples:

    • If the burn rate breaches a fast threshold across 5-minute windows, page the team immediately
    • If cumulative burn reaches 50 percent mid-window, limit risky launches
    • If the budget is fully consumed, freeze changes except high-value fixes until the next window
    • If the budget is healthy, allow extra experiments

    Burn-rate alerts are strong because they scale with target strictness.
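Such a policy can be sketched as a small decision function; the 50 percent and mid-window thresholds mirror the examples above and are tuning knobs, not fixed rules:

```python
def release_policy(budget_consumed, days_elapsed, window_days=30):
    """Map budget health to a release decision.

    budget_consumed: fraction of the window's error budget already burned.
    """
    if budget_consumed >= 1.0:
        return "freeze"     # only high-value fixes until the next window
    if budget_consumed >= 0.5 and days_elapsed < window_days // 2:
        return "restrict"   # limit risky launches mid-window
    return "normal"         # healthy budget: ship, experiments allowed
```

Encoding the policy as code makes it enforceable in a deploy pipeline rather than a wiki page.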
  8. Design alerting from SLIs. Alert on user pain, not on low-level host metrics. Use two stages:

    • Fast burn alerts: small windows to catch severe incidents
    • Slow burn alerts: longer windows to catch chronic issues

    Keep alert definitions in version control next to the code.
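A sketch of two-stage burn-rate evaluation, assuming the commonly cited 14.4x fast and 6x slow multipliers; the exact multipliers and windows are a tuning choice, not prescribed by the text:

```python
def burn_rate(error_rate, slo_target):
    """Observed error rate divided by the allowed rate.
    A sustained burn rate of 1.0 uses up the budget exactly at window end."""
    return error_rate / (1.0 - slo_target)

def evaluate_alerts(err_5m, err_1h, err_6h, slo=0.999):
    """Two-stage scheme: fast burn pages, slow burn files a ticket.
    Requiring both the 5m and 1h windows to burn avoids paging on blips."""
    page = burn_rate(err_5m, slo) >= 14.4 and burn_rate(err_1h, slo) >= 14.4
    ticket = burn_rate(err_6h, slo) >= 6.0
    return page, ticket
```

Note how the thresholds scale automatically with target strictness: a 99.99 percent SLO makes the same error rate burn ten times faster than a 99.9 percent SLO.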
  9. Add SLAs only when needed. SLAs are contracts with customers and usually include credits or penalties. Keep SLAs looser than SLOs and include exclusions such as provider failures or scheduled maintenance. Be sure you can measure and prove compliance.

  10. Instrument end to end. Add request tracing, structured logs, and metrics with consistent labels such as endpoint, region, downstream service, and customer tier. For latency, store full histograms so you can compute P95 and P99 accurately; averages hide tail behavior.
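A toy fixed-bucket histogram shows why percentiles come from distributions rather than averages. The bucket boundaries here are illustrative; production systems use a metrics library's histogram type:

```python
import bisect

# Bucket upper bounds in milliseconds (illustrative layout, last is +inf).
BUCKETS = [10, 25, 50, 100, 250, 500, 1000, float("inf")]

def record(counts, latency_ms):
    """Increment the first bucket whose upper bound covers the observation."""
    counts[bisect.bisect_left(BUCKETS, latency_ms)] += 1

def percentile_upper_bound(counts, q):
    """Upper bound of the bucket holding the q-th quantile.
    Histograms give a bounded estimate, not an exact percentile."""
    target = q * sum(counts)
    running = 0
    for bound, count in zip(BUCKETS, counts):
        running += count
        if running >= target:
            return bound
    return BUCKETS[-1]
```

Because buckets are mergeable across hosts and regions, this representation lets you compute global P99 correctly, which per-host averages cannot do.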

  11. Review and iterate. Publish a monthly reliability report covering SLI trends, SLO compliance, budget balance, top incidents, and actions. When product goals shift, update SLIs and targets rather than adding more. Simplicity wins.

  12. Bake into the delivery workflow. Gate risky deployments on budget health. Require a rollback plan for any change that can threaten SLIs. Track budget drawdown per release to learn which features are risky.

Real world example

Consider a checkout service similar to a large retail platform. The critical journey is place order.

  • SLIs: availability of order placement and P99 latency of the place order API.

  • SLOs: 99.95 percent availability over 30 days and P99 under 450 milliseconds.

  • Error budget: 0.05 percent of order attempts may fail in the window.

  • Policies: two-stage burn-rate alerts. If the 5-minute burn exceeds 10 percent of the monthly budget, page immediately. If the 1-hour burn exceeds 5 percent of the monthly budget, page the secondary on-call. If 50 percent of the budget is gone before day 15, limit feature rollouts and prioritize fixes.

  • SLAs: enterprise customers get a 99.9 percent availability guarantee with service credits if it is violated. The SLA is looser than the SLO to preserve a safety margin.

This setup directs attention to user pain (minimal failed orders, fast confirmation) while still letting product teams ship.
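Under these targets the budget translates into concrete numbers; the monthly order volume below is a hypothetical figure for illustration:

```python
# Checkout example: 99.95 percent availability over a 30-day window.
slo = 0.9995
minutes_in_window = 30 * 24 * 60                 # 43,200 minutes in 30 days
budget_minutes = (1 - slo) * minutes_in_window   # ~21.6 minutes of unavailability

attempts = 2_000_000                             # hypothetical monthly order attempts
budget_failures = (1 - slo) * attempts           # ~1,000 failed order attempts allowed
```

Stating the budget in both minutes and requests helps the team reason about full outages and partial error rates with the same number.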

Common Pitfalls or Trade-offs

  • Picking SLIs that are not user-centric. CPU or memory metrics do not reveal user experience. Anchor on success rate, latency, and content quality.

  • Too many SLOs. More targets reduce clarity and create conflicting alerts. Focus on the journeys that directly produce business value.

  • Targets that are too tight too early. A new service rarely hits perfect numbers. Start with a level you can meet, then raise it after real usage.

  • Alert storms from noisy metrics. If you alert on every low-level metric, you will page the team without user pain. Tie alerts to budget burn on the SLIs.

  • Unverifiable SLAs. Never sign a promise you cannot measure. If you lack client-side beacons for mobile, do not claim the mobile experience in the SLA.

  • Ignoring variance across regions or tenants. Aggregate metrics can hide pockets of pain. Break out SLIs by region and customer tier when traffic patterns differ.

Interview Tip

Expect a prompt like this: You are building a photo upload service for a global app. Pick two SLIs, propose SLO targets with windows, compute the error budget, and describe one fast and one slow burn alert. Also state when you would freeze releases.

A strong answer uses availability and P99 latency as SLIs, proposes realistic targets such as 99.9 percent availability and a 250 millisecond P99, explains the monthly budget, and ties alerts to budget burn. Closing with a release-freeze rule shows understanding of product trade-offs.

Key Takeaways

  • SLIs measure user experience. SLOs set numeric targets. SLAs are contractual promises. Error budgets enable safe speed.

  • Choose one or two user centric SLIs per critical journey and define scope and window precisely.

  • Set reachable targets from baselines and compute budgets as one minus the target.

  • Drive alerts from burn rates on SLIs and connect policies to release decisions.

  • Keep SLAs looser than SLOs and offer them only where you can measure and prove compliance.

Comparison Table

| Approach | What it optimizes | How targets are set | When to use | Risk |
|---|---|---|---|---|
| SLI with SLO and error budget | User experience and delivery speed | From baseline and product goals | Default for modern services | Requires strong instrumentation |
| SLI monitoring without SLO | Observability only | No target | Early prototype | No clear alerting and no release policy |
| SLA only | Commercial promises | Contract negotiation | External commitments | Can overpromise without internal SLO safety margin |
| OKRs only | Business outcomes | Quarterly planning | High-level direction | Not actionable during incidents |
| Performance budget in frontend | Page weight and render speed | Asset size and timing goals | Web performance work | Does not capture availability or correctness |

FAQs

Q1. What is the difference between SLI, SLO, and SLA?

SLI is a measurement of user experience, such as success rate or P99 latency. SLO is a target for that measurement, such as 99.9 percent. SLA is a contract with customers that may include credits when the promise is not met.

Q2. How do I pick the right SLIs for a new service?

Start from the critical journey. Choose availability and latency for the main endpoints, then add a quality SLI if the product demands it, such as video starts without rebuffering or feed freshness.

Q3. What window should I use for SLOs and alerts?

Use a rolling 30 day window for SLOs in most services. Alerts should use burn rate across short and long windows to catch both fast and slow problems.

Q4. How strict should my first SLO be?

Pick a target you can meet with small headroom, then tighten it over time. For a new backend service, 99.9 percent availability is a common starting point.

Q5. How do error budgets change release decisions?

When the budget is healthy, you can take more risk and deploy more often. When the budget is low or consumed, pause risky launches and focus on reliability work.

Q6. Do I need SLAs if I only serve internal teams?

Often no. Internal teams may prefer internal objectives and an operational level agreement without credits. Use SLAs only when you have external customers or legal need.

Further Learning

To go deeper on reliability design patterns and trade-offs, explore Grokking System Design Fundamentals for step-by-step foundations, then level up your production strategy with Grokking Scalable Systems for Interviews, which covers advanced reliability, traffic control, and capacity planning in large distributed systems.

If you want end to end interview practice with realistic prompts, see Grokking the System Design Interview.

TAGS
System Design Interview
System Design Fundamentals
CONTRIBUTOR
Design Gurus Team

Copyright © 2025 Design Gurus, LLC. All rights reserved.