SLO Design for Platform Teams: A Practical Guide

Why Most SLOs Fail

Here's the uncomfortable truth: most organizations that adopt SLOs end up with numbers on a dashboard that nobody looks at and nobody acts on.

We've seen three failure patterns repeatedly:

1. SLOs disconnected from user experience: measuring server CPU instead of checkout success rate
2. SLOs without error budgets: targets with no consequences when they're missed
3. SLOs set by ops, not agreed with product: no business buy-in means no prioritization

This guide shows you how to design SLOs that actually drive reliability decisions.

Start with User Journeys, Not Infrastructure

The first mistake teams make is starting with infrastructure metrics. "Our API should have 99.9% uptime" — but what does that mean for users?

Instead, start by mapping critical user journeys:

User Journey: "Complete a Purchase"
├── Browse products (search, filter, view)
├── Add to cart
├── Enter payment details
├── Submit order
└── Receive confirmation

Each step has different reliability requirements.

Choosing the Right SLI Type

For each journey, select the appropriate Service Level Indicator:

SLI Type	What It Measures	Best For
Availability	% of requests that succeed	APIs, web pages
Latency	% of requests faster than threshold	User-facing endpoints
Quality	% of responses with correct data	Data pipelines, search
Freshness	% of data updated within threshold	Dashboards, caches

Example: E-commerce Platform SLIs

yaml

slis:
  checkout_availability:
    description: "Successful checkout completions"
    good_events: "HTTP 2xx responses to POST /api/checkout"
    total_events: "All POST /api/checkout requests"

  search_latency:
    description: "Search results returned within 200ms"
    good_events: "GET /api/search responses < 200ms"
    total_events: "All GET /api/search requests"

  order_processing_freshness:
    description: "Orders processed within 5 minutes"
    good_events: "Orders with processing_time < 300s"
    total_events: "All submitted orders"
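
Each of these SLIs reduces to the same ratio: good events divided by total events. Here's a minimal sketch of that calculation, assuming the two counts are already available from your metrics backend. The `SliCounts` type and `sliRatio` function are hypothetical names for illustration, not part of any library.

```typescript
// Hypothetical shape for the counts behind one SLI over some time window.
interface SliCounts {
  goodEvents: number;
  totalEvents: number;
}

// SLI = good events / total events, as a fraction between 0 and 1.
// With no traffic there is nothing to measure, so return null rather
// than pretending the service was perfect or broken.
function sliRatio(counts: SliCounts): number | null {
  if (counts.totalEvents <= 0) {
    return null;
  }
  return counts.goodEvents / counts.totalEvents;
}

// Illustrative numbers only: 99,950 successful checkouts out of 100,000.
const checkoutSli = sliRatio({ goodEvents: 99950, totalEvents: 100000 });
// checkoutSli is 0.9995, i.e. 99.95% availability
```

Returning `null` for an empty window is a design choice: it keeps a quiet period from showing up as either 0% or 100% on a dashboard.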

Setting the Right Target

The 100% Trap

Never set an SLO at 100%. It's unattainable in practice, because hardware, networks, and dependencies all fail eventually, and attempting it means you'll never ship anything: every change carries risk.

How to Choose Your Number

The target should reflect the point at which users start to notice and care:

SLO Target	Error Budget (30-day window)	Meaning
99.99%	4.3 minutes	Users never notice issues
99.9%	43 minutes	Brief, rare disruptions
99.5%	3.6 hours	Occasional degradation OK
99%	7.2 hours	Non-critical service
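
The budget figures in this table come from simple arithmetic on the length of the window. As a rough sketch, assuming a time-based SLO over a 30-day window (the function name `errorBudgetMinutes` is ours, for illustration):

```typescript
// Error budget for a time-based SLO: the fraction of the window that is
// allowed to be "bad", converted to minutes.
function errorBudgetMinutes(sloTarget: number, windowDays: number = 30): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// Reproduce the table rows, rounded to one decimal place.
const sloTargets = [0.9999, 0.999, 0.995, 0.99];
const budgetMinutes = sloTargets.map((t) => errorBudgetMinutes(t).toFixed(1));
// budgetMinutes: ["4.3", "43.2", "216.0", "432.0"]
// (216 minutes = 3.6 hours, 432 minutes = 7.2 hours)
```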

Rule of thumb: Start with 99.5% for internal services, 99.9% for customer-facing services. You can always tighten later.

The Error Budget Conversation

This is where SLOs become powerful. The error budget is the complement of your SLO (100% minus the target): it's how much unreliability you can tolerate.

SLO: 99.9% availability over 30 days
Error budget: 0.1% = 43 minutes of downtime

If you've used 40 of your 43 minutes by day 20:
→ STOP deploying non-critical changes
→ Focus on reliability improvements
→ Only emergency fixes until the budget resets

This creates a natural tension between velocity and reliability. When the error budget is healthy, ship fast. When it's burning, slow down and stabilize.
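
The day-20 example above can be checked with a few lines of arithmetic. This sketch assumes a fixed 30-day window and a simple linear projection of the pace so far. The `BudgetSnapshot` type and both function names are hypothetical.

```typescript
// Hypothetical snapshot of error budget usage partway through a window.
interface BudgetSnapshot {
  budgetMinutes: number;   // total error budget for the window
  consumedMinutes: number; // bad minutes accumulated so far
  daysElapsed: number;     // days into the window
  windowDays: number;      // length of the SLO window
}

// Fraction of the budget already spent (exceeds 1 if overspent).
function consumedFraction(s: BudgetSnapshot): number {
  return s.consumedMinutes / s.budgetMinutes;
}

// Linear projection: if the average pace so far continues, how many
// bad minutes will have accumulated by the end of the window?
function projectedMinutes(s: BudgetSnapshot): number {
  return (s.consumedMinutes / s.daysElapsed) * s.windowDays;
}

const day20: BudgetSnapshot = {
  budgetMinutes: 43.2,
  consumedMinutes: 40,
  daysElapsed: 20,
  windowDays: 30,
};

const spentSoFar = consumedFraction(day20);        // roughly 0.93 of the budget
const projectedTotal = projectedMinutes(day20);    // 60 minutes, well over 43.2
const shouldSlowDown = projectedTotal > day20.budgetMinutes; // true
```

A linear projection is the simplest possible model. A single large incident early in the window will make it pessimistic, so treat it as a conversation starter rather than a forecast.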

Implementing SLOs in Practice

Step 1: Instrument Your SLIs

Use OpenTelemetry to instrument your services:

typescript

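// Record SLI data with OpenTelemetry
import { metrics } from '@opentelemetry/api';
// Assumption: an Express-style handler, inferred from the res.status().json() calls
import type { Request, Response } from 'express';

// Assumption: the checkout business logic lives elsewhere in the service
declare function processCheckout(body: unknown): Promise<unknown>;

const meter = metrics.getMeter('checkout-service');
// total_events and good_events for the availability SLI
const checkoutCounter = meter.createCounter('checkout.requests.total');
const checkoutSuccessCounter = meter.createCounter('checkout.requests.success');
// Latency distribution in milliseconds, for a latency SLI
const checkoutLatency = meter.createHistogram('checkout.latency.ms', { unit: 'ms' });

async function handleCheckout(req: Request, res: Response): Promise<void> {
  const start = Date.now();
  checkoutCounter.add(1);

  try {
    const result = await processCheckout(req.body);
    checkoutSuccessCounter.add(1); // counted only when processing succeeds
    res.json(result);
  } catch (error) {
    // Failures count toward the total but not toward success
    res.status(500).json({ error: 'Checkout failed' });
  } finally {
    checkoutLatency.record(Date.now() - start);
  }
}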

Step 2: Build SLO Dashboards

Your SLO dashboard should answer three questions at a glance:

1. Are we meeting our SLOs right now? (Current attainment)
2. How much error budget remains? (Budget remaining)
3. At this rate, will we exhaust our budget? (Burn rate and projection)

Step 3: Set Up Burn Rate Alerts

Don't alert on individual errors — alert on the rate at which you're consuming your error budget:

yaml

# Multi-window burn rate alert
alerts:
  - name: high_burn_rate_critical
    condition: burn_rate > 14.4x for 5 minutes  # Exhausts a 30-day budget in about 50 hours
    severity: page
    action: "Immediate investigation required"

  - name: high_burn_rate_warning
    condition: burn_rate > 6x for 30 minutes     # Exhausts a 30-day budget in 5 days
    severity: ticket
    action: "Investigate within 4 hours"

  - name: elevated_burn_rate
    condition: burn_rate > 1x for 6 hours         # On track to exhaust budget
    severity: notify
    action: "Review in next standup"
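
The thresholds in this config all follow from one ratio. Here's a minimal sketch, assuming a 30-day SLO window and event counts pulled from your metrics backend (all function names are hypothetical). Burn rate is the observed error rate divided by the error rate the SLO allows, and dividing the window length by the burn rate gives the time until a full budget is gone. The two-window check reflects a common practice for multi-window alerts, not something the config spells out.

```typescript
// Burn rate = observed error rate / error rate allowed by the SLO.
// 1x means the budget lasts exactly one SLO window.
function burnRate(badEvents: number, totalEvents: number, sloTarget: number): number {
  if (totalEvents <= 0) {
    return 0; // no traffic, nothing burned
  }
  const observedErrorRate = badEvents / totalEvents;
  const allowedErrorRate = 1 - sloTarget;
  return observedErrorRate / allowedErrorRate;
}

// Hours until a full budget is gone if this burn rate is sustained.
function hoursToExhaustion(rate: number, windowDays: number = 30): number {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}

// Assumed multi-window rule: fire only when both a longer and a shorter
// lookback exceed the threshold, so the alert clears soon after recovery.
function shouldAlert(longWindowRate: number, shortWindowRate: number, threshold: number): boolean {
  return longWindowRate > threshold && shortWindowRate > threshold;
}

const criticalHours = hoursToExhaustion(14.4); // about 50 hours
const warningHours = hoursToExhaustion(6);     // 120 hours, i.e. 5 days

// Illustrative: 1.5% errors against a 99.9% SLO is roughly a 15x burn.
const currentRate = burnRate(150, 10000, 0.999);
const pageNow = shouldAlert(currentRate, currentRate, 14.4); // true
```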

Getting Business Buy-In

The Error Budget Policy

Write a one-page error budget policy that engineering AND product sign off on:

1. When the error budget is healthy (> 50% remaining): Ship features freely, run experiments
2. When the error budget is in caution (25-50% remaining): Continue shipping but increase review rigor
3. When the error budget is critical (< 25% remaining): Freeze non-critical deploys, prioritize reliability
4. When the error budget is exhausted (0% remaining): Full feature freeze until reliability is restored
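
A policy like this is easier to enforce when it's encoded somewhere a deploy pipeline can read it. The sketch below maps the fraction of budget remaining to the four tiers. The type, function, and variable names are hypothetical, and treating exactly 25% and 50% as caution is our reading of the boundaries.

```typescript
type BudgetState = "healthy" | "caution" | "critical" | "exhausted";

// Map the fraction of error budget remaining (1 = untouched, 0 = gone)
// to the four policy tiers.
function budgetState(remainingFraction: number): BudgetState {
  if (remainingFraction <= 0) return "exhausted";
  if (remainingFraction < 0.25) return "critical";
  if (remainingFraction <= 0.5) return "caution";
  return "healthy";
}

// The agreed action for each tier, copied from the policy text.
const deployPolicy: Record<BudgetState, string> = {
  healthy: "Ship features freely, run experiments",
  caution: "Continue shipping but increase review rigor",
  critical: "Freeze non-critical deploys, prioritize reliability",
  exhausted: "Full feature freeze until reliability is restored",
};

// Illustrative: 7% of the budget left puts the team in the critical tier.
const currentState = budgetState(0.07); // "critical"
const currentAction = deployPolicy[currentState];
```

Using `Record<BudgetState, string>` means the compiler will complain if a tier is added to the type without a matching action.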

Speaking the Language of Business

Don't say: "Our p99 latency exceeded the SLO threshold."

Do say: "3% of customers experienced slow checkouts this week, which our data shows correlates with a 12% drop in conversion rate. We need to invest 2 sprints in reliability before launching the new payment feature."

Common Pitfalls

1. Too many SLOs: Start with 3-5 covering your most critical user journeys. More SLOs = more noise.
2. SLOs on vanity metrics: CPU utilization is not an SLO. User-facing success rate is.
3. No consequences: An SLO without an error budget policy is just a number.
4. Set and forget: Review and adjust SLOs quarterly based on user feedback and business changes.

Getting Started

1. Map your top 3 critical user journeys
2. Define one SLI per journey
3. Set initial targets at 99.5% for internal services and 99.9% for customer-facing ones (you can tighten later)
4. Build a dashboard and burn rate alerts
5. Write a one-page error budget policy
6. Get product and engineering to sign it

The entire process takes 2-3 weeks. The cultural shift takes longer — but it starts with the first conversation about error budgets.

Ready to implement SLOs for your organization? Talk to our SRE team.