
February 28, 2026 • 3 min read

Status Pages Aren't Observability: The Only Reliable Outage Signal Is Your Users (and Your Telemetry)

Provider status pages are communications artifacts; reliable outage detection requires synthetic user-journey probes, SLO-based alerting, and first-party telemetry.

At 2:13 a.m. my phone buzzed with the kind of alert that feels personal.

Not “CPU high” or “pod restarts.” It was worse: a Slack message from a customer. “Login is stuck. Our on-call is watching it fail in real time. Is this you or your provider?”

I did what most of us do in that moment. I opened the provider status page. Everything was green.

For a few minutes I lived in the gap between “customers can’t use the product” and “officially operational”. In that gap, your incident process either becomes crisp or it becomes theater.

That night didn’t teach me a new monitoring tool. It taught me a detection philosophy: a provider status page is not an outage signal. It is a communications artifact. Your outage signal must come from your own SLOs and your own telemetry, especially when your system depends on hyperscalers and SaaS edges.

The real problem isn’t downtime, it’s ambiguity

Most incidents are not clean, binary failures. They are partial outages.

  • A subset of users can’t authenticate.
  • A particular API write path times out.
  • A single region is degraded.
  • A dependency is flaky enough to break workflows but not flaky enough to trip someone else’s thresholds.

Status pages struggle with that reality for structural reasons:

  1. They are scoped to the vendor’s view of their own infrastructure, not your user’s journey.
  2. They lag because acknowledgement is gated by human verification and communications processes.
  3. They compress complexity into a single label, often “operational,” “degraded,” or “partial outage,” which hides which surfaces are actually affected.

If you build detection on those pages, you are outsourcing your pager to someone else’s PR and incident comms workflow.

The turning point: treating customer journeys as first-class signals

Instead of asking “Are our services healthy?”, we asked “Can a user complete the workflow that creates value?”

That changed everything.

A good detection signal has three properties:

  1. It is external. It sees what the customer sees.
  2. It is synthetic. It does not require a real user to fail first.
  3. It is tied to an SLO. It has an explicit error budget and alert policy.

Example synthetic probe (k6)

import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  vus: 1,
  iterations: 1,
  thresholds: {
    checks: ["rate>0.99"],
    http_req_duration: ["p(95)<2000"],
  },
};

const BASE_URL = __ENV.BASE_URL;
const USER = __ENV.SYNTH_USER;
const PASS = __ENV.SYNTH_PASS;

export default function () {
  // Step 1: authenticate as the dedicated synthetic user.
  const loginRes = http.post(`${BASE_URL}/api/auth/login`, JSON.stringify({
    username: USER,
    password: PASS,
  }), { headers: { "Content-Type": "application/json" } });

  const loginOk = check(loginRes, {
    "login status is 200": (r) => r.status === 200,
  });

  // Bail out early: if login failed, the write step would only add a
  // misleading 401 on top of the real failure, and .json() could throw.
  if (!loginOk) {
    return;
  }

  const token = loginRes.json("token");

  // Step 2: exercise the value-creating write path with the fresh token.
  const writeRes = http.post(`${BASE_URL}/api/orders`, JSON.stringify({
    sku: "synthetic-sku",
    qty: 1,
  }), {
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${token}`,
    },
  });

  check(writeRes, {
    "write status is 201": (r) => r.status === 201,
  });

  sleep(1);
}
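The third property, an explicit SLO and alert policy, comes down to burn-rate math over the probe's success rate. Here is a minimal sketch in plain JavaScript; the 99.9% target is an assumption for illustration, and 14.4 is the commonly used "fast burn" threshold (spending roughly 2% of a 30-day error budget in one hour). How the window error rates are aggregated from probe results is left to your metrics store.

```javascript
// Sketch: multiwindow burn-rate paging decision for a probe-backed SLO.
// SLO target and thresholds are illustrative assumptions, not prescriptions.
const SLO_TARGET = 0.999;
const ERROR_BUDGET = 1 - SLO_TARGET; // 0.001

// Burn rate = observed error rate / error budget.
// A burn rate of exactly 1 spends the budget over the full SLO period.
function burnRate(errorRate) {
  return errorRate / ERROR_BUDGET;
}

// Page only when both a long and a short window are burning fast,
// so a single flaky probe iteration does not wake anyone up.
function shouldPage(longWindowErrorRate, shortWindowErrorRate) {
  const FAST_BURN = 14.4; // ~2% of a 30-day budget consumed in one hour
  return (
    burnRate(longWindowErrorRate) >= FAST_BURN &&
    burnRate(shortWindowErrorRate) >= FAST_BURN
  );
}
```

The two-window condition is what keeps probe-driven paging actionable: the long window proves the problem is sustained, the short window proves it is still happening.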

Final takeaways

  • Provider status pages are communications, not detection.
  • Synthetic customer journey probes catch partial outages that infra metrics miss.
  • SLO-based alerting turns probe failures into actionable paging.
  • Multi-viewpoint monitoring across the real user path is critical.
  • Treat probes as production code, with ownership and runbooks.
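The multi-viewpoint point can be made concrete: run the same journey probe from several vantage points and page only when a quorum of them fails, which filters out single-network blips. A minimal sketch, with made-up viewpoint names:

```javascript
// Sketch: majority quorum over journey-probe results from multiple
// vantage points. Viewpoint names are illustrative.
function journeyDown(results) {
  const viewpoints = Object.keys(results).length;
  const quorum = Math.floor(viewpoints / 2) + 1; // simple majority
  const failures = Object.values(results).filter((ok) => !ok).length;
  return failures >= quorum;
}

const results = {
  "us-east-residential": false, // probe failed
  "eu-west-cloud": false,       // probe failed
  "ap-south-mobile": true,      // probe passed
};

// Two of three viewpoints failing meets the majority quorum.
console.log(journeyDown(results)); // → true
```

A single failing viewpoint stays an investigation signal; a quorum failing is a page.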
