platform-notes

February 22, 2026 • 10 min read

LLMs Didn't Kill Debugging - They Just Moved It Upstream

LLM-assisted coding shifts failures from runtime crashes to design-time assumptions, demanding stronger invariant-driven reviews.

The incident didn’t start with a crash. No red dashboards. No 500s. No pager.

It started with a perfectly valid pull request that sailed through CI, passed unit tests, and made everyone feel good about velocity, right up until the on-call noticed something uncomfortable in the morning: a spike in “successful” requests that were doing the wrong thing.

The worst failures in modern systems aren’t always noisy. With LLM-assisted development, we’re shipping a new category of quiet failure: code that compiles, tests, deploys, and lies.

Debugging didn’t disappear. It moved upstream, from runtime failures to design-time assumptions.

And if you’re a senior engineer who feels slower with AI than you expected… you’re not imagining it.


Debugging used to mean “why did it crash?”

In DevOps and platform work, debugging has always been a negotiation with reality:

  • a container is OOMKilled because memory limits were wrong
  • a node pool is draining because of a misconfigured disruption budget
  • a rollout stalls because readiness probes don’t reflect real readiness
  • a database is slow because an index isn’t being used

Classic debugging is empirical. The system did something observable. You collect evidence (logs, metrics, traces), form a hypothesis, run an experiment, and converge.

Most of our tooling (observability stacks, runbooks, incident retros) assumes this shape: symptom → signal → root cause → fix → prevention.

LLMs change the shape.


How LLMs change failure modes

LLMs don’t usually produce code that crashes immediately.

They produce code that is:

  • plausible
  • idiomatic
  • consistent with common patterns
  • wrong in subtle ways when your environment deviates from the “typical” case

Which is most environments.

The failure modes shift from runtime correctness to semantic correctness.

You still have bugs, but the system is now broken in ways that look “successful” to CI and even to production monitoring.

Examples I’ve seen (or had to unwind) in platform contexts:

  • Permission logic that “looks right” but violates least privilege by defaulting to broad roles.
  • Retries added everywhere that silently amplify load during partial outages (classic self-inflicted DDoS).
  • Terraform that plans cleanly but encodes a wrong invariant (like assuming a resource name is stable across environments).
  • Kubernetes manifests that are valid YAML but wire the wrong service account, wrong selector, or wrong port.

The outputs are syntactically correct. The intent is what’s broken.

That’s the upstream shift: debugging becomes less about what happened and more about what we assumed.


The rise of “semantic bugs”

I’ve started calling these semantic bugs: the program behaves exactly as written, but not as intended.

They thrive in the gap between:

  • what you meant
  • what you asked the model
  • what the model inferred
  • what the system actually guarantees

Semantic bugs are painful because they don’t give you the usual debugging footholds. There’s no stack trace that says:

“Invariant violated: you assumed idempotency.”

Instead, you get outcomes that are “reasonable,” just wrong enough to cause business impact later.

A concrete example: “helpful” caching that breaks consistency

Here’s a simplified version of a bug class that shows up a lot with AI-generated code: adding caching where it seems beneficial.

from functools import lru_cache
import requests

@lru_cache(maxsize=1024)
def get_feature_flags(env: str) -> dict:
    resp = requests.get(f"https://flags.internal/{env}")
    resp.raise_for_status()
    return resp.json()

This is clean. It’s readable. It’s “best practice” in a vacuum.

In a real platform, it can be catastrophic if:

  • feature flags are expected to change quickly (incident response toggles)
  • the service runs as a long-lived process (cache never naturally clears)
  • env isn’t the real cache key (tenant/app/version/user segmentation exists)
  • stale flags cause unsafe rollouts

Nothing crashes. Everything “works.” Your rollback lever turns into a placebo.

Debugging this later is miserable because metrics and logs tell you requests are fine. It’s the meaning that’s wrong: you assumed flags were static enough to cache.
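One way to repair this class of bug is to make the staleness tolerance explicit instead of implicit. The sketch below replaces lru_cache with a minimal TTL cache; the decorator, the 30-second TTL, and fetch_flags_from_service are all illustrative assumptions, not the original service's API.

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds: float):
    """Cache results per argument tuple, but expire them after ttl_seconds.

    Unlike lru_cache, stale entries get refetched, so flag flips propagate
    within a bounded window. Minimal sketch: not thread-safe, no size cap.
    """
    def decorator(fn):
        cache: dict = {}  # args -> (expires_at, value)

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # fresh enough: serve cached value
            value = fn(*args)
            cache[args] = (now + ttl_seconds, value)
            return value

        return wrapper
    return decorator

def fetch_flags_from_service(env: str) -> dict:
    # Stand-in for the real HTTP call (e.g. requests.get(...)); hypothetical.
    return {"env": env, "rollback_enabled": True}

@ttl_cache(ttl_seconds=30.0)  # 30s is a stated staleness budget, not a default
def get_feature_flags(env: str) -> dict:
    return fetch_flags_from_service(env)
```

The point isn’t the decorator; it’s that the TTL forces a conversation about how stale flags are allowed to be, which the lru_cache version silently skipped.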


Why senior engineers often feel slower, not faster

LLMs are incredible at accelerating typing.

Senior engineers aren’t paid for typing.

They’re paid for holding invariants in their head:

  • What must never happen?
  • What assumptions are safe?
  • What failure modes are acceptable?
  • What tradeoffs are intentional?
  • What constraints exist because of compliance, scale, latency, cost, or historical scars?

LLMs don’t carry those constraints unless you inject them, and even then they may “smooth them out” into something that merely sounds aligned.

So you end up doing a different kind of work:

  • more reading
  • more auditing
  • more adversarial thinking
  • more explaining to the team why “it looks fine” isn’t a proof

That can feel like friction. It’s not friction; it’s the real work surfacing.

If you’re the one responsible for production, you’re not just debugging code anymore. You’re debugging reasoning.


The moment it clicked for me

My turning point wasn’t a clever log query. It was a diff.

We were reviewing an AI-assisted change that touched a deployment pipeline. The change was small, almost boring: “make rollouts safer by adding retries and backoff.”

The logic was clean. Tests passed. The model even added comments.

But one line bothered me:

# pseudo-ish CI config
retry:
  max: 5
  when: always

“when: always” is where outages go to multiply.

That retry policy didn’t distinguish between:

  • transient network blips (reasonable to retry)
  • deterministic failures (bad config, auth, schema mismatch)
  • safety failures (policy checks, approvals, drift detection)

By treating all failures as retryable, the pipeline would keep hammering dependencies and hide the true error behind noise.

We weren’t debugging a failing pipeline. We were debugging an assumption:

“Retries are always safer.”

They’re not. They’re safer only when you understand the failure mode.
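That distinction can be encoded directly. Here is a hedged sketch in Python: the TransientError/DeterministicError taxonomy is invented for illustration, but the shape (retry only what can plausibly succeed, fail fast on everything else) is the actual invariant the pipeline needed.

```python
import time

# Hypothetical failure taxonomy; the names are illustrative, not a real library.
class TransientError(Exception):
    """Network blips, 429/503s: reasonable to retry with backoff."""

class DeterministicError(Exception):
    """Bad config, auth failure, schema mismatch: retrying cannot help."""

def run_with_retries(step, max_attempts: int = 5, base_delay: float = 0.5):
    """Retry transient failures with exponential backoff; surface
    deterministic failures immediately instead of hiding them in noise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except DeterministicError:
            raise  # fail fast: the fifth retry will fail the same way
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # backoff before retrying
```

Contrast with `when: always`: here a bad credential fails the pipeline once, loudly, instead of hammering the dependency five times and burying the real error.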

That’s when I started thinking about upstream debugging as a first-class discipline.


A new debugging mindset: design-time observability for assumptions

If LLMs move bugs upstream, the fix isn’t “don’t use LLMs.”

The fix is to make assumptions visible earlier, before runtime.

Here’s what that looks like in practice for platform/DevOps teams.

1) Write invariants down (yes, literally)

Before you accept AI-generated logic, force the invariants into text.

A lightweight template I use in PR descriptions:

  • Safety invariant: What must never happen?
  • Correctness invariant: What must always be true?
  • Operational invariant: What must remain observable and debuggable?
  • Security invariant: What access boundaries must not expand?

If the author (or reviewer) can’t articulate these, you don’t have enough clarity to trust generated code.

2) Turn invariants into executable checks

If an invariant matters, encode it.

For IaC and platform work, that often means policy-as-code and assertions.

Terraform plan guardrails (OPA/Rego style pseudo-example)

# rego (conceptual)
deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "azurerm_role_assignment"
  rc.change.after.role_definition_name == "Owner"
  msg := "Disallow Owner role assignment by default"
}

The goal isn’t perfect policy coverage. The goal is to reduce the surface area where semantic bugs can hide.

LLMs are great at generating “valid” Terraform that accidentally expands privileges. A deny rule like this catches it even when the plan looks normal.
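If you don’t have OPA in your pipeline yet, the same invariant can start life as a plain script over `terraform show -json` output. This sketch assumes the standard plan JSON shape (`resource_changes`, `type`, `change.after`); the function name is mine.

```python
def find_owner_assignments(plan: dict) -> list:
    """Scan a `terraform show -json` plan for Owner role assignments.

    Returns the addresses of offending resources so CI can fail loudly
    before apply, even when the plan otherwise looks normal.
    """
    violations = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "azurerm_role_assignment":
            continue
        after = (rc.get("change") or {}).get("after") or {}
        if after.get("role_definition_name") == "Owner":
            violations.append(rc.get("address", "<unknown>"))
    return violations
```

A twenty-line script like this is easy to promote into real policy-as-code later; the important step is that the invariant exists as something executable at all.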

3) Add “semantic tests,” not just unit tests

Unit tests validate logic against examples.

Semantic tests validate intent against invariants.

For platform code, semantic tests often look like:

  • idempotency checks
  • permission boundary checks
  • failure-mode tests (what happens when dependency returns 403/429/500?)
  • chaos-style tests in staging

Here’s a tiny example for idempotency:

def test_apply_is_idempotent(client):
    first = client.apply(desired_state())
    second = client.apply(desired_state())
    assert second.changes == [], "apply() should be idempotent"

LLMs frequently generate “apply/update” workflows that aren’t truly idempotent because they assume a clean state. This test forces the invariant.
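A failure-mode test from the same family, sketched with an entirely hypothetical client: when the flags dependency rate-limits us (a 429), the invariant is that we degrade to last-known flags rather than crash or serve an empty config.

```python
class RateLimitedError(Exception):
    """Stand-in for an HTTP 429 from the flags dependency (hypothetical)."""

class FlagClient:
    """Hypothetical client that keeps last-known flags when throttled."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._last_known: dict = {}

    def flags(self) -> dict:
        try:
            self._last_known = self._fetch()
        except RateLimitedError:
            pass  # serve stale-but-known flags instead of failing the request
        return self._last_known

def test_flags_survive_rate_limiting():
    responses = [{"safe_mode": True}, RateLimitedError()]

    def fetch():
        r = responses.pop(0)
        if isinstance(r, Exception):
            raise r
        return r

    client = FlagClient(fetch)
    assert client.flags() == {"safe_mode": True}  # first call succeeds
    assert client.flags() == {"safe_mode": True}  # 429: stale flags, no crash
```

Whether “serve stale” is actually the right behavior is a design decision; the test’s job is to make that decision explicit instead of accidental.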

4) Make provenance and review more explicit

“AI wrote it” isn’t a moral failing. But it is risk information.

Two practical PR hygiene moves:

  • Require authors to add a short “Generated/Assisted” note plus what they manually validated.
  • Add a review checklist item: “What invariants did you verify?”

This prevents the most common failure mode I see right now: engineers trusting plausibility as correctness.

5) Instrument the decision points, not just the outcomes

If semantic bugs hide in intent, instrument places where intent becomes action:

  • feature-flag evaluation and caching behavior
  • policy decisions (authz allow/deny, admission decisions)
  • rollout gates (why did we proceed?)
  • reconciliation loops (why did we mutate state?)

In practice that means logs/traces like:

  • decision=allow reason=policy_rule_12
  • cache_hit=true ttl_remaining=...
  • rollback_triggered_by=...

Not because you love logs. Because you want the system to explain itself when assumptions are wrong.
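As a concrete sketch of a rollout gate that explains itself: the function and threshold below are assumptions for illustration, but the shape (log the decision, the reason, and the inputs at the moment intent becomes action) is the point.

```python
import logging

logger = logging.getLogger("rollout")

def should_proceed(canary_error_rate: float, threshold: float = 0.01) -> bool:
    """Hypothetical rollout gate that records *why* it decided,
    not just whether the rollout continued."""
    decision = canary_error_rate <= threshold
    logger.info(
        "decision=%s reason=canary_error_rate rate=%.4f threshold=%.4f",
        "proceed" if decision else "halt",
        canary_error_rate,
        threshold,
    )
    return decision
```

Six months later, during the retro, that one log line answers “why did we proceed?” without anyone reverse-engineering the gate’s logic from the code.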


Validation: how you know upstream debugging is working

You won’t get a magical “AI bug rate” metric.

But you can watch for healthier signals:

  • Fewer “noisy” incidents caused by retry storms and thundering herds
  • Shorter time-to-understand during retros (clearer causal chains)
  • More failures caught in PR/staging via invariants/policy checks
  • Less “works on my machine” drift because assumptions are encoded

The biggest qualitative signal is cultural:

Reviews shift from “does this code look right?” to “what must be true for this to be safe?”

That’s upstream debugging.


Tradeoffs and alternatives considered

You can respond to AI-driven semantic bugs in a few ways:

Approach A: Ban AI-generated code

Pros: reduces one source of wrong assumptions
Cons: unrealistic, loses productivity, pushes AI usage underground

Approach B: Trust but verify (lightweight)

Add PR checklists, insist on invariants, increase review rigor.
Pros: cheap, immediate
Cons: depends on discipline; can degrade into performative process

Approach C: Encode invariants as policy/tests (preferred)

Treat invariants as artifacts: tests, policy-as-code, staged rollouts, observability.
Pros: scalable, automatable, catches silent failures
Cons: upfront cost, requires clarity and maintenance

For teams operating production platforms, Approach C is the only one that scales. The cost is real, but it’s paid once and amortized over every future change (human or AI-assisted).


Production hardening: edge cases you can’t ignore

Semantic bugs love edge cases because edge cases are where your hidden constraints live.

A few I’d explicitly design for in AI-assisted changes:

  • Rate limits and retries: avoid “retry always,” distinguish transient vs deterministic failures.
  • Consistency vs caching: cache only when you can define staleness tolerance.
  • AuthZ drift: enforce least privilege with policy checks on plans and manifests.
  • Rollback semantics: ensure rollbacks undo effects, not just deploy previous code.
  • Multi-tenant invariants: validate that tenant boundaries exist in code paths and cache keys.
  • Observability budgets: don’t add “smart” logic without adding “explainable” telemetry.

The pattern is consistent: if the model didn’t know your constraint, it probably didn’t encode it.


Lessons learned (mental models that stuck)

  1. Plausibility is not proof.
    AI makes code look finished. Finished is not correct.

  2. Debugging is now a design activity.
    If you can’t state your invariants, you can’t debug your future incident.

  3. Semantic bugs require semantic tests.
    Test intent: idempotency, boundaries, failure modes, not just happy paths.

  4. Policy is a force multiplier.
    Humans miss subtle risk expansions. Policy catches them consistently.

  5. Senior engineering is constraint management.
    AI helps with synthesis; you’re still responsible for the shape of reality.


Closing reflection

I don’t think LLMs are making engineers careless on purpose.

I think they’re making it easier to skip the hardest part of engineering: reasoning through invariants under constraints.

When the model gives you a clean solution in five seconds, it feels wasteful to spend an hour interrogating it. But that hour is where safety lives.

The practical takeaway isn’t “slow down.”

It’s: move your debugging earlier. Make assumptions explicit. Encode invariants. Instrument decision points. Treat intent as something you can test.

If you’ve shipped an “AI hallucination bug”, something that looked right until it didn’t, I’d genuinely like to hear what form it took. Was it permissions? retries? caching? infra drift? something weirder?


Final takeaways

  • LLMs didn’t eliminate debugging; they shifted it upstream, from runtime failures to design-time assumptions.
  • The new enemy is the semantic bug: correct syntax, wrong intent.
  • Senior engineers feel slower because they’re debugging reasoning, constraints, and invariants, not just code.
  • The fix is executable intent: invariants → tests/policy → instrumentation.
  • Trust AI for acceleration, not correctness. Treat plausibility as a starting point, not a conclusion.
