platform-notes

March 14, 2026 • 7 min read

The Version Pinning Illusion: Why Your IaC Still Moves Underneath You

Provider locks help, but they do not freeze cloud control planes. Why pinned IaC still drifts, and how to build safer contracts around it.

It usually shows up as an argument. One engineer says, “That cannot be the provider. We pinned it.” Another says, “Then why did production behavior change with an empty plan?”

That was the moment I stopped treating version pinning as a control and started treating it as a partial mitigation. We had done the responsible things: locked provider versions, committed the lock file, pinned shared modules, reviewed the plan. The deploy still ended with a service behaving differently from the last run. The hard part is that nobody is exactly wrong. Pinning does reduce change. It just does not reduce the kind of change most teams quietly assume it does.

Why this matters now

Infrastructure teams have become more disciplined about supply chain control. We pin providers. We verify module sources. We run CI on every infrastructure change. That is all good practice. The mistake is treating those controls as if they make infrastructure execution immutable.

They do not. Your IaC tool is still a client talking to a live control plane. Cloud APIs evolve. Default values change. Backend services reinterpret fields. A resource that looked stable six months ago can start behaving differently even when your code, module version, and provider version have not moved.

That is why “but the version was pinned” keeps showing up in postmortems. Teams pinned the translator and assumed they had pinned the contract.

What pinning actually gives you

Version pinning is still worth doing. It gives you reproducibility in a narrow but important layer:

  • The same provider binary parses and renders your configuration.
  • The same dependency graph is used in CI and local runs.
  • The same module source is evaluated when the plan is produced.

That matters. It removes a class of accidental drift caused by toolchain mismatch. What it does not give you is a frozen provider backend, frozen cloud defaults, or frozen service semantics. If your configuration relies on those, you are still exposed.
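To make that narrow layer as reproducible as possible, the pin itself should be exact. A minimal sketch (the version numbers here are illustrative, not a recommendation):

terraform {
  required_version = "~> 1.9"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.70.0" # exact pin, recorded in .terraform.lock.hcl
    }
  }
}

Commit .terraform.lock.hcl alongside this so CI and local runs resolve the same checksums. That is the full extent of what the pin covers: the binary, not the backend it talks to.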

Where the illusion breaks

The failure mode usually starts with an implicit dependency that nobody wrote down. Maybe an Azure API starts returning a new default. Maybe AWS tightens validation on a field that used to be tolerated. Maybe a managed service changes how it derives a setting when you omit it. The provider has not changed, but the response it receives from the platform has.

That creates a nasty gap:

flowchart TD
  A[IaC code] --> B[Pinned provider]
  B --> C[Cloud API]
  C --> D[Service defaults]
  D --> E[Live behavior]
  B --> F[State file]
  F -. misses .-> E

The state file records what the tool believes it created or last observed. It is not a full behavioral contract for the cloud service. When runtime behavior changes underneath an omitted field or computed value, your state can remain quiet while your service gets louder.

The kind of incident this creates

I have seen this most often in shared platform modules where teams trust defaults because they want the module surface to stay small. The sequence is usually boring right up to the point it is expensive:

  • We assume the provider default is stable enough.
  • We omit the field to keep the module ergonomic.
  • We review the plan and see no obvious risk.
  • We deploy and watch a downstream symptom appear first: more 502s, new timeouts, pod scheduling changes, missing routes, a firewall behavior shift.

Then the debugging gets ugly, because nothing in the Git diff explains the symptom cleanly. The turning point is usually a concrete signal outside Terraform or OpenTofu. A request latency jump. A load balancer metric. A control-plane audit log showing a field value that no one ever set explicitly. That is when you realize the system changed at a layer your version pin never covered.

The dead ends teams try first

The first reaction is usually to pin harder. Teams tighten provider constraints from broad ranges to exact versions. They mirror plugins internally. They freeze module tags more aggressively. Those are sensible controls, but they still miss the main issue if the risky behavior lives in defaults, computed attributes, or remote API interpretation.

The second dead end is trusting plan as if it were an oracle. A plan is a projection based on what the provider can infer at that moment. It is not an exhaustive proof that runtime behavior will match what you observed last month.

The third dead end is blaming drift on operator mistakes alone. Sometimes drift is operational sloppiness. Sometimes it is simply the result of building on APIs that keep moving.

The useful reframing

The reframing that helped my teams was simple: IaC is not a snapshot of infrastructure. It is a client for a changing API. Once you accept that, the engineering response changes. You stop asking, “Did we pin the provider?” as if that settles the risk.

You start asking, “Which parts of this resource contract are explicit, which parts are computed, and which parts are silently delegated to the cloud?”

That is the question that actually predicts operational surprises.

What to change in practice

The first move is to reduce implicit behavior in critical paths. If a setting affects availability, routing, identity, encryption, retention, scaling, or network exposure, write it down explicitly even if the provider can infer it.

A small example:

resource "aws_lb_target_group" "api" {
  name        = "api-prod"
  port        = 443
  protocol    = "HTTPS"
  target_type = "ip"
  vpc_id      = var.vpc_id # required for ip targets; assumed to be declared elsewhere

  health_check {
    path                = "/healthz"
    matcher             = "200-399"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

This is more verbose than leaning on defaults. That is the point. Verbosity buys you a real contract.

The second move is to treat direct API observation as part of validation. After apply, query the live resource attributes that matter most and compare them against intended values. This can be a lightweight smoke script, a policy check, or an integration test in a non-prod environment. The key is that it reads from the cloud service, not only from IaC state.
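One lightweight way to express this inside Terraform itself is a check block (Terraform 1.5+), which re-reads the live resource through a scoped data source and asserts on what the control plane actually reports. This sketch assumes the aws_lb_target_group data source exposes the health_check attributes the same way the resource does; names are illustrative:

check "api_target_group_contract" {
  data "aws_lb_target_group" "live" {
    name = "api-prod"
  }

  assert {
    condition     = data.aws_lb_target_group.live.health_check[0].path == "/healthz"
    error_message = "Live target group health check path drifted from /healthz."
  }
}

The important property is the read path: the assertion is evaluated against the cloud API response, not against the value the state file remembers.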

The third move is to design modules around explicit inputs for high-risk behavior. A “simple” module that hides too many decisions behind omitted fields is easy to consume and hard to trust.
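One concrete way to force that decision onto callers is to declare the risky input with no default. A sketch:

variable "deletion_protection" {
  type        = bool
  description = "Must be set explicitly; there is deliberately no default."
}

A caller who omits it gets a plan-time error instead of silently inheriting whatever the provider or platform currently assumes. The noise is the feature: the module surface now names the decision.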

Operational hardening

Once you stop pretending the control plane is static, a few operational practices become non-optional.

Rollback needs to include more than reverting the IaC commit. If the remote service behavior changed, rolling back code may not restore the old runtime semantics. You need a tested fallback such as an alternate listener, a previous service SKU, a safer default policy set, or a manual override runbook.

Observability has to sit close to the infrastructure edge. For networking and identity resources, watch the signals that reveal semantic change quickly: rejection rates, auth failures, route mismatches, target health, and API audit events.

Security needs the same reframing. Pinning versions helps with supply chain integrity, but it does not guarantee that the remote platform still enforces the same meaning. If a compliance control depends on a default, it is not really a control you own.

Developer experience is the trade-off teams usually resist. Explicit contracts make modules noisier. They add more inputs, more documentation, and more review overhead. That cost is real. It is still cheaper than explaining a production incident whose root cause was “the cloud default moved and we never modeled it.”

What I would standardize on a platform team

If I were tightening this across a platform estate, I would standardize four things:

  1. Exact provider and module pinning, because toolchain consistency still matters.
  2. A review rule that any critical runtime behavior must be explicit in code.
  3. Post-apply verification for high-blast-radius resources using live API reads.
  4. Drift discussions framed around contracts and defaults, not only around state files.

None of this makes cloud infrastructure immutable. That is not an achievable goal on managed platforms. What it does is shrink the gap between what your code says, what your tool thinks, and what the service actually does.

Closing reflection

The phrase “immutable infrastructure” trained a lot of us to expect more determinism than cloud control planes can honestly provide.

That does not mean IaC is unreliable. It means the abstraction has edges, and those edges matter most in the exact places platform teams care about: safety, repeatability, and incident reduction.

Pin versions. Keep doing it.

Just stop confusing version pinning with behavioral immutability.

The useful question for your next module review is not “did we lock the provider?” It is “what are we still leaving up to a remote system we do not version ourselves?”

Final takeaways

  • Pinning providers and modules reduces toolchain drift, not cloud API drift.
  • Any critical behavior left to defaults is behavior you do not fully control.
  • plan is necessary, but it is not proof that runtime semantics stayed stable.
  • High-risk resources need explicit configuration plus live post-apply verification.
