platform-notes

March 16, 2026 • 5 min read

Ephemeral Kubernetes Clusters Are for Ephemeral Problems

Per-PR Kubernetes clusters feel safe until control plane sprawl, slow startup times, and weak platform boundaries turn them into an expensive detour.

We’ve all seen the Jira ticket: “CI/CD pipeline failing due to API server timeouts.” You check the dashboard, and your cluster count has tripled in six hours. It’s the “Cluster-per-PR” fever, a trend that promises developer autonomy but often delivers a quiet, catastrophic burn on your cloud bill. We are spinning up entire Kubernetes control planes just to run a handful of integration tests, treating a sophisticated orchestration platform like a glorified, disposable shell script.

The Rise of the Ephemeral Fever

The push toward ephemeral clusters, often facilitated by tools like vCluster or simple Terraform-driven automation, came from a place of genuine frustration. We wanted to escape the “shared cluster” hell of the mid-2010s, where one stray kubectl delete by a junior engineer could take down the entire dev environment.

The promise was simple: give every PR its own clean-room environment. If a developer breaks something, they only break their own cluster. We traded stability for isolation, but we forgot to account for the overhead.

The Hidden Tax of Sprawl

I remember a specific week at a previous gig where our infrastructure costs jumped 40% without a single new customer onboarding. We were running 200+ ephemeral clusters for 40 engineers. Each one had its own API server, controller manager, and scheduler. That’s hundreds of idling control plane components, dozens of Load Balancer instances, and a distributed observability nightmare where logs were fragmented across hundreds of disparate sources.

We weren’t optimizing for speed. We were compensating for poor isolation, masking our inability to build a robust, multi-tenant platform by throwing a fresh control plane at every pull request.

The Turning Point

The signal came not from a cost report, but from a developer experience survey. While developers loved the “safety” of their ephemeral clusters, they hated the startup time. Our “ephemeral” workflow was taking 12 minutes to provision the cluster and deploy the application. We had built a “clean” environment that was so slow it actually reduced the frequency of testing. That was the moment we realized: we were solving the wrong problem.

Namespace-as-a-Service: The Better Approach

Instead of chasing the “one cluster per human” dream, we pivoted to a high-density, multi-tenant model. We stopped asking “How do we isolate the environment?” and started asking “How do we isolate the workloads within a shared, high-availability control plane?”

This meant investing in:

  1. Hard multi-tenancy: Using Hierarchical Namespaces (HNC) and strictly enforced RBAC.
  2. Resource fairness: Enforcing ResourceQuotas and LimitRanges (with sane VPA/HPA defaults on top) to prevent “noisy neighbor” scenarios.
  3. Network policy: Using Cilium to enforce zero-trust traffic patterns between namespaces.
```mermaid
flowchart LR
  A[PR opened] --> B{Old path}
  B --> C[Create cluster]
  C --> D[Install addons]
  D --> E[Deploy app]
  A --> F{Better path}
  F --> G[Create namespace]
  G --> H[Apply policy pack]
  H --> I[Deploy app]
```
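The “policy pack” step in the better path can be sketched in code. This is a minimal illustration, not our production controller: the quota values and policy names are placeholder examples, and in practice these objects would be applied via the Kubernetes API or a GitOps pipeline.

```python
# Sketch of a per-PR "policy pack": the baseline objects a controller might
# apply to a freshly created namespace. All names and limits are illustrative.

def policy_pack(namespace: str) -> list[dict]:
    """Return the baseline manifests applied to every PR namespace."""
    # Cap how much of the shared cluster any single PR environment can consume.
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "pr-quota", "namespace": namespace},
        "spec": {
            "hard": {"requests.cpu": "2", "requests.memory": "4Gi", "pods": "20"},
        },
    }
    # Default-deny ingress: nothing reaches the workloads except traffic
    # explicitly allowed by later policies (or their Cilium equivalents).
    deny_all = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }
    return [quota, deny_all]
```

The point is that isolation becomes a handful of declarative objects per namespace instead of an entire control plane per PR.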

Implementation Walkthrough

The transition isn’t just about deleting clusters. It’s about building a robust platform that provides the illusion of a dedicated environment without the underlying waste.

We started by automating namespace creation. When a PR is opened, our controller doesn’t call Terraform to spin up a new Kubernetes cluster; it calls the Kubernetes API to create a namespace. It then injects a set of predefined network policies, resource quotas, and role bindings. The application is deployed via Helm, and we use ExternalDNS to generate a unique subdomain on the fly.
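The naming step of that flow is simple but easy to get wrong, because namespace names must be valid lowercase RFC 1123 labels. Here is a hedged sketch; `dev.example.com` is a placeholder base domain, and the real record would be published by ExternalDNS:

```python
import re

def pr_environment(repo: str, pr_number: int,
                   base_domain: str = "dev.example.com") -> tuple[str, str]:
    """Derive the namespace and preview hostname for a pull request.

    Illustrative only: base_domain is a placeholder, and the truncation
    length is chosen to stay well under the 63-char label limit.
    """
    # Namespace names must be lowercase alphanumerics and hyphens.
    slug = re.sub(r"[^a-z0-9-]", "-", repo.lower())[:40].strip("-")
    namespace = f"pr-{slug}-{pr_number}"
    host = f"{namespace}.{base_domain}"
    return namespace, host
```

A controller calls this once per PR event, creates the namespace, applies the policy pack, and hands the hostname back to the developer in a PR comment.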

The result? The same “isolated” experience for the developer, but the platform remains lean. We went from 12-minute provisioning times to under 30 seconds.

Dealing with the Hardening

Multi-tenancy in Kubernetes is famously complex. The biggest hurdle is the “blast radius” issue. If a developer deploys a custom resource definition (CRD) that conflicts with another, a shared control plane will feel the pain.

To mitigate this, we had to become more disciplined about cluster add-ons. We moved away from “install whatever you want” and toward a short, published list of supported add-ons. By standardizing the platform stack, we eliminated the majority of cross-tenant conflicts.
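That allowlist discipline is trivial to enforce mechanically. A minimal sketch, assuming the platform team publishes a fixed catalogue (the names below are examples, not a real supported set):

```python
# Example allowlist of platform-blessed add-ons; the actual catalogue
# would be published and versioned by the platform team.
SUPPORTED_ADDONS = {
    "ingress-nginx", "cert-manager", "external-dns", "prometheus", "cilium",
}

def unsupported_addons(requested: set[str]) -> set[str]:
    """Return requested add-ons that are not supported (empty set means OK)."""
    return requested - SUPPORTED_ADDONS
```

Running this check in CI before any deploy to the shared cluster turns “please don’t install a conflicting CRD” from a plea into a gate.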

The Real Cost of “Cheap” Infrastructure

The community often argues that ephemeral clusters are “safer” because they eliminate config drift. That’s true as far as it goes, but config drift is a CI/CD hygiene problem, not an architecture problem. If your shared cluster is so “dirty” that you can’t run tests on it, your real problem is a lack of automation in your configuration lifecycle.

Don’t let the ease of spinning up a cluster blind you to the cost of maintaining its lifecycle, security patches, and observability.

Reflection

If you are a small team with a handful of microservices, by all means, spin up as many clusters as you want. Velocity is your primary currency. But if you are managing a platform for a growing engineering organization, stop treating Kubernetes clusters like temporary playthings.

Real platform engineering isn’t about giving developers a sandbox; it’s about giving them a reliable, performant, and shared engine that handles the complexity so they don’t have to.

Key Takeaways

  • Ephemeral clusters often mask a lack of investment in robust namespace multi-tenancy.
  • Cluster sprawl creates hidden observability and security overhead that scales linearly with your team size.
  • Namespace-as-a-Service offers the same developer experience as per-PR clusters with significantly lower compute and management costs.
  • Standardizing your platform stack is a prerequisite for moving away from cluster-per-PR models.

How are you handling PR environments in your current stack? Are you feeling the weight of the cluster tax yet?
