March 11, 2026 • 11 min read
Beyond the CSI Driver: Re-architecting Secret Delivery So Pod Startup Survives Provider Trouble
CSI-mounted secrets are clean, but they turn secret retrieval into a startup dependency. Here is a more resilient secret delivery pattern for Kubernetes platforms.
The first sign that our secret management design was wrong was not a breach.
It was a routine scale-out event that should have been boring.
Traffic spiked, the HPA asked for more pods, and the new replicas sat in ContainerCreating while the old ones burned through their queue budget.
Nothing was wrong with the image. Nothing was wrong with the Deployment. The blocker was secret retrieval.
We had tied pod startup to a live round-trip through a cloud secret store, and that dependency chose the worst possible moment to remind us it was still a network call.
This is the trap with CSI-based secret delivery. It feels local because the application reads a file from the filesystem. Operationally, it is still remote if the file only exists after a provider call succeeds during mount.
That distinction matters more now because more teams depend on cloud secret stores for everything from database passwords to API signing keys, and more platforms use autoscaling as the first line of defense during incident load. If new pods cannot start while the provider is struggling, your secret architecture stops being purely a security choice and becomes an availability bottleneck.
```mermaid
flowchart LR
    A[Pod start] --> B[CSI mount]
    B --> C[Provider API]
    C -->|slow or throttled| D[Pending pod]
    E[ESO sync] --> F[Kubernetes Secret]
    F --> G[Pod start]
    E --> C
    C -->|last good value retained| F
```
When the problem became real
The incident was small enough that nobody outside the platform team noticed. That made it more useful.
One provider region was having intermittent auth and latency trouble. Not a full outage, just enough pain to make retries expensive. Our workloads that already had secrets mounted stayed healthy. New pods were the problem.
The signal that changed the conversation was simple: pod startup time split into two populations. Warm pods came up in the expected window. Pods that needed a fresh secret mount stalled long enough to miss rollout expectations and autoscaling targets.
We did not need a dramatic outage report to understand what that meant. During a larger event, the cluster would have spare CPU, spare memory, healthy scheduler capacity, and still fail to add useful replicas because secret delivery sat on the critical path.
That is not a secret management bug. That is an architecture bug.
Why the common pattern feels safer than it is
The default argument for CSI or just-in-time secret fetch is understandable.
You avoid copying secret values into Kubernetes Secret objects, you get provider-native audit trails, rotation has a clear source of truth, and developers still read a mounted file and move on.
All of that is real. The problem is where the dependency lands.
In the synchronous model, pod readiness depends on the node mounting the volume correctly, the identity path working, the provider API responding, throttling not kicking in, and the CSI plugin finishing its retry logic before Kubernetes gives up on startup.
That stack may be acceptable for low-frequency admin workloads. It is a much riskier bet for services that scale under load or recover by replacing pods quickly.
We initially assumed cache behavior inside the provider SDKs and driver layers would reduce that risk enough. That assumption did not hold in the scenarios we actually cared about. Startup still required a live success path, and “usually fast” is not the same thing as “safe during turbulence.”
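For concreteness, this is the shape of the synchronous pattern under discussion, sketched with the Secrets Store CSI driver and its AWS provider (resource names are illustrative, not our production config). The CSI volume mount at pod start is where the live provider call happens:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: payments-db
  namespace: payments
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "/prod/payments/db"
        objectType: "secretsmanager"
---
# Fragment of the pod spec: mounting this volume only succeeds
# after a live round-trip to the provider.
volumes:
  - name: secrets
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: payments-db
```

Every new replica repeats that round-trip before it can become ready, which is exactly the dependency this post argues against for scale-out workloads.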
The dead ends we tried first
We did the obvious things before changing the architecture.
First, we increased timeouts. That reduced some noisy failures and made the graphs look calmer, but it only changed the shape of the problem. Pods still waited on the provider, just for longer. That is not resilience. It is a slower failure mode.
Then we increased retries. That helped when the issue was a short transient blip. It made things worse when the provider was already stressed because every scale event multiplied the request volume. Retrying your dependency into a deeper rate limit is a familiar distributed systems mistake, and we made it here too.
We also discussed keeping a static fallback in Kubernetes Secret objects for emergencies.
That would have improved startup reliability, but it created a second distribution path, a second rotation path, and a real risk that the fallback would become the de facto production path because nobody would want to remove it later.
None of those moves addressed the core problem. We were still treating secret retrieval as a startup activity instead of a background maintenance activity.
The turning point
The breakthrough was noticing that we did not actually need fresh provider connectivity at pod start. What we needed was a locally readable, recently synced, policy-checked secret value at pod start.
Those are not the same requirement.
Once we separated them, the design became clearer. Pod startup should depend on an on-cluster artifact that is already present. Provider communication should happen off the startup path in a controlled sync loop with rate limiting, backoff, metrics, and a last-known-good state.
That changed the design question from “how do we make CSI more reliable?” to “what is the safest place to hold the current usable secret state between provider refreshes?”
For many teams, External Secrets Operator is the simplest answer because it syncs provider values into a Kubernetes Secret before the workload starts.
For us, the answer was a pod-local cache maintained by an agent because we wanted the application to read from a file contract without keeping the live provider call on the mount path.
The core idea is the same in both models: provider access moves into background reconciliation instead of runtime startup.
The architecture we moved toward
The new pattern is not exotic. It is just more honest about dependencies.
There are two practical ways to do this.
One is to run an agent that keeps a local cache for the workload.
The other is to use External Secrets Operator to keep a Kubernetes Secret current in the background.
In both cases, something inside the cluster authenticates to the external secret store and refreshes values on a schedule or watch loop. That component fetches and validates the secret, writes the current version into an on-cluster location, and keeps serving that state to the application at startup and on reload. If refresh fails, the last known good value remains available while the system raises alerts when freshness crosses a threshold.
The application no longer waits for the provider.
It waits for an on-cluster dependency such as a local file or a Kubernetes Secret, which is the kind of dependency Kubernetes is much better at handling.
This does introduce eventual consistency. That is the trade. You are deliberately accepting bounded staleness in exchange for removing an external API from the startup path.
For most production systems, that is a better trade than pretending every secret must be fetched live or not used at all.
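The reconciliation half of that trade fits in a few lines. This is an illustrative Python sketch, not any vendor's agent: `fetch` stands in for the provider call, and the cache retains the last known good value when a refresh fails.

```python
import time
from typing import Callable, Optional


class SecretCache:
    """Background-refresh cache: reads never block on the provider."""

    def __init__(self, fetch: Callable[[], str]):
        self._fetch = fetch  # provider call, e.g. a Secrets Manager read
        self._value: Optional[str] = None
        self._last_good: Optional[float] = None
        self._last_attempt: Optional[float] = None

    def refresh(self, now: Optional[float] = None) -> bool:
        """Called by the sync loop, never by the request or startup path."""
        now = time.time() if now is None else now
        self._last_attempt = now
        try:
            self._value = self._fetch()
            self._last_good = now
            return True
        except Exception:
            # Keep serving the last known good value; alerting keys off age.
            return False

    def read(self) -> Optional[str]:
        """Local read only: last known good, or None if never synced."""
        return self._value

    def age_seconds(self, now: Optional[float] = None) -> Optional[float]:
        """Freshness signal for health checks and alerts."""
        now = time.time() if now is None else now
        return None if self._last_good is None else now - self._last_good
```

A real agent would wrap `refresh` in a rate-limited, jittered backoff loop and export `age_seconds` as a gauge; the essential property is that `read` never touches the network.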
Practical implementation details
There are a few ways to implement this pattern, but the important characteristics are the same regardless of vendor.
If you are already comfortable with Kubernetes Secret objects, External Secrets Operator is usually the cleanest fix for the exact failure mode in this post.
It does not make pods call the provider directly.
It reconciles in the background and writes the result into a Kubernetes Secret, so pod startup depends on the cluster state that already exists instead of on a live provider round-trip.
A minimal ESO example looks like this:
```yaml
apiVersion: external-secrets.io/v1
kind: SecretStore
metadata:
  name: aws-secretsmanager
  namespace: payments
spec:
  provider:
    aws:
      service: SecretsManager
      region: ap-southeast-2
      auth:
        jwt:
          serviceAccountRef:
            name: eso-sa
---
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: payments-db
  namespace: payments
spec:
  refreshPolicy: Periodic
  refreshInterval: 15m
  secretStoreRef:
    name: aws-secretsmanager
    kind: SecretStore
  target:
    name: payments-db-secret
    creationPolicy: Owner
    deletionPolicy: Retain
  data:
    - secretKey: username
      remoteRef:
        key: /prod/payments/db
        property: username
    - secretKey: password
      remoteRef:
        key: /prod/payments/db
        property: password
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: app
          image: ghcr.io/example/payments-api:1.0.0
          env:
            - name: DB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: payments-db-secret
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: payments-db-secret
                  key: password
```
In that model, the important trade-off is clear. If the provider has a bad fifteen minutes but the last sync succeeded, new pods can still start. If the secret has never been synced or a rotation must be immediate, the operator path still has limits you need to design around.
The cache needs ownership semantics. If it is shared between containers in a pod, use a writable in-memory volume or a tightly controlled filesystem path with explicit UID and GID expectations. If it is node-local, you need to think harder about tenant isolation and cleanup on pod reschedule.
The cache also needs metadata, not just the secret value. At minimum we tracked the secret version or provider version marker, the last successful sync timestamp, the last attempted sync timestamp, and refresh status. That let the application expose useful health signals without logging secret material.
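That metadata can live beside the cached value as a small JSON file that the application and its probes read without ever touching secret material. A hedged sketch; the field names are ours, not a standard:

```python
import json
from pathlib import Path
from typing import Optional


def record_sync(meta_path: Path, version: Optional[str], now: float) -> None:
    """Write sync metadata beside the secret; version=None marks a failed attempt."""
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    meta["last_attempted_sync"] = now
    if version is not None:
        meta["secret_version"] = version
        meta["last_successful_sync"] = now
        meta["refresh_status"] = "ok"
    else:
        # Failed refresh: keep the last good version marker and timestamp.
        meta["refresh_status"] = "failed"
    meta_path.write_text(json.dumps(meta))


def freshness(meta_path: Path, budget_seconds: float, now: float) -> str:
    """Health signal derived from metadata alone: 'missing', 'ok', or 'stale'."""
    if not meta_path.exists():
        return "missing"
    last_good = json.loads(meta_path.read_text()).get("last_successful_sync")
    if last_good is None:
        return "missing"
    return "ok" if now - last_good <= budget_seconds else "stale"
```

Note that a failed refresh degrades `refresh_status` without erasing the last good version marker, which is what makes the stale-but-valid startup population observable.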
A minimal example of the consumer side looked like this:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  volumes:
    - name: secret-cache
      emptyDir:
        medium: Memory
  containers:
    - name: app
      image: ghcr.io/example/payments-api:1.0.0
      volumeMounts:
        - name: secret-cache
          mountPath: /var/run/secrets-cache
          readOnly: true
      env:
        - name: DB_PASSWORD_FILE
          value: /var/run/secrets-cache/db-password
    - name: secret-sync
      image: ghcr.io/example/secret-sync:1.0.0
      volumeMounts:
        - name: secret-cache
          mountPath: /var/run/secrets-cache
```
The application contract is intentionally simple. Read a file. If the file is missing or older than the allowed freshness budget, expose that through health and metrics. Do not make the app reach directly back into the provider as a hidden fallback, or you quietly reintroduce the same failure mode.
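The consumer side of that contract fits in one function. A sketch, assuming the `DB_PASSWORD_FILE` convention from the Pod spec and file mtime as the freshness marker:

```python
import os
from typing import Optional, Tuple


def read_secret_file(path: str, max_age_seconds: float, now: float) -> Tuple[Optional[str], str]:
    """Read the cached secret and classify freshness for health reporting.

    Returns (value, status). A stale file still yields its value: serving
    last known good beats refusing to start. Never fall back to a direct
    provider call here, or the startup dependency quietly returns.
    """
    try:
        with open(path, "r") as f:
            value = f.read().strip()
    except FileNotFoundError:
        return None, "missing"
    age = now - os.path.getmtime(path)
    return value, "ok" if age <= max_age_seconds else "stale"
```

The `status` string feeds health endpoints and metrics; only `missing` should ever block readiness, and only for workloads where starting without the secret is meaningless.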
What changed operationally
This architecture gave us better failure isolation, but it also created work we had to own.
Observability got sharper. Instead of only watching provider API errors, we started tracking cache age, sync lag, refresh success rate, and the count of pods starting with stale-but-valid secrets. Those are the signals that tell you whether the design is actually protecting the workload.
Rollback became easier in one dimension and harder in another. A bad application deploy no longer combined with provider latency into a compound failure because secret delivery was already local. But secret rotation mistakes could now persist in the cache until the next successful sync or an explicit rollback action. That meant rotation needed version pinning and a clear revert path, not just “update the provider and trust propagation.”
Testing also changed. We stopped validating secret delivery by running happy-path integration tests alone. We added scenarios where provider auth was denied, provider latency was injected, and refreshes returned malformed data. The important question became: does the workload keep starting from the last known good state, and do operators get a clean signal that freshness is degrading?
Security trade-offs that matter
Caching secrets locally makes some engineers uneasy for good reason. You are increasing the lifetime of decrypted material on the node or in the pod sandbox. If you implement this lazily, you can absolutely make your security posture worse.
The guardrails need to be explicit.
Keep the cache in memory where practical. Lock down file permissions so only the workload identity can read what it needs. Encrypt at rest if the cache can touch disk. Never log the value, the raw payload, or full provider responses. Treat the sync agent as privileged code with a smaller blast radius than the application, not as a convenience sidecar nobody reviews.
Most importantly, decide the freshness budget per secret class. A database password used for steady-state app traffic can usually tolerate a short bounded lag. A one-time credential used for a high-risk control-plane action may not. Not every secret belongs in the same delivery path.
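In the ESO model, that budget maps naturally to a per-secret `refreshInterval`. A hedged sketch of the split (resource names are illustrative):

```yaml
# Steady-state app credential: a 15-minute staleness bound is acceptable.
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: payments-db
spec:
  refreshInterval: 15m
  # secretStoreRef, target, and data as in the earlier example
---
# Higher-risk credential: tighter budget, and a candidate for staying
# on a direct retrieval path instead of this one.
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: release-signing-key
spec:
  refreshInterval: 1m
```

The interval is the explicit, reviewable statement of how stale each secret class is allowed to get, which beats an implicit budget buried in cache behavior.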
Alternatives I would still choose in some cases
I would not claim one model should replace every CSI deployment.
If a workload is low-scale, non-critical, and benefits more from direct provider semantics than from startup resilience, the simpler CSI design may be fine.
If you already have External Secrets Operator syncing into Kubernetes Secret objects, that often solves the exact startup-path issue more simply than inventing a custom cache layer.
If you need file-based contracts, tighter per-pod refresh control, or you want to avoid distributing plaintext through Kubernetes Secret objects, a local cache agent can still be the better answer.
The point is not that CSI is bad. The point is that a synchronous external dependency at pod startup is a reliability decision, whether you mean it to be or not.
If the service is part of your scale-out path, recovery path, or incident containment path, I would bias strongly toward a local-read architecture.
Lessons that lasted
The main lesson was embarrassingly basic: mounted files are not local if they only appear after a remote dependency succeeds.
The second lesson was that secret rotation and secret availability are different concerns. We had optimized heavily for rotation elegance and underweighted the runtime behavior of the application during dependency stress.
The third lesson was that last-known-good state is not always a compromise. Sometimes it is the control that turns an external dependency from a hard failure into an operational warning.
Closing reflection
Platform teams like clean abstractions, and “the app reads a file” is a very clean abstraction. What matters is whether the operational dependency graph matches the abstraction.
In our original design, it did not. The file looked local, but startup still depended on the provider, the network, identity plumbing, and driver behavior all being healthy in real time.
Re-architecting secret delivery was less about replacing a tool than about moving the dependency to the right side of the timeline. Fetch in the background. Start from local state. Alert on freshness. Make the degraded mode visible long before it becomes an outage.
If your platform currently needs the secret store to be healthy before new pods can even start, that is the discussion worth having. How much startup availability are you willing to trade for live retrieval purity, and is that trade actually intentional?
Final takeaways
- If pod startup requires a live secret provider call, secret management is part of your availability path.
- Timeouts and retries can soften transient issues, but they do not remove the startup dependency.
- A background sync plus local cache turns provider trouble into bounded staleness instead of blocked scaling.
- Local caching only works if you add strict permissions, staleness telemetry, rollback paths, and rotation discipline.