March 4, 2026 • 15 min read
Argo CD 3.3 Made GitOps Deletions Safer, But Only If You Stop Treating Delete as "Sync With Extra Steps"
Argo CD 3.3 introduces safer deletion primitives, but teams still need explicit teardown governance, approvals, and sequencing.
The first time I accidentally “deleted production” with GitOps, it did not look dramatic.
No red screens. No Kubernetes API storm. No heroic terminal session. Just a quiet Argo CD sync that went green while a namespace slowly emptied itself.
The postmortem title was boring too: “Unexpected pruning after repo change.”
The lesson was not.
Because most teams do not actually have a deletion workflow. They have a deployment workflow, and then they reuse it for destruction.
Argo CD 3.3 is the first release in a while that feels like it is trying to force that conversation. Not by yelling about security, but by shipping features and upgrade guidance that make you confront day two operations. Who gets to destroy. How destruction is sequenced. What happens when Argo manages itself. And how many of us have been calling delete “sync with extra steps” as if that made it safer.
This post is about what changed in 3.3, what it reveals about our habits, and the workflow I now wish I had before GitOps was allowed anywhere near production teardown.
The moment the problem became real
We were cleaning up “old stuff.”
A platform cleanup sprint. Remove a few unused apps, retire a couple of Helm charts, merge some manifests. Nothing scary. The pull request was reviewed. CI passed. Argo showed green.
Then a teammate asked why a StatefulSet’s PVCs were stuck in Terminating.
That is when we realized the change was not “cleanup.”
It was a deletion event, triggered by a merge, executed by a controller, applied to a live cluster, with all the same blast radius as kubectl delete.
The gap was governance, not tooling.
We had approvals for merging to main. We did not have approvals for destructive runtime actions. We had audits for who changed Git. We did not have an explicit policy for who is allowed to cause Argo CD to remove resources from a cluster.
We secured GitOps by locking down write access to Git, and then we forgot that deletion is still a runtime action. Git is an interface, not a safety mechanism.
Why deletion is a different class of risk
Deployments fail loudly. They page you. They create error budgets and customer pain in the obvious direction.
Deletions fail quietly. They succeed with a straight face. Sometimes they “partially succeed” in the worst possible way: the app is gone, but the data is half gone, and the finalizers are stuck.
Deletion also has a broader dependency graph. Deploy is usually additive. Delete is subtractive, and subtraction exposes every undocumented coupling you have:
- Shared CRDs that multiple apps rely on
- Namespaces that contain resources not owned by Argo
- Service accounts referenced by Jobs and hooks
- External systems that need a drain, a backup, a notification, a lease release
- RBAC and admission policies that block deletion in surprising ways
In GitOps, the trigger is deceptively simple. Remove a manifest from Git, or delete an Argo CD Application, and the controller does what it was designed to do. Reconcile reality to the desired state.
If you have pruning enabled, “desired state” includes “this thing should no longer exist.”
That is not a sync. That is a teardown.
System context and constraints (the reality I assume you are living in)
I am going to assume a fairly standard platform setup:
- Multi-tenant clusters with shared platform components
- Argo CD deployed in a central namespace, managing dozens to thousands of Applications
- A mix of Helm and Kustomize, plus some raw YAML
- Argo CD has elevated permissions (because it has to deploy things)
- Some teams run “Argo managing itself” so upgrades are Git-driven
- Security wants strong controls, engineering wants speed, and both want fewer incidents
If your environment is simpler, the principles still apply. If your environment is more complex, you already know the scary parts and this will feel uncomfortably familiar.
The first attempts and dead ends
We tried the usual moves.
We restricted who could merge to main. That helped, but it did not solve the problem because “merge access” is not “delete authority.” The same engineer who is trusted to ship a safe config change is not automatically trusted to tear down an environment.
We tried relying on “Argo shows diff” as a safety check. That sounds good until you realize what diffs do not show well:
- cascading effects of ownerReferences
- finalizer behavior and deletion propagation
- what gets pruned in which order
- what external systems will break when this disappears
We tried telling people “don’t prune certain resources.” That worked until it didn’t, because the guardrail was tribal knowledge. Tribal knowledge is a runtime dependency that is not versioned, not tested, and not monitored.
Then we had the classic misconception: “If it is in Git, it is safe.”
No. If it is in Git, it is reproducible. Safety is a workflow, plus explicit policy, plus a way to stop the controller when you are wrong.
The key insight: Git is the trigger, Argo is the actor
This was the mental shift that changed our process.
In GitOps, humans do not delete resources. They commit intent. The controller performs the act.
That sounds philosophical, but it is operationally important. Because it means you should model destructive events like you model privileged actions performed by automation:
- Explicit permissions
- Explicit approvals
- Explicit sequencing
- Observability and auditability
- Ability to pause, confirm, or abort
Argo CD 3.3 is pushing in that direction. Not because it suddenly became “security tooling,” but because the project is maturing into day two reality: deployments are table stakes, operations is where teams bleed.
Two concrete signals from the 3.3 release line reinforce that:
- Argo CD 3.3 introduces PreDelete hooks as a first-class lifecycle phase, letting you block deletion until required jobs succeed.
- The 3.3.0, 3.3.1, and 3.3.2 releases ship upgrade guidance that explicitly flags how risky the "Argo managing itself" pattern can be unless you apply the right sync options and apply modes.
That second point is not about deletion, but it is about the same underlying truth: Argo is an actor with power, and defaults can hurt you if you treat everything as “just another sync.”
What changed in 3.3 that actually matters for safer deletions
PreDelete hooks make teardown a sequence, not a single moment
Before 3.3, if you wanted safe teardown you stitched together scripts, manual checklists, or Kubernetes finalizers. It worked, but it was fragile and opaque.
PreDelete hooks turn deletion into a lifecycle phase inside Argo CD. You can define a Kubernetes resource, typically a Job, that must run and succeed before Argo proceeds with deleting the rest of the Application’s resources. If the hook fails, deletion is blocked.
That is the right primitive. Not because it solves governance, but because it gives you a native stop point.
Here is a minimal example of a PreDelete hook Job:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: predelete-backup
  annotations:
    argocd.argoproj.io/hook: PreDelete
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backup
          image: alpine:3.20
          command:
            - sh
            - -c
            - |
              echo "Replace this with a real backup, export, or drain step"
              exit 0
```
The point is not the script. The point is that deletion now has a gate you can see in Argo.
If you have stateful workloads, this is the difference between “hope someone took a backup” and “deletion does not proceed until the backup job says it did.”
It also forces you to think about what “safe to delete” means for your organization. That definition is different for stateless services, databases, queues, and shared platform components.
Prune confirmation is a surprisingly effective friction point
If you have pruning enabled, removing manifests from Git can delete resources from the cluster. That is powerful, and it is also a footgun.
Argo CD supports a confirmation mode for pruning via sync options. When enabled, a sync that would prune will pause and require an explicit confirmation step.
This is not perfect. It is still an operator action, and you can still click the button without thinking. But it creates friction in the exact moment you want it: when the controller is about to delete.
A common pattern is to enable prune confirmation on applications that represent production workloads, or on resource types that are high-risk to remove.
A minimal example:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod
  namespace: argocd
spec:
  project: prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  source:
    repoURL: https://github.com/example/platform-apps.git
    targetRevision: main
    path: apps/payments/prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - Prune=confirm
      - PrunePropagationPolicy=foreground
```
Two notes on this:
Prune confirmation is about human intent. Foreground propagation is about correctness. Foreground deletion blocks until dependents are gone, which can be safer when you want predictable teardown, but it can also make deletion slower and more likely to get stuck on finalizers.
You should choose the propagation mode deliberately. Background can be faster. Foreground can be more deterministic. Orphan can preserve dependents, which can be either a rescue tool or a data leak waiting to happen.
The key is that you do not want to discover your deletion semantics during an outage.
“Argo managing itself” upgrade guidance is a reminder that apply modes matter
Argo CD 3.3’s release notes and follow-up patch releases include guidance for the case where an Argo CD Application manages the Argo CD installation itself.
If you run that pattern, the upgrade path is not just "bump the version." You have to ensure the Argo CD Application that owns Argo CD is configured with the correct apply behavior. The GitHub release notes for 3.3.0, 3.3.1, and 3.3.2 call this out explicitly, including the need for server-side apply flags in the install commands and related sync options; the 3.3.2 notes also include documentation updates about a ClientSideApplyMigration setting.
This matters in a deletion post because it is the same category of failure: treating privileged controller actions as if they were normal sync operations.
Apply mode mismatches are not glamorous, but they are how you brick your own delivery system. And if your delivery system is the one enforcing deletion workflows, you have just created a failure mode where you cannot safely change policy or stop a destructive action.
If Argo is self-managed in your org, treat upgrades as production changes with explicit runbooks and rollback plans. If Argo is the thing that enforces governance, do not gamble with how it is upgraded.
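As an illustration, a self-managed setup can pin the apply behavior explicitly on the Application that owns Argo CD. This is a sketch under assumptions: the repo URL, path, and project name are placeholders, and you should verify against the 3.3.x release notes exactly which flags and sync options your install method requires.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: platform            # placeholder project name
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  source:
    repoURL: https://github.com/example/platform-apps.git  # placeholder repo
    targetRevision: main
    path: platform/argocd      # placeholder path to the Argo CD install manifests
  syncPolicy:
    syncOptions:
      # Make the apply mode explicit rather than inheriting a default
      # that an upgrade can invalidate.
      - ServerSideApply=true
```

The point is not this exact configuration; it is that the apply mode is a deliberate, reviewed decision rather than an accident of defaults.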
Implementation walkthrough: a safer deletion workflow that feels realistic
I do not think there is one magic config switch that makes deletions safe. It is a workflow. Here is what has worked for teams I have seen succeed at this, including my own scars.
Step 1: Classify Applications by deletion risk, not by ownership
Most orgs classify by team. For deletion, classify by blast radius:
- Stateless app, easy to recreate, no external dependencies
- Stateful app, data or external integration, requires a teardown sequence
- Shared platform component, dependencies unknown, deletion requires platform approval
That classification should map to Argo CD Projects and RBAC. Not as bureaucracy, but as a way to express intent in the platform.
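One way to express the highest-risk class in the platform itself is an AppProject that fences where apps can deploy and refuses to let them touch shared cluster-scoped resources. This is a hedged sketch: the project name, repo, and namespace are placeholders, not a prescription.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: stateful-prod          # placeholder: the "requires teardown sequence" class
  namespace: argocd
spec:
  description: Apps whose deletion requires an approved teardown sequence
  sourceRepos:
    - https://github.com/example/platform-apps.git  # placeholder repo
  destinations:
    - server: https://kubernetes.default.svc
      namespace: payments      # placeholder namespace
  # Apps in this project can never create or delete CRDs,
  # which keeps shared platform components out of their blast radius.
  clusterResourceBlacklist:
    - group: apiextensions.k8s.io
      kind: CustomResourceDefinition
```

The blacklist is doing governance work here: even a correct-looking manifest cannot pull shared infrastructure into an app's deletion scope.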
Step 2: Require explicit confirmation for pruning in production classes
For category 2 and 3, enable prune confirmation. Then treat the confirm action as a privileged action, not as a normal UI click.
If you want to go further, you can remove automated pruning entirely for production apps and make pruning a manually triggered operation. That trades speed for correctness. In my experience, it is often worth it for stateful systems.
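Delete authority can also be expressed directly in Argo CD RBAC, separate from sync authority. A minimal sketch, assuming placeholder role and group names in the standard argocd-rbac-cm ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly
  policy.csv: |
    # Developers can sync prod apps but cannot delete them.
    p, role:developer, applications, sync, prod/*, allow
    # Only the platform role carries delete authority for prod.
    p, role:platform-admin, applications, delete, prod/*, allow
    # "platform-team" is a placeholder SSO group mapping.
    g, platform-team, role:platform-admin
```

This does not by itself gate Git-driven pruning, but it makes "who may destroy" an explicit, reviewable policy rather than an emergent property of merge access.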
Step 3: Add PreDelete hooks for anything stateful, and make them fail fast when unsafe
A PreDelete hook should do one of two things:
- perform required safety actions (backup, export, drain)
- validate safety conditions (no active sessions, migration completed, retention policy set)
It should not be a dumping ground for “cleanup scripts.” Keep it small and observable. If it fails, deletion is blocked, which is exactly what you want.
This is where you also discover your operational edge cases. For example, if your PreDelete Job needs a service account, you need to ensure it exists during deletion. There are active issues and real-world reports where hook-related resources can behave unexpectedly if you assume everything from normal sync applies during deletion. Design for that by keeping the hook dependency surface area minimal.
Step 4: Put “delete intent” behind a separate Git path or separate repo
This is the part most teams skip, and it is the most effective governance move I know.
Do not make deletion the absence of a file. Make deletion a presence of a signal.
Instead of deleting apps/payments/prod, add apps/payments/prod/.delete-approved or a similar explicit marker.
Then use a small controller or policy mechanism to enforce that marker is required before Argo is allowed to prune.
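If you already run a policy engine, one sketch of that enforcement is an admission policy that denies deletion of an Application unless an explicit approval annotation is present. The example below assumes Kyverno and a hypothetical platform.example.com/delete-approved annotation; note it gates deletion of the Application object itself, and Git-driven pruning of individual resources still needs its own check.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-delete-approval
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-unapproved-application-delete
      match:
        any:
          - resources:
              kinds:
                - argoproj.io/v1alpha1/Application
              operations:
                - DELETE
      validate:
        message: "Deleting this Application requires an explicit delete-approved annotation."
        deny:
          conditions:
            any:
              # On DELETE, the object being removed is request.oldObject.
              - key: "{{ request.oldObject.metadata.annotations.\"platform.example.com/delete-approved\" || '' }}"
                operator: NotEquals
                value: "true"
```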
If you are allergic to custom controllers, you can still implement the intent separation structurally:
- one repo for desired state
- one repo for deletion approvals, owned by platform or security
- Argo reads both, and deletion requires both signals
This is not about control. It is about making destructive intent explicit and auditable.
When deletion is implicit, mistakes look like “cleanup.” When deletion is explicit, reviews read differently.
Validation: how we knew the workflow worked
You do not validate deletion safety by reading YAML. You validate it by rehearsing.
We built a teardown rehearsal in a non-prod cluster that mirrored production policies:
- same Argo Project boundaries
- same pruning settings
- same hook logic
- same apply modes
- same external dependencies mocked or pointed at a sandbox
Then we tested three scenarios:
- Remove a manifest from Git without deletion approval
- Request deletion approval but break the PreDelete hook
- Provide deletion approval and run a clean teardown
What we looked for was not “did it delete.” We looked for signals:
- Argo should pause pruning and show a clear “waiting for confirmation” state
- PreDelete hook failure should block deletion and surface an obvious error
- Audit logs should show who confirmed pruning and who changed Git
- The cluster should end in a predictable state, including PVC behavior and finalizer handling
If you cannot describe what “predictable end state” means, you do not have a deletion workflow. You have hope.
Tradeoffs and alternatives I still consider
There is no free lunch here.
Prune confirmation adds friction. Engineers will complain. Some of those complaints are valid if your workflow is clunky. Make the safe path fast, or people will route around it.
PreDelete hooks add operational surface area. They are Jobs that can fail, hang, or require permissions. They can also become a new place where secrets and credentials accumulate. Keep them minimal and review them like production code.
Separating delete intent from desired state adds process. It is worth it for high-risk systems, and overkill for ephemeral dev environments. Apply it selectively.
And there is a real alternative: do not use GitOps for deletion at all. Some orgs treat GitOps as deployment only, and require an explicit out-of-band operational procedure for teardown. That can be safer in highly regulated environments, but it also increases drift between what is in Git and what exists in clusters.
My bias is that GitOps can handle deletion, but only if you treat it as a privileged operation with explicit workflow. Argo CD 3.3 makes that easier, but it does not do it for you.
Production hardening and edge cases
A few practical things that will bite you if you ignore them:
Finalizers and stuck deletions are not rare
If a resource has a finalizer that cannot complete because a controller is gone, your deletion can hang indefinitely. That is common with CRDs, admission controllers, and operators.
Have a runbook that answers: Who is allowed to remove finalizers manually? What evidence is required? How do we ensure we are not leaving orphaned cloud resources?
CRDs are a special kind of blast radius
Deleting a CRD can cascade into deleting all CRs, depending on how things are structured. Even when it does not, it can break every dependent controller.
Protect CRDs from pruning unless you have a very deliberate decommission plan. If your platform installs CRDs, treat them as shared infrastructure, not as “part of the app.”
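Argo CD has a per-resource sync-option annotation for exactly this. Adding Prune=false to the CRD manifest means Argo will never prune it, even if it disappears from Git; removing it then has to be a deliberate, separate act. The CRD name below is a placeholder.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com    # placeholder CRD
  annotations:
    # Argo CD will never prune this resource automatically,
    # even when it vanishes from the desired state.
    argocd.argoproj.io/sync-options: Prune=false
```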
Observability needs a deletion lens
Track deletion events like you track deployments:
- rate of prunes
- number of resources deleted per sync
- time spent in deletion states
- number of stuck deletions and common finalizer reasons
If you do not measure deletion behavior, you will only notice it when something goes missing.
Lessons learned (mental models that lasted)
1. Reconciliation is power, not safety
GitOps gives you deterministic reconciliation. It does not give you safe intent. Safety is policy and workflow.
2. Deletion is a lifecycle, not an action
If you cannot express “before delete” steps, you are not ready to automate teardown. PreDelete hooks are valuable because they make that lifecycle explicit.
3. Make destructive intent explicit
The absence of a file is a terrible way to represent a privileged decision. Use explicit markers, explicit approvals, and explicit confirm steps.
4. Treat controllers as actors with authority
Argo CD is not a passive tool. It is a privileged operator. Design governance like you would for any automated system with power to destroy.
Closing reflection
Argo CD 3.3 did not magically make deletions safe. What it did was expose a cultural gap many of us were living in.
We built strong pipelines for creation. We built weak workflows for destruction. Then we acted surprised when the dangerous part behaved exactly as designed.
If you take one thing from this, let it be this: stop treating delete as “sync with extra steps.”
Treat it as a privileged runtime operation. Give it gates. Give it sequence. Give it observability. And make intent explicit enough that your future self can read a pull request and immediately understand the blast radius.
If you have a deletion story you still remember with a cold feeling, I would genuinely like to hear it. Those stories are where the real platform patterns come from.
Final takeaways
- Use PreDelete hooks to make teardown a visible, blockable lifecycle phase.
- Add prune confirmation and deliberate propagation modes for high-risk apps.
- Separate delete intent from desired state so destruction is explicit and auditable.
- Rehearse deletion like you rehearse disaster recovery, because it is the same class of risk.