## The CI/CD Lesson That Mattered More Than Branching Strategies
Nothing destroys trust in deployments faster than realizing staging and production were never actually running the same thing.
Every engineering team eventually hears some version of:
> "But it worked in staging."
Which usually translates to:
- different image
- different environment variables
- slightly different dependencies
- different permissions
- and now three engineers are trying to determine whether production broke because of the code or because the deployment process quietly changed underneath it
That realization changed how I think about CI/CD entirely.
Not branches.
Not tooling.
Not GitHub Actions.
Not Kubernetes.
**Artifact trust.**
That became the entire game.
---
## The Problem We Were Actually Solving
Our team ships containerized workloads into Kubernetes environments.
We needed:
- governance around production releases
- executive approval before production deployment
- auditability
- reliable testing
- and enough automation that engineers could still move quickly
Like most teams, we started with branching strategies.
```text
np → active development
rc → release candidate and stabilization
production → live environment
```
Without our governance and approval requirements, a trunk-based workflow would likely have been simpler to operate.
But eventually I realized something important:
Branches themselves do not make deployments safe.
You can have beautiful Git workflows, release candidates, change requests, approvals, and perfectly named branches, and still have completely unreliable deployments if your artifacts drift between environments.
That's the real failure mode.
---
## The Rule That Changed Everything
We adopted one simple rule:
**Build once. Promote the same image everywhere.**
No rebuilding containers between environments.
No "slightly different" production artifact.
No environment-specific build process.
The exact same image that runs in np becomes the image that eventually reaches production.
That one decision eliminated an enormous amount of deployment uncertainty.
We started calling this the slingshot pattern.
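
A rule like this is easier to keep when it is checked mechanically. The sketch below assumes each environment can report its running image as a digest-pinned reference (`name@sha256:...`). The `Deployment` type, the `find_drift` helper, and the registry hostname are illustrative names, not a description of any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    environment: str  # e.g. "np", "rc", "production"
    image_ref: str    # digest-pinned, e.g. "registry.example.com/app@sha256:..."


def digest_of(image_ref: str) -> str:
    """Return the digest part of a digest-pinned image reference."""
    _, sep, digest = image_ref.partition("@")
    if not sep or not digest.startswith("sha256:"):
        # A tag such as ":latest" can point at different images over time,
        # so it cannot prove two environments run the same artifact.
        raise ValueError(f"image is not pinned by digest: {image_ref!r}")
    return digest


def find_drift(deployments: list[Deployment]) -> dict[str, str]:
    """Return {environment: digest} for environments that differ from the first one."""
    if not deployments:
        return {}
    expected = digest_of(deployments[0].image_ref)
    return {
        d.environment: digest_of(d.image_ref)
        for d in deployments[1:]
        if digest_of(d.image_ref) != expected
    }
```

An empty result means every environment runs the artifact that was built once. Anything else names the environment that drifted.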
---
## The Slingshot Pattern
When a PR merges into np, automation kicks off:
1. Run tests
2. Run vulnerability scans
3. Build the container image
4. Run image and manifest scans
5. Deploy into the np Kubernetes cluster
6. Promote the exact same image into the rc artifact location
Not deployed yet. Just staged and waiting.
That distinction matters.
A lot of CI/CD systems accidentally rebuild images during promotion between environments, which means checksums differ, dependencies drift, debugging becomes painful, and "worked in staging" stops meaning anything.
We wanted deployments to feel boring and trustworthy.
Same image. Different clusters.
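
To make the difference between promoting and rebuilding concrete, here is a minimal in-memory sketch. It assumes an image is identified by a SHA-256 digest of its content. `ArtifactLocation` and `promote` are hypothetical stand-ins for a real registry and its copy operation, and the version tag and byte strings are made up for illustration.

```python
import hashlib


def image_digest(image_bytes: bytes) -> str:
    """Content digest: any change in the bytes changes the digest."""
    return "sha256:" + hashlib.sha256(image_bytes).hexdigest()


class ArtifactLocation:
    """Hypothetical stand-in for a per-environment artifact location (tag -> digest)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.tags: dict[str, str] = {}

    def push(self, tag: str, image_bytes: bytes) -> str:
        digest = image_digest(image_bytes)
        self.tags[tag] = digest
        return digest


def promote(source: ArtifactLocation, target: ArtifactLocation, tag: str) -> str:
    """Promote by copying the digest pointer; nothing is rebuilt."""
    digest = source.tags[tag]
    target.tags[tag] = digest
    return digest


np_repo = ArtifactLocation("np")
rc_repo = ArtifactLocation("rc")

built = np_repo.push("1.4.0", b"app code + dependency v2.1.0")
promoted = promote(np_repo, rc_repo, "1.4.0")

# A rebuild that picks up a newer dependency produces a different artifact.
rebuilt = image_digest(b"app code + dependency v2.1.1")
```

Here `promoted` equals `built`, while `rebuilt` does not: the tag looks the same, but the artifact behind it has changed.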
---
## RC Became Our Confidence Layer
When changes move from np → rc, another workflow triggers:
1. Deploy the exact same image into the rc Kubernetes cluster
2. Run a dry-run deployment against production using production credentials
3. Generate a GitHub pre-release
4. Open an Organizational Change Request for approval
5. Promote the same image into the production artifact location
That production dry-run step ended up catching an incredible amount of operational weirdness:
- missing permissions
- IAM problems
- secret access failures
- environment mismatches
- cluster-level surprises
Because nothing is more frustrating than: *"The application worked perfectly but production permissions failed."*
The earlier we caught those issues, the calmer deployments became.
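
The shape of a dry-run step like this can be sketched as a small preflight runner. This is an assumption about structure, not our actual workflow: the check names and the simulated failure are hypothetical, and real checks would call cluster or cloud APIs with production credentials in read-only or dry-run mode.

```python
from typing import Callable

# A check takes no arguments and raises an exception if the production
# environment would reject the deployment.
Check = Callable[[], None]


def run_preflight(checks: dict[str, Check]) -> list[str]:
    """Run every check and collect failures instead of stopping at the first one."""
    failures = []
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:  # report all problems in one pass
            failures.append(f"{name}: {exc}")
    return failures


# Hypothetical checks for illustration only.
def can_read_secret() -> None:
    raise PermissionError("service account cannot read secret 'db-password'")


def namespace_exists() -> None:
    return None  # pretend the namespace lookup succeeded


failures = run_preflight({
    "secret access": can_read_secret,
    "namespace": namespace_exists,
})
```

Collecting every failure in one pass matters here: a permissions problem and a missing secret tend to show up together, and finding them one deployment attempt at a time is what makes releases tense.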
---
## Why We Didn't Deploy Directly From `np`
One fair question is:
> "Why not just deploy directly from `np` using trunk-based workflows?"
And honestly, in a highly mature engineering organization, I probably would lean more in that direction.
But context matters.
The team I inherited was deploying workloads directly from laptops into environments through Cloud Run jobs — without standardized pipelines, release visibility, or strong deployment boundaries between environments.
`np` was effectively development, testing, deployment, and production approval all blended together.
So before we could optimize for deployment velocity, we first needed to establish:
- trust in automation
- trust in releases
- repeatable deployment behavior
- auditability
- and safer operational patterns
Most of our workloads were not customer-facing frontend applications. They were internal SaaS automations, API integrations, monitoring services, seat provisioning workflows, identity lifecycle tooling, and operational support systems.
Which meant many failures would not immediately surface visually. A deployment could succeed technically while silently failing operationally.
The `rc` layer became less about adding process and more about creating operational trust: confidence, validation, and a real boundary between experimentation and production promotion.
And honestly, it did slow the team down at first.
But that was intentional.
You cannot build trust in automation by moving faster than the automation is capable of validating safely.
The release process was not only technical infrastructure. It was behavioral infrastructure.
Over time, as teams mature operationally, you can absolutely reduce ceremony and move toward more continuous deployment models. But inherited systems rarely start there safely.
---
## The Goal Was Never More Automation
This is where I think a lot of platform engineering discussions go sideways.
People talk about CI/CD like the goal is maximum automation at all costs.
It's not.
The goal is trust.
A release process should make engineers feel more confident deploying software, not less. The automation only matters if it reduces uncertainty.
That's why the slingshot pattern worked so well for us:
- one artifact
- consistent promotion
- predictable deployments
- less drift
- fewer surprises
It reduced the number of unknowns engineers had to reason about during releases.
---
## The Most Fragile Part: Branch Divergence
The hardest part of the entire system wasn't Kubernetes.
It wasn't GitHub Actions.
It wasn't artifact promotion.
It was humans.
More specifically: hotfixes, release pressure, and branch divergence during stabilization windows.
Eventually rc would drift away from np:
- production fixes landed directly in rc
- last-minute changes happened under pressure
- release-specific tweaks appeared
- and suddenly the branches no longer represented the same reality
At first teams tell themselves: *"We'll remember to merge it back later."*
That works exactly until nobody remembers.
Then six weeks later: *"Why did this production bug come back?"*
We experimented with fully automating the back-merge process, but honestly, automatic conflict generation infrastructure is not as fun as it sounds.
What ended up working best was:
- automatic visibility
- notifications
- lightweight tooling
- and manual ownership when conflicts existed
Enough automation to reduce friction. Not enough automation to create invisible chaos.
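
The visibility half of that can be quite small. The sketch below relies on `git rev-list --left-right --count np...rc`, which prints how many commits exist only on each side. The helper names and the message wording are illustrative, and where the message gets sent (chat, email, a PR comment) is left open.

```python
import subprocess
from typing import Optional


def parse_counts(output: str) -> tuple[int, int]:
    """Parse the two whitespace-separated counts printed by
    `git rev-list --left-right --count np...rc`:
    commits only in np, then commits only in rc."""
    only_np, only_rc = output.split()
    return int(only_np), int(only_rc)


def divergence_message(only_np: int, only_rc: int) -> Optional[str]:
    """Return a notification when rc holds commits that np does not."""
    if only_rc == 0:
        return None  # np being ahead of rc is normal during development
    return f"rc has {only_rc} commit(s) not yet merged back into np"


def check_divergence(repo_dir: str = ".") -> Optional[str]:
    """Ask git how far the two branches have drifted apart."""
    result = subprocess.run(
        ["git", "rev-list", "--left-right", "--count", "np...rc"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return divergence_message(*parse_counts(result.stdout))
```

Running `check_divergence()` on a schedule and posting any non-empty result keeps the back-merge visible without trying to resolve conflicts automatically. A human still owns the merge.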
---
## What I Learned
The branches themselves don't make you safe. The automation and artifact consistency behind them do.
Once we committed to "same image, different clusters," deployments became dramatically easier to reason about.
If production failed, we knew:
- the artifact had not changed
- the deployment pipeline had not changed
- only the environment had
That level of confidence matters more than people realize.
Because at the end of the day, most engineers are not afraid of deployments.
They're afraid of uncertainty.
Good CI/CD systems reduce uncertainty.
That's the real job.