We Stopped Trusting Deployments Until We Fixed This

Nothing destroys trust in deployments faster than realizing staging and production were never actually running the same thing.

Every engineering team eventually hears some version of:

"But it worked in staging."

Which usually translates to:

different image
different environment variables
slightly different dependencies
different permissions
and now three engineers are trying to determine whether production broke because of the code or because the deployment process quietly changed underneath it

That realization changed how I think about CI/CD entirely.

Not branches. Not tooling. Not GitHub Actions. Not Kubernetes.

Artifact trust.

That became the entire game.

The Problem We Were Actually Solving

Our team ships containerized workloads into Kubernetes environments.

We needed:

governance around production releases
executive approval before production deployment
auditability
reliable testing
and enough automation that engineers could still move quickly

Like most teams, we started with branching strategies.

np         → active development
rc         → release candidate and stabilization
production → live environment

Without governance and approval requirements, trunk-based development would likely have been simpler operationally.

But eventually I realized something important:

Branches themselves do not make deployments safe.

You can have beautiful Git workflows, release candidates, change requests, approvals, and perfectly named branches — and still have completely unreliable deployments if your artifacts drift between environments.

That's the real failure mode.

The Rule That Changed Everything

We adopted one simple rule:

Build once. Promote the same image everywhere.

No rebuilding containers between environments. No "slightly different" production artifact. No environment-specific build process.

The exact same image that runs in np becomes the image that eventually reaches production.

That one decision eliminated an enormous amount of deployment uncertainty.

We started calling this the slingshot pattern.
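One way to make the build-once rule checkable is to compare image digests across environments. The sketch below is a minimal illustration, not our actual tooling: it assumes each environment reports its running image as a digest-pinned reference, and the function names and registry paths are hypothetical.

```python
import re

# Matches the digest at the end of a pinned reference: "<repo>@sha256:<64 hex chars>".
DIGEST_RE = re.compile(r"@(sha256:[0-9a-f]{64})$")


def digest_of(image_ref: str) -> str:
    """Return the sha256 digest from a digest-pinned image reference."""
    match = DIGEST_RE.search(image_ref)
    if match is None:
        # A tag such as ":latest" is mutable, so it proves nothing about identity.
        raise ValueError(f"image is not pinned by digest: {image_ref}")
    return match.group(1)


def assert_same_artifact(images_by_env: dict[str, str]) -> str:
    """Raise if the environments do not all reference the same image digest."""
    digests = {env: digest_of(ref) for env, ref in images_by_env.items()}
    unique = set(digests.values())
    if len(unique) != 1:
        raise RuntimeError(f"artifact drift detected: {digests}")
    return unique.pop()
```

Repository paths can differ per environment (np, rc, production); only the digest has to match. A check like this could run as a gate before each promotion.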

The Slingshot Pattern

When a PR merges into np, automation kicks off:

Run tests
Run vulnerability scans
Build the container image
Run image and manifest scans
Deploy into the np Kubernetes cluster
Promote the exact same image into the rc artifact location

Not deployed yet. Just staged and waiting.

That distinction matters.

A lot of CI/CD systems accidentally rebuild images during promotion between environments. That means image digests differ, dependencies drift, debugging becomes painful, and "worked in staging" stops meaning anything.

We wanted deployments to feel boring and trustworthy.

Same image. Different clusters.
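Promotion without a rebuild can be sketched as a registry-to-registry copy of a digest-pinned image. This is an illustration under assumptions: it uses `skopeo copy` as the copy tool, which is my choice for the example and not necessarily what any given pipeline uses, and the function names are hypothetical.

```python
import subprocess


def promotion_command(source_repo: str, digest: str, target_repo: str, tag: str) -> list[str]:
    """Build a registry-to-registry copy command. No build step is involved."""
    if not digest.startswith("sha256:"):
        # Promoting by mutable tag would reintroduce the drift we are avoiding.
        raise ValueError("promote by digest, not by mutable tag")
    return [
        "skopeo", "copy",
        f"docker://{source_repo}@{digest}",  # exact bytes that were tested in np
        f"docker://{target_repo}:{tag}",     # staged location for the next environment
    ]


def promote(source_repo: str, digest: str, target_repo: str, tag: str) -> None:
    """Copy the image; raises CalledProcessError if the copy fails."""
    subprocess.run(promotion_command(source_repo, digest, target_repo, tag), check=True)
```

The point of the design is that the source is always addressed by digest, so the image that lands in the rc location is byte-for-byte the one that ran in np.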

RC Became Our Confidence Layer

When changes move from np → rc, another workflow triggers:

Deploy the exact same image into the rc Kubernetes cluster
Run deployments using production credentials in dry-run mode
Generate a GitHub pre-release
Open an Organizational Change Request for approval
Promote the same image into the production artifact location

That production dry-run step ended up catching an incredible amount of operational weirdness:

missing permissions
IAM problems
secret access failures
environment mismatches
cluster-level surprises

Because nothing is more frustrating than: "The application worked perfectly but production permissions failed."

The earlier we caught those issues, the calmer deployments became.
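A production dry run of this kind can be sketched with `kubectl apply --dry-run=server`, which sends the manifests to the API server for validation, admission, and authorization checks without persisting anything. The sketch assumes a kubeconfig context for the production cluster; the context name and manifest paths are placeholders, and a real pipeline would likely check more than the apply path (secret access at runtime, for example, is not covered here).

```python
import subprocess


def dry_run_command(manifest_path: str, context: str) -> list[str]:
    """Build a server-side dry-run apply against the given kubeconfig context."""
    return [
        "kubectl", "--context", context,
        "apply", "--dry-run=server", "-f", manifest_path,
    ]


def preflight(manifest_paths: list[str], context: str, run=subprocess.run) -> list[str]:
    """Return the manifests whose server-side dry run failed.

    `run` is injectable so the check can be exercised without a cluster.
    """
    failed = []
    for path in manifest_paths:
        result = run(dry_run_command(path, context), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Running this from rc with production credentials surfaces permission and admission failures before the real release, which is when they are cheapest to fix.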

Why We Didn't Deploy Directly From `np`

One fair question is:

"Why not just deploy directly from np using trunk-based workflows?"

And honestly, in a highly mature engineering organization, I would probably lean more in that direction.

But context matters.

The team I inherited was deploying workloads directly from laptops into environments through Cloud Run jobs — without standardized pipelines, release visibility, or strong deployment boundaries between environments.

np was effectively development, testing, deployment, and production approval all blended together.

So before we could optimize for deployment velocity, we first needed to establish:

trust in automation
trust in releases
repeatable deployment behavior
auditability
and safer operational patterns

Most of our workloads were not customer-facing frontend applications. They were internal SaaS automations, API integrations, monitoring services, seat provisioning workflows, identity lifecycle tooling, and operational support systems.

Which meant many failures would not immediately surface visually. A deployment could succeed technically while silently failing operationally.

The rc layer became less about adding process and more about creating operational trust — confidence, validation, and a real boundary between experimentation and production promotion.

And honestly, it did slow the team down at first.

But that was intentional.

You cannot build trust in automation by moving faster than the automation is capable of validating safely.

The release process was not only technical infrastructure. It was behavioral infrastructure.

Over time, as teams mature operationally, you can absolutely reduce ceremony and move toward more continuous deployment models. But inherited systems rarely start there safely.

The Goal Was Never More Automation

This is where I think a lot of platform engineering discussions go sideways.

People talk about CI/CD like the goal is maximum automation at all costs.

It's not.

The goal is trust.

A release process should make engineers feel more confident deploying software, not less. The automation only matters if it reduces uncertainty.

That's why the slingshot pattern worked so well for us:

one artifact
consistent promotion
predictable deployments
less drift
fewer surprises

It reduced the number of unknowns engineers had to reason about during releases.

The Most Fragile Part: Branch Divergence

The hardest part of the entire system wasn't Kubernetes. It wasn't GitHub Actions. It wasn't artifact promotion.

It was humans.

More specifically: hotfixes, release pressure, and branch divergence during stabilization windows.

Eventually rc would drift away from np:

production fixes landed directly in rc
last-minute changes happened under pressure
release-specific tweaks appeared
and suddenly the branches no longer represented the same reality

At first teams tell themselves: "We'll remember to merge it back later."

That works exactly until nobody remembers.

Then six weeks later: "Why did this production bug come back?"

We experimented with fully automating the back-merge process, but honestly, automatic conflict generation infrastructure is not as fun as it sounds.

What ended up working best was:

automatic visibility
notifications
lightweight tooling
and manual ownership when conflicts existed

Enough automation to reduce friction. Not enough automation to create invisible chaos.
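The "automatic visibility" part can be as small as counting commits that exist on rc but not on np. The sketch below uses `git rev-list --left-right --count`, which prints two tab-separated counts for the two sides of a symmetric difference. The remote branch names and the notice wording are assumptions, and `--no-merges` is included on the assumption that merge commits created by promotion should not count as divergence; a squash-merge workflow would need a different comparison.

```python
import subprocess
from typing import Optional


def parse_divergence(rev_list_output: str) -> tuple[int, int]:
    """Parse `git rev-list --left-right --count A...B` output into (only_in_a, only_in_b)."""
    left, right = rev_list_output.split()
    return int(left), int(right)


def divergence(base: str = "origin/np", release: str = "origin/rc") -> tuple[int, int]:
    """Count non-merge commits unique to each branch. Must run inside the repository."""
    out = subprocess.run(
        ["git", "rev-list", "--left-right", "--count", "--no-merges", f"{base}...{release}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_divergence(out)


def back_merge_notice(only_in_np: int, only_in_rc: int) -> Optional[str]:
    """Return a reminder when rc holds commits that np has not received yet."""
    if only_in_rc == 0:
        return None
    return f"rc has {only_in_rc} commit(s) not in np; a back-merge is pending."
```

The script only reports. Resolving the back-merge stays with a person, which matches the split described above: automation for visibility, manual ownership for conflicts.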

What I Learned

The branches themselves don't make you safe. The automation and artifact consistency behind them do.

Once we committed to "same image, different clusters," deployments became dramatically easier to reason about.

If production failed, we knew:

the artifact had not changed
the deployment pipeline had not changed
only the environment had

That level of confidence matters more than people realize.

Because at the end of the day, most engineers are not afraid of deployments.

They're afraid of uncertainty.

Good CI/CD systems reduce uncertainty.

That's the real job.