A Zero-Downtime Emergency Infra Migration, and the Optimizations Along the Way

Tags: Infra, Migration, Kubernetes

Background

The business side announced they were shutting down our existing Azure subscription. Everything running on it had to move — either to a new isolated Azure subscription or to a different cloud provider entirely — under a hard deadline, with the explicit constraint that users should not experience any downtime.

The scope:

  • 22 services, around 200 pods running on Kubernetes

  • Blob storage and Key Vault per tenancy

  • All supporting infrastructure (CI/CD, IaC, secrets, DNS)

What stayed put: the Kafka cluster and Elasticsearch — these sit in front of the architecture (CQRS-style, Kafka as the write model, Elasticsearch as the read model), are managed by their respective vendors, and weren't part of the subscription teardown. A colleague handled re-streaming between old and new Kafka clusters separately as part of the broader refactor; that's a story for another post.

What follows is the migration pattern that came out of it — and the long-overdue cleanup that happened alongside, because if you're going to touch every piece of infrastructure anyway, you might as well leave it better than you found it.

The Zero-Downtime Migration Pattern

Blob storage and Key Vault had the same fundamental constraint: client SDKs are bound to the credentials of one specific resource, so there was no way to swap the resource transparently behind the service. The migration had to be visible to the application layer but invisible to the user.

The pattern we landed on, applied to both:

1. Provision the new resource. New blob storage account, new key vault, in the new subscription.

2. First full sync. Copy everything from old to new. For blob storage this was azcopy, and because we kept the new account in the same region as the old one, the transfer rode Azure's internal network rather than the public internet. That made it fast and, crucially, free of egress bandwidth charges. For Key Vault, no equivalent first-party tool exists, so we wrote a script.

3. Deploy a version of the service that holds both sets of credentials. Reads and writes can be routed to either side. The routing decision is controlled by a feature flag scoped per tenancy — so we can flip tenancies one at a time rather than all at once. Flipped tenancies now write to the new resource; non-flipped tenancies still write to the old one.

4. Flip tenancies progressively. Each tenancy gets verified after its flip. If something looks wrong, the flag flips back. Blast radius is exactly one tenancy.

5. Incremental sync after each flip. Between the initial full sync and the moment a tenancy is flipped, the old resource accumulates writes that the new one doesn't have. A second sync (old to new, delta only) closes that window.

6. Cleanup. Once every tenancy is on the new side and stable, remove the old credentials, the feature flag, and the old resource.
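The delta in step 5 can be planned with a small comparison of listings. What follows is a minimal sketch, not the script we actually ran: `ObjectInfo`, `plan_delta_sync`, and the `flipped_at` cutoff are hypothetical names, and the rule "anything the new side wrote after the flip wins" is an assumption about how conflicts should resolve. For blobs, `azcopy sync` covers similar ground by comparing last-modified times.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class ObjectInfo:
    """Hypothetical listing entry: one blob or one secret."""
    name: str
    last_modified: datetime
    content_hash: str


def plan_delta_sync(
    old: Iterable[ObjectInfo],
    new: Iterable[ObjectInfo],
    flipped_at: datetime,
) -> List[str]:
    """Return the names that still need copying from old to new.

    Assumption: once a tenancy is flipped, the new side is authoritative,
    so an object the new side modified at or after the flip is never
    overwritten by an older copy from the old side.
    """
    new_by_name = {obj.name: obj for obj in new}
    to_copy = []
    for obj in old:
        existing = new_by_name.get(obj.name)
        if existing is None:
            # Written to the old side after the first full sync.
            to_copy.append(obj.name)
        elif existing.last_modified >= flipped_at:
            # The new side already has a post-flip write; leave it alone.
            continue
        elif existing.content_hash != obj.content_hash:
            # Updated on the old side after the first full sync.
            to_copy.append(obj.name)
    return sorted(to_copy)
```

The planner only returns names; the actual copy stays with whatever tool did the first full sync, so the two syncs share one code path.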

The thing that makes this work is the tenancy-scoped feature flag. Without it, the migration becomes a single big-bang flip with no rollback granularity. With it, the migration is just a sequence of small, reversible operations.
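To make the routing concrete, here is a minimal sketch of a tenancy-scoped router. It assumes the flag state is simply a set of flipped tenancy IDs held in memory; in a real service it would come from whatever feature-flag system is in use. `Store`, `InMemoryStore`, and `TenancyRouter` are illustrative names, not the real service code or any Azure SDK type.

```python
from typing import Dict, Optional, Protocol, Set


class Store(Protocol):
    """Minimal interface shared by the old and new resource clients."""

    def read(self, key: str) -> bytes: ...

    def write(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for a blob or vault client, for illustration only."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._data[key]

    def write(self, key: str, data: bytes) -> None:
        self._data[key] = data


class TenancyRouter:
    """Holds both clients and picks one per tenancy based on the flag."""

    def __init__(
        self, old: Store, new: Store, flipped: Optional[Set[str]] = None
    ) -> None:
        self._old = old
        self._new = new
        self._flipped: Set[str] = set(flipped or ())

    def store_for(self, tenancy: str) -> Store:
        # Flipped tenancies use the new resource; everyone else stays put.
        return self._new if tenancy in self._flipped else self._old

    def flip(self, tenancy: str) -> None:
        self._flipped.add(tenancy)

    def rollback(self, tenancy: str) -> None:
        # Rolling back affects exactly one tenancy.
        self._flipped.discard(tenancy)
```

Usage is one lookup per request: `router.store_for(tenancy_id).write(key, data)`. Because the decision lives in one place, removing the flag at cleanup time means deleting the router and handing callers the new client directly.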

The Cleanup That Came With It

A forced migration is also a free pass to fix things you've been wanting to fix. Two pieces of accumulated debt got addressed during this window.

CI/CD Pipeline Naming

The existing pipelines had no naming convention. Files had been added ad hoc over years, and the result was that almost nothing could be reused — you couldn't tell from a filename whether a pipeline was a template or an entrypoint, what it operated on, or what it was for. Every new service ended up duplicating an existing pipeline rather than reusing it.

The convention I introduced was a three-segment scheme: [Region]-[Type]-[Usage]. Where a segment needed multiple words, underscores joined them. The result:

infra-template-k8s.yaml               # template: deploys K8s-cluster-internal infra
infra-template-all.yaml               # template: composes all infra templates
infra-deploy-all.yaml                 # entrypoint: deploys all infra
service-template-build.yaml           # template: builds a Docker image
service-template-deploy_kotlin.yaml   # template: deploys a Kotlin service
service-template-deploy_node.yaml     # template: deploys a Node service

With this in place, every service pipeline became "pick a template, supply inputs." Adding a new service stopped being a copy-paste exercise.
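A convention like this is easy to enforce mechanically. The sketch below is one way a lint step could check filenames against the three-segment scheme. It assumes lowercase alphanumeric segments with underscores between words and a `.yaml` extension, which matches the files listed above but is my reading of the convention rather than a written spec. `parse_pipeline_name` is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional

# One segment: lowercase words joined by underscores, e.g. "deploy_kotlin".
_SEGMENT = r"[a-z0-9]+(?:_[a-z0-9]+)*"
_PIPELINE_FILE = re.compile(
    rf"^(?P<region>{_SEGMENT})-(?P<type>{_SEGMENT})-(?P<usage>{_SEGMENT})\.yaml$"
)


class PipelineName(NamedTuple):
    region: str
    type: str
    usage: str


def parse_pipeline_name(filename: str) -> Optional[PipelineName]:
    """Split a pipeline filename into its segments, or return None
    if it does not follow [Region]-[Type]-[Usage].yaml."""
    match = _PIPELINE_FILE.match(filename)
    if match is None:
        return None
    return PipelineName(match["region"], match["type"], match["usage"])
```

For example, `parse_pipeline_name("service-template-deploy_kotlin.yaml")` yields the segments `service`, `template`, and `deploy_kotlin`, while a legacy name such as `deploy.yaml` returns `None` and can fail the check.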

This turned out to be load-bearing for the service migration phase — see below.

Terraform Module Refactor

The Terraform layout was, to put it generously, hard to navigate. Modules were tangled, and there was no clear separation between environment-specific configuration and the reusable building blocks underneath. Changing anything meant a guessing game about which file actually owned what.

The refactored structure:

terraform/
├── envs/
│   ├── dev/
│   ├── staging/
│   └── production/
└── modules/
    ├── azure/
    ├── cloudflare/
    ├── confluent/
    └── etc.../

Reusable building blocks live in modules/, organized by provider. Environment-specific composition and configuration live in envs/. Changing infrastructure now means editing the config in the relevant env directory; the modules themselves rarely need to be touched.

For Confluent specifically, topic creation had previously been handled by a pile of scripts. As part of this refactor, topic definitions were pulled into pure Terraform, so the IaC now fully describes the desired state of the messaging layer too.

The Kubernetes Cluster Move

The K8s migration had its own complication: whichever provider we moved to would start us under a hard quota cap that takes time to lift. Given the deadline, "wait for quota" wasn't viable, so the selection criterion became which reasonably well-known provider could grant the quota we needed the fastest. Which provider we chose isn't the point here; how the cluster behaved operationally after the migration is a longer story that I've written about separately.

The cutover itself was straightforward thanks to two properties of the workload:

Most services are event-driven. They consume from Kafka, do their work, and produce results. They don't have inbound user traffic, which means migration is purely a matter of standing up the consumer in the new cluster, letting both clusters run in parallel briefly, then draining the old one. No user-facing cutover moment.

One service has user-facing traffic — a legacy UI used by a small subset of users who haven't migrated to the serverless version that the majority of users are on. This one needed a real DNS cutover:

  1. Deploy the new cluster's version of the service, bound to a temporary domain.

  2. Verify it end-to-end against real traffic patterns.

  3. Switch DNS on the primary domain to point at the new deployment.

  4. Keep the old deployment running through the DNS propagation window — anyone still resolving to the old IP hits a working service.

  5. Drain and remove once traffic on the old side reaches zero.
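Step 5 hides a judgment call: when is "zero" really zero? A rough sketch of one possible rule follows, assuming the old deployment exposes per-window request counts and that the DNS record's TTL is known. `safe_to_drain` and its parameters are hypothetical, and the default of three quiet windows is an arbitrary choice, not something we measured.

```python
from datetime import datetime, timedelta
from typing import Sequence


def safe_to_drain(
    cutover_at: datetime,
    now: datetime,
    dns_ttl: timedelta,
    recent_request_counts: Sequence[int],
    quiet_windows: int = 3,
) -> bool:
    """Decide whether the old deployment can be drained.

    Requires both that at least one full TTL has elapsed since the DNS
    switch and that the last `quiet_windows` observation windows saw no
    requests on the old side. Some resolvers cache beyond the TTL, which
    is why observed traffic is checked instead of trusting the TTL alone.
    """
    if now - cutover_at < dns_ttl:
        return False
    if len(recent_request_counts) < quiet_windows:
        return False
    return all(count == 0 for count in recent_request_counts[-quiet_windows:])
```

Both conditions are cheap to evaluate from existing metrics, and keeping the old deployment up until both hold costs only a little extra compute.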

The user-facing service experienced no downtime. The event-driven services experienced no downtime by definition — there's no foreground request to drop.

Because the pipeline refactor had already happened, migrating each service was largely mechanical: point the new templates at the new cluster, supply the inputs, deploy. The reusable templates absorbed almost all the per-service variability.

Result

Most of the 22 services were moved within a week, with no user-visible downtime. The remaining services followed shortly after, on the same pattern.

What actually shipped was more than the migration itself:

  • A reusable zero-downtime migration pattern for credential-bound resources (blob, vault, and anything else that fits the shape)

  • A CI/CD pipeline structure that turned new-service onboarding into a configuration change

  • A Terraform layout that separates "what changes per environment" from "what's reusable across them"

  • Topic management moved out of scripts and into IaC

Reflection

Two things stand out in hindsight.

The first is that the migration pattern is more valuable than the migration. The specific event — losing an Azure subscription — happens rarely. But the shape of the problem (move a credential-bound resource without downtime, with per-tenant rollback) recurs constantly: changing vendors, splitting accounts, complying with a new data residency requirement. The pattern, once written down, applies the next time too. The migration paid for itself once; the pattern keeps paying.

The second is that forced deadlines are a legitimate forcing function for cleanup. The pipeline mess and the Terraform mess had both been on the "should fix someday" list for a long time, and "someday" never arrived under normal business pressure. The deadline made the cleanup non-optional, because without the cleanup, finishing the migration in time wasn't possible. The deadline didn't just cost time; it bought leverage to fix things that had been blocking the team for years.

If the cleanup hadn't happened, this would have been a story about heroically firefighting under a deadline. Because it did, it's a story about leaving the platform meaningfully better than it started.