GitOps at Scale: Lessons from 50+ Kubernetes Clusters

GitOps Is Simple. GitOps at Scale Is Not.

The pitch is elegant: declare your desired state in Git, and a controller reconciles reality to match. No more kubectl apply from laptops. No more "it works on my cluster." Full audit trail, easy rollbacks, and infrastructure-as-code for everything.

We've implemented GitOps across 50+ Kubernetes clusters for clients ranging from Series A startups to Fortune 500 enterprises. Here's what we've learned about making it work at scale.

Lesson 1: Mono-Repo vs. Multi-Repo — It Depends

The first decision every team agonizes over. Here's our guidance:

Mono-Repo (Recommended for < 20 services)

gitops-config/
├── base/
│   ├── namespaces/
│   ├── network-policies/
│   └── rbac/
├── apps/
│   ├── payment-service/
│   ├── auth-service/
│   └── notification-service/
└── clusters/
    ├── prod-us-east/
    ├── prod-eu-west/
    └── staging/

Pros: Single PR for cross-cutting changes, easy to see the full picture, simpler CI/CD

Cons: Gets unwieldy past 20-30 services, blast radius of bad merges is larger
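As a rough sketch of how the layout above can be wired together, each cluster directory might hold a kustomization that pulls in the shared base and the apps deployed to that cluster. This assumes every referenced directory contains its own kustomization.yaml; the exact composition is an illustration, not a prescribed structure.

```yaml
# clusters/prod-us-east/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Cluster-wide foundations shared by every cluster
  - ../../base/namespaces
  - ../../base/network-policies
  - ../../base/rbac
  # Applications deployed to this cluster
  - ../../apps/payment-service
  - ../../apps/auth-service
  - ../../apps/notification-service
```

Running `kustomize build clusters/prod-us-east` then renders everything that cluster should contain, which is what makes the "full picture" easy to review in a single PR.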

Multi-Repo (Recommended for > 20 services)

Each team owns their own config repo. A central "cluster config" repo references them.

Pros: Team autonomy, smaller blast radius, independent release cycles

Cons: Harder to make global changes, more repos to manage, need tooling for cross-repo visibility
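One way the central "cluster config" repo can reference a team repo is an ArgoCD Application whose source points at that repo. The sketch below is an assumption about how this could look; the repository URL, project name, path, and namespace are hypothetical placeholders.

```yaml
# In the central cluster-config repo (file name and location are hypothetical)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: payments                  # hypothetical AppProject owned by the team
  source:
    repoURL: https://github.com/org/payment-service-config   # hypothetical team repo
    targetRevision: main
    path: overlays/production        # assumed layout inside the team repo
  destination:
    server: https://kubernetes.default.svc
    namespace: payment-service       # hypothetical namespace
```

The central repo then only decides which team repos land on which clusters, while each team controls what is inside its own repo.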

Our Recommendation

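Start mono-repo. Split when it becomes painful (usually around 20 services or 5 teams). If each app already lives in its own self-contained Kustomize or Helm directory, the migration is mostly a matter of moving that directory to a new repo and updating the reference in the cluster config.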

Lesson 2: Environment Promotion Done Right

The most common GitOps anti-pattern: manually updating image tags in config files across environments.

The Right Way: Automated Image Promotion

yaml

# Use Kustomize overlays for environment-specific config
# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - hpa.yaml

# overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:                   # patchesStrategicMerge is deprecated in Kustomize v5
  - path: deployment-patch.yaml
images:
  - name: payment-service
    newTag: v2.3.1-rc.1    # Automatically updated by CI

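# overlays/production/kustomization.yaml
# Same structure as staging, but with the production image tag
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: payment-service
    newTag: v2.3.0          # Promoted from staging after validation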
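The staging overlay references a deployment-patch.yaml that is not shown. A minimal sketch of what such a patch might contain follows; the resource values and the LOG_LEVEL variable are illustrative assumptions, and the Deployment and container names are assumed to match base/deployment.yaml.

```yaml
# overlays/staging/deployment-patch.yaml (illustrative contents)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service            # must match the Deployment name in the base
spec:
  template:
    spec:
      containers:
        - name: payment-service    # strategic merge matches containers by name
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
          env:
            - name: LOG_LEVEL      # hypothetical setting
              value: debug
```

Only the fields listed in the patch are overridden; everything else comes from the base. The patch deliberately leaves `replicas` alone, since the base also ships an HPA.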

Promotion Pipeline

CI builds image → Updates staging config → ArgoCD syncs staging
                                          ↓
                              Integration tests pass
                                          ↓
                              PR auto-created for prod config
                                          ↓
                              Team approves → ArgoCD syncs prod
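The "PR auto-created for prod config" step could be sketched as a GitHub Actions workflow like the one below. This is one possible implementation under several assumptions: the config repo is hosted on GitHub, `kustomize` and `gh` are available on the runner, the repository allows Actions to create pull requests, and the overlay lives at overlays/production. The workflow name, branch naming, and bot identity are hypothetical.

```yaml
# .github/workflows/promote-to-prod.yaml (hypothetical file)
name: promote-to-prod
on:
  workflow_dispatch:
    inputs:
      tag:
        description: Image tag already validated in staging
        required: true
permissions:
  contents: write
  pull-requests: write
jobs:
  open-promotion-pr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Bump the production image tag
        working-directory: overlays/production   # assumed overlay location
        env:
          TAG: ${{ inputs.tag }}
        run: kustomize edit set image "payment-service:${TAG}"
      - name: Open a pull request for review
        env:
          GH_TOKEN: ${{ github.token }}
          TAG: ${{ inputs.tag }}
        run: |
          git config user.name "promotion-bot"
          git config user.email "promotion-bot@example.com"
          git switch -c "promote/payment-service-${TAG}"
          git commit -am "Promote payment-service ${TAG} to production"
          git push -u origin "promote/payment-service-${TAG}"
          gh pr create --base main --fill
```

The workflow never touches the production cluster. It only proposes a Git change, so the human approval in the last step of the pipeline remains the gate.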

Lesson 3: Drift Detection Is Non-Negotiable

Someone will kubectl edit in production. It's not a matter of if, but when.

Configure ArgoCD for Self-Healing

yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
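spec:
  # project, source, and destination are required; the values below are placeholders
  project: default
  source:
    repoURL: https://github.com/org/gitops-config
    targetRevision: main
    path: apps/payment-service
  destination:
    server: https://kubernetes.default.svc
    namespace: payment-service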
  syncPolicy:
    automated:
      prune: true          # Remove resources not in Git
      selfHeal: true       # Revert manual changes
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
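Self-healing can fight controllers that legitimately change live objects. A common case, assuming the HPA from the base manages the Deployment's replica count, is ArgoCD repeatedly resetting `spec.replicas` to the value in Git. A sketch of how to exclude that field from drift detection:

```yaml
# Fragment to merge into the Application spec above
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas                 # owned by the HorizontalPodAutoscaler
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - RespectIgnoreDifferences=true    # leave ignored fields untouched during sync
```

An alternative is to omit `replicas` from the Deployment manifest in Git entirely, so there is nothing for ArgoCD to compare.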

Alert on Drift

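Even with self-healing, you want to know when drift persists. An application that stays OutOfSync for several minutes usually means the automated sync is failing or blocked: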

yaml

# Prometheus alert for ArgoCD drift
- alert: ArgoCDApplicationOutOfSync
  expr: argocd_app_info{sync_status="OutOfSync"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Application {{ $labels.name }} is out of sync"
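Sync status is only half of the picture: an application can be in sync with Git and still be unhealthy. As a sketch, a companion alert on the `health_status` label of the same metric could look like this, wrapped in a complete Prometheus rule file. The group name, duration, and severity are assumptions to tune for your environment.

```yaml
# argocd-alerts.yaml (hypothetical rule file)
groups:
  - name: argocd
    rules:
      - alert: ArgoCDApplicationDegraded
        expr: argocd_app_info{health_status="Degraded"} == 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Application {{ $labels.name }} is degraded"
```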

Lesson 4: Secrets in GitOps

The hardest problem in GitOps: you can't put secrets in Git, but GitOps says everything should be in Git.

Our Recommended Approach: External Secrets Operator

yaml

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payment-db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: payment-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: secret/data/payment-service/db
        property: username
    - secretKey: password
      remoteRef:
        key: secret/data/payment-service/db
        property: password

Secrets live in Vault. ExternalSecret manifests live in Git. The operator syncs them into Kubernetes secrets. Best of both worlds.
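The ExternalSecret above refers to a ClusterSecretStore named vault-backend. A minimal sketch of such a store using Vault's Kubernetes auth method is shown below. The Vault address, auth mount path, role, and service account are hypothetical, and how `remoteRef.key` maps onto the KV mount depends on the `path` and `version` chosen here, so check both against your Vault layout.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "https://vault.example.com"   # hypothetical Vault address
      path: "secret"                        # KV mount path
      version: "v2"                         # KV secrets engine version
      auth:
        kubernetes:
          mountPath: "kubernetes"           # assumed auth mount in Vault
          role: "external-secrets"          # hypothetical Vault role
          serviceAccountRef:
            name: "external-secrets"
            namespace: "external-secrets"   # required for a cluster-scoped store
```

No credential appears in the manifest itself: the operator authenticates with its service account token, so the store definition is safe to keep in Git.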

Lesson 5: Multi-Cluster Management

With 50 clusters, you need a management layer. Here's our architecture:

Hub-and-Spoke with ArgoCD

Management Cluster (Hub)
├── ArgoCD (manages all clusters)
├── ApplicationSets (generates apps per cluster)
└── Cluster Secrets (credentials for spoke clusters)

Spoke Clusters
├── prod-us-east-1
├── prod-us-west-2
├── prod-eu-west-1
├── staging
└── dev

ApplicationSets for DRY Config

yaml

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: production
  template:
    metadata:
      name: '{{name}}-platform'
    spec:
      project: platform
      source:
        repoURL: https://github.com/org/gitops-config
        path: 'clusters/{{name}}/platform'
      destination:
        server: '{{server}}'
        namespace: platform

One ApplicationSet definition generates an Application for every registered cluster labeled environment: production, including clusters added later.
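The `environment: production` selector matches labels on the ArgoCD cluster Secrets stored in the hub, not labels on the clusters themselves. A sketch of one such Secret follows; the server URL is hypothetical and the credential fields are placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prod-us-east-1
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # marks this Secret as a cluster definition
    environment: production                   # matched by the ApplicationSet selector
type: Opaque
stringData:
  name: prod-us-east-1                        # becomes {{name}} in the template
  server: https://prod-us-east-1.example.com  # becomes {{server}}; hypothetical URL
  config: |
    {
      "bearerToken": "<redacted>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded CA certificate>"
      }
    }
```

Because this Secret carries real credentials, it should be produced by your secrets tooling (for example an ExternalSecret) rather than committed to Git as written.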

Lesson 6: Breaking Changes and Rollbacks

Instant Rollbacks

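This is GitOps' superpower. Rolling back is just reverting a Git commit. With automated sync enabled, ArgoCD picks up the revert on its next poll (every three minutes by default) or right away if a Git webhook is configured:

bash

git revert --no-edit HEAD
git push origin main
# ArgoCD detects the change and rolls back automatically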

Canary Deployments with GitOps

Use Argo Rollouts for progressive delivery:

yaml

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 5m }
        - setWeight: 60
        - pause: { duration: 10m }
      analysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: payment-service
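The Rollout refers to an AnalysisTemplate named success-rate. A sketch of what that template might look like with a Prometheus provider is below. The Prometheus address, the `http_requests_total` metric with its `service` and `status` labels, and the 95% threshold are all assumptions to replace with your own.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name                  # supplied by the Rollout's analysis args
  metrics:
    - name: success-rate
      interval: 1m                        # take a measurement every minute
      successCondition: result[0] >= 0.95
      failureLimit: 3                     # abort the rollout after 3 failed measurements
      provider:
        prometheus:
          address: http://prometheus.example.com:9090   # hypothetical address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

If the analysis fails, Argo Rollouts aborts the canary and shifts traffic back to the stable version without anyone touching Git.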

Key Takeaways

1. Start mono-repo, split when it hurts
2. Automate environment promotion and never manually edit image tags
3. Enable self-healing and alert on drift
4. Use External Secrets Operator for secrets management
5. Use ApplicationSets for multi-cluster management
6. Make Git revert your rollback strategy

GitOps at scale requires investment in tooling and process, but the payoff — full audit trail, instant rollbacks, and declarative everything — is worth it.

Want to implement GitOps across your clusters? Schedule a free assessment.
