We switched our client deployments to GitOps two years ago and haven't looked back. The pitch is simple: your Git repository is the single source of truth for what should be running in your cluster. An operator watches the repo and makes reality match. No more kubectl apply from laptops, no more "who changed that deployment?" mysteries, no more manual rollbacks at 2 AM.
GitOps Principles (The Actual Definition)
The OpenGitOps project defines four principles. They sound academic, but each one solves a real operational pain.
| Principle | What It Means | What It Solves |
|---|---|---|
| Declarative | Describe the desired state, not the steps to get there | "It worked on my machine" — because the machine state is defined in code |
| Versioned and immutable | Desired state stored in Git with full history | "Who changed what and when?" — git log tells you everything |
| Pulled automatically | Agents pull desired state and apply it (not pushed by CI) | Drift detection — if someone does a manual kubectl edit, it gets reverted |
| Continuously reconciled | Agents constantly compare desired vs actual state and correct drift | "The deploy is stuck" — the system self-heals instead of waiting for human intervention |
Push vs. Pull: Why It Matters
Traditional CI/CD is push-based: your pipeline builds an image, then pushes changes to the cluster (kubectl apply, helm upgrade). GitOps is pull-based: an agent running inside the cluster watches a Git repo and pulls changes.
Why does this distinction matter?
- Security: Push-based means your CI system needs cluster credentials. That's a juicy target. Pull-based means the agent already has cluster access (it lives there) and only needs read access to Git. Much smaller attack surface
- Reliability: If your CI system goes down, push-based deployments stop. Pull-based deployments continue — the agent is independent of CI
- Drift correction: Push-based only applies state at deploy time. Someone does a manual kubectl edit? It sticks. Pull-based continuously reconciles, so manual changes get reverted automatically
ArgoCD vs. Flux: Honest Comparison
These are the two dominant GitOps operators. We've used both extensively. Here's the unfiltered comparison.
| Aspect | ArgoCD | Flux v2 |
|---|---|---|
| UI | Excellent built-in web UI showing sync status, resource tree, diffs | No built-in UI (use Weave GitOps or Capacitor). CLI-first |
| Multi-cluster | Native — register clusters, deploy from central ArgoCD | Possible but each cluster runs its own Flux instance |
| RBAC | Built-in RBAC with SSO integration (Dex, OIDC) | Relies on Kubernetes RBAC. Simpler but less granular for UI access |
| Helm support | First-class — renders Helm in the UI, shows values | First-class — HelmRelease CRD, automatic chart updates |
| Kustomize | Supported, but Helm is the default path | Native — Kustomize is the preferred approach |
| Image automation | Argo CD Image Updater (separate component, works but quirky) | Built-in image automation controllers. Cleaner implementation |
| Learning curve | Moderate: the UI helps, but the ApplicationSet and App of Apps patterns take time | Steeper initially (no UI to explore), but simpler conceptual model |
| Community/adoption | Larger community, more tutorials, more Stack Overflow answers | Strong CNCF backing, growing fast, fewer "how do I..." resources |
Our recommendation: ArgoCD if you need multi-cluster management or a non-technical audience needs visibility into deployments (PMs, managers — the UI is genuinely good). Flux if your team is comfortable with CLI-first workflows and you want tighter Kubernetes-native integration.
For most of our clients, we go with ArgoCD. The UI alone saves hours of "what's deployed where?" questions.
Repository Structure That Scales
Repo structure is the decision you'll live with the longest. We've tried several approaches and this is what we've settled on.
The Approach We Recommend: App-of-Apps with Env Overlays
# Monorepo structure (works up to ~50 services)
gitops-config/
├── apps/                              # Per-application configs
│   ├── user-service/
│   │   ├── base/                      # Shared config
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   ├── hpa.yaml
│   │   │   └── kustomization.yaml
│   │   └── overlays/
│   │       ├── staging/               # Staging overrides
│   │       │   ├── replicas.yaml      # replicas: 1
│   │       │   ├── resources.yaml     # smaller limits
│   │       │   └── kustomization.yaml
│   │       └── production/            # Production overrides
│   │           ├── replicas.yaml      # replicas: 3
│   │           ├── resources.yaml     # production limits
│   │           └── kustomization.yaml
│   ├── order-service/
│   │   └── ... (same structure)
│   └── payment-service/
│       └── ...
├── infrastructure/                    # Cluster-level resources
│   ├── cert-manager/
│   ├── ingress-nginx/
│   ├── monitoring/                    # Prometheus + Grafana
│   └── sealed-secrets/
├── argocd/                            # ArgoCD Application manifests
│   ├── app-of-apps.yaml               # Root Application
│   ├── staging.yaml                   # ApplicationSet for staging
│   └── production.yaml                # ApplicationSet for production
└── README.md
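To make the base/overlay split concrete, here is a minimal sketch of what the two kustomization.yaml files for user-service could look like. The file names follow the tree above. The image name is an assumption about what base/deployment.yaml references, and the tag is a placeholder.

```yaml
# Two separate files, shown in one listing for brevity.

# apps/user-service/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - hpa.yaml
---
# apps/user-service/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base              # inherit everything from the shared base
patches:
  - path: replicas.yaml     # patch setting replicas: 3
  - path: resources.yaml    # patch setting production limits
images:
  - name: user-service      # assumed image name used in base/deployment.yaml
    newTag: sha-abc123      # placeholder tag; a promotion PR changes only this line
```

Keeping the image tag in the overlay means a promotion is a one-line diff, which is easy to review and easy to revert.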
Why Monorepo Over Multi-Repo
We tried having one config repo per service (matching the app source repo). It works for 5 services. At 20, it's a nightmare: bumping a shared Helm chart version takes 20 separate PRs. Monorepo lets you make cross-cutting changes atomically.
The exception: if you have strict team isolation requirements (different teams should only see/modify their own configs), multi-repo with ArgoCD ApplicationSets works. Just know that coordination costs go up.
ArgoCD ApplicationSet Example
# argocd/production.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: production-apps
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/pillai-infotech/gitops-config
        revision: main
        directories:
          - path: apps/*/overlays/production
  template:
    metadata:
      name: '{{path[1]}}-production'
    spec:
      project: production
      source:
        repoURL: https://github.com/pillai-infotech/gitops-config
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path[1]}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
This automatically discovers every app in the apps/*/overlays/production directory and creates an ArgoCD Application for it. Add a new service? Just add a new directory. ArgoCD picks it up.
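The ApplicationSets themselves still need something to deploy them, which is the job of the root Application in argocd/app-of-apps.yaml. That file isn't reproduced in this post, so treat the following as a sketch under assumptions: the Application name and the default project are illustrative, not our exact values.

```yaml
# argocd/app-of-apps.yaml (illustrative sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-of-apps          # assumed name
  namespace: argocd
spec:
  project: default           # assumed project
  source:
    repoURL: https://github.com/pillai-infotech/gitops-config
    targetRevision: main
    path: argocd             # directory holding the ApplicationSet manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

A root Application like this is typically applied by hand once when bootstrapping a cluster. After that, changes to anything under argocd/ arrive through Git like everything else.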
Progressive Delivery with GitOps
GitOps tells you what to deploy. Progressive delivery tells you how to deploy it safely. The combination is powerful.
Canary Deployments with Argo Rollouts
# Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: user-service
spec:
  replicas: 5
  # selector and pod template omitted for brevity; a complete Rollout needs them
  strategy:
    canary:
      steps:
        - setWeight: 10              # Send 10% of traffic to new version
        - pause: {duration: 5m}      # Wait 5 minutes
        - analysis:                  # Run automated analysis
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: user-service
        - setWeight: 30              # If analysis passes, bump to 30%
        - pause: {duration: 5m}
        - setWeight: 60
        - pause: {duration: 5m}
        - setWeight: 100             # Full rollout
The analysis step is where it gets interesting — you can automatically check error rates, latency percentiles, and custom metrics. If the canary fails any check, it automatically rolls back. No human intervention needed at 2 AM.
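The Rollout above references an AnalysisTemplate named success-rate without showing it. Here is a minimal sketch of what such a template could look like with a Prometheus provider. The Prometheus address, the http_requests_total metric and its labels, and the 99% threshold are all assumptions for illustration, so substitute whatever your services actually export.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name               # filled in by the Rollout's analysis step
  metrics:
    - name: success-rate
      interval: 1m                     # take one measurement per minute
      count: 5                         # stop after five measurements
      successCondition: result[0] >= 0.99   # assumed threshold
      failureLimit: 1                  # tolerate one failed measurement
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # assumed address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```

The query returns the share of non-5xx requests over the last two minutes. Once failed measurements exceed failureLimit, the analysis is marked failed and the rollout aborts.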
Blue-Green vs. Canary: When to Use Which
- Canary: When you need gradual traffic shifting and can tolerate two versions running simultaneously. Best for stateless services. This is our default
- Blue-Green: When you need instant cutover (database schema changes, breaking API changes). More resource-intensive (double the pods during transition)
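For comparison with the canary example, a blue-green strategy in Argo Rollouts could be sketched like this. The preview Service name is hypothetical, and disabling auto-promotion is one possible choice rather than a recommendation.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: user-service
spec:
  replicas: 5
  # selector and pod template omitted, as in the canary example
  strategy:
    blueGreen:
      activeService: user-service             # Service receiving live traffic
      previewService: user-service-preview    # hypothetical Service for testing the new version
      autoPromotionEnabled: false             # wait for an explicit promote before cutover
      scaleDownDelaySeconds: 30               # keep the old pods briefly after cutover
```

Both versions run at full size until the old one scales down, which is where the doubled pod count comes from.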
Handling Secrets in GitOps
The biggest challenge with "everything in Git" is secrets. You can't commit plaintext secrets. Here are the three approaches we use, depending on the client's setup.
| Approach | How It Works | Best For | Complexity |
|---|---|---|---|
| Sealed Secrets | Encrypt secrets with a cluster-specific key. Only that cluster can decrypt. Encrypted values live in Git | Simple setups, single cluster | Low |
| External Secrets Operator | Kubernetes operator syncs secrets from AWS Secrets Manager, Vault, GCP Secret Manager into K8s secrets | Multi-cluster, existing secret stores | Medium |
| SOPS + age/KMS | Encrypt specific values in YAML files. Decrypt at apply time. Git stores encrypted files | Teams that want secrets versioned alongside config | Medium |
Our current default: External Secrets Operator + AWS Secrets Manager. The Git repo references the secret by name, and the operator fetches the actual value at runtime. This keeps secret values out of Git entirely and works the same way across environments.
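As an illustration of "references the secret by name", here is a sketch of an ExternalSecret manifest. The store name, the secret path in AWS Secrets Manager, and the key names are hypothetical. The apiVersion depends on the operator version you have installed.

```yaml
apiVersion: external-secrets.io/v1beta1   # newer operator releases may serve v1 instead
kind: ExternalSecret
metadata:
  name: user-service-db
  namespace: user-service
spec:
  refreshInterval: 1h                  # how often to re-read the external value
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager          # hypothetical store configured by the platform team
  target:
    name: user-service-db              # Kubernetes Secret the operator creates
  data:
    - secretKey: DATABASE_PASSWORD     # key inside the Kubernetes Secret
      remoteRef:
        key: production/user-service/db   # hypothetical secret name in AWS Secrets Manager
        property: password                # JSON field within that secret
```

Only this reference lives in Git. Rotating the value in AWS Secrets Manager needs no commit, because the operator picks it up on the next refresh.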
Common Pitfalls (From Experience)
We've made all of these mistakes so you don't have to.
| Pitfall | What Happens | Fix |
|---|---|---|
| Auto-sync without prune protection | Delete a file from Git and ArgoCD deletes the resource from the cluster, databases included | Annotate stateful resources with argocd.argoproj.io/sync-options: Prune=false so they are never pruned automatically |
| Committing image tags in CI | CI updates manifests and commits, and that commit triggers CI again: an infinite loop | Use image automation (Flux image automation controllers or Argo CD Image Updater), or have CI make a separate commit with [skip ci] in the message |
| No environment promotion strategy | Staging and production drift apart. Changes tested in staging aren't the same as what reaches production | Use Kustomize overlays with a base. Promote by moving the image tag from staging overlay to production overlay via PR |
| Treating config repo as "infra" | App developers don't own their configs. Changes require tickets to the "GitOps team" | App teams own their apps/{service}/ directory. Platform team owns infrastructure/ and argocd/ |
| Not testing manifests before merge | Broken YAML or invalid resources get merged → ArgoCD enters degraded state | CI on the config repo: kustomize build, kubeconform (the maintained successor to kubeval), OPA policy checks. Catch errors before merge |
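For the first pitfall, the prune protection is a single annotation on the resource you want to keep. A minimal sketch on a PersistentVolumeClaim follows. The claim name and size are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-db-data                 # hypothetical claim backing a database
  annotations:
    # ArgoCD will not delete this resource when it disappears from Git
    argocd.argoproj.io/sync-options: Prune=false
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
```

If the manifest is later removed from Git, the Application shows as out of sync instead of silently deleting the volume, and someone has to make the deletion decision deliberately.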
Our GitOps Workflow in Practice
Here's the actual deployment flow we use for a platform engineering client running 35 services on EKS:
- Developer pushes code to the app source repo. Triggers CI pipeline (GitHub Actions)
- CI builds container image, runs tests, pushes to ECR with tag sha-abc123
- CI opens a PR in the gitops-config repo, updating the image tag in the staging overlay
- Auto-merge to staging (bot approves staging PRs from CI). ArgoCD syncs within 3 minutes
- Automated tests run against staging (smoke tests, integration tests)
- Developer opens promotion PR — copies image tag to production overlay. This is the human approval gate
- After PR merge, ArgoCD syncs production. Argo Rollouts does a canary: 10% → 30% → 60% → 100% over 15 minutes with automated analysis
- If canary fails, automatic rollback. Slack notification. Developer investigates
Average time from code push to a verified staging deploy: 25 minutes. Production adds however long the developer takes to review and promote, plus the 15-minute canary rollout.
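The "CI opens a PR" step is the part people ask about most, so here is a rough sketch of it as a GitHub Actions job in the app source repo. It is written under several assumptions: the GITOPS_TOKEN secret, the ECR registry path, the branch naming, and the bot identity are all hypothetical, and the kustomize and gh CLIs are assumed to be available on the runner.

```yaml
# .github/workflows/bump-staging.yaml (hypothetical file in the app source repo)
name: bump-staging-image
on:
  push:
    branches: [main]
jobs:
  open-gitops-pr:
    runs-on: ubuntu-latest
    steps:
      # Image build, tests, and the push to ECR would run before this point.
      - name: Check out the config repo
        uses: actions/checkout@v4
        with:
          repository: pillai-infotech/gitops-config
          token: ${{ secrets.GITOPS_TOKEN }}    # hypothetical token with write access
      - name: Update the staging image tag and open a PR
        shell: bash
        env:
          GH_TOKEN: ${{ secrets.GITOPS_TOKEN }}
          IMAGE: 123456789012.dkr.ecr.us-east-1.amazonaws.com/user-service   # placeholder registry path
        run: |
          TAG="sha-${GITHUB_SHA::7}"            # short commit SHA as the image tag
          BRANCH="bump/user-service-${TAG}"
          git checkout -b "$BRANCH"
          # Rewrites the images: entry in the staging overlay's kustomization.yaml
          (cd apps/user-service/overlays/staging && kustomize edit set image "user-service=${IMAGE}:${TAG}")
          git config user.name "ci-bot"
          git config user.email "ci-bot@example.com"
          git commit -am "user-service: staging -> ${TAG}"
          git push origin "$BRANCH"
          gh pr create --base main --title "user-service: staging -> ${TAG}" --body "Automated image bump from CI"
```

Because the commit lands in the config repo and not the app repo, it does not retrigger this workflow, which sidesteps the commit loop from the pitfalls table.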
Frequently Asked Questions
Can GitOps work without Kubernetes?
Technically yes — the principles apply to any declarative system. Terraform with Atlantis is GitOps for cloud infrastructure. But the tooling (ArgoCD, Flux) is K8s-specific. For non-K8s deployments, CI/CD pipelines with Git-triggered deploys give you most of the benefits.
How do you handle database migrations in GitOps?
Database migrations don't fit pure GitOps because they're imperative (run this SQL), not declarative. We run migrations as Kubernetes Jobs triggered by pre-sync hooks in ArgoCD. The migration runs before the new app version deploys. If it fails, the sync fails and the old version keeps running.
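A migration Job wired up as a PreSync hook could look roughly like this. The image and command are placeholders for whatever migration tool you use. The two annotations are what make ArgoCD treat the Job as a hook.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: user-service-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync                            # run before the rest of the sync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation   # remove the previous run's Job first
spec:
  backoffLimit: 0              # do not retry a failed migration automatically
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/user-service:sha-abc123   # placeholder image
          command: ["./migrate", "up"]    # hypothetical migration command
```

Setting backoffLimit to 0 is a deliberate choice in this sketch: a half-applied migration is usually something a person should look at before anything retries it.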
Should we use Helm or Kustomize for GitOps manifests?
Kustomize for your own apps (simpler, more transparent, easier to review in PRs). Helm for third-party charts (cert-manager, ingress-nginx, monitoring stacks). Most GitOps setups use both — Helm for infrastructure components, Kustomize for application manifests.
How do you roll back a failed deployment with GitOps?
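Run git revert on the commit that introduced the bad change. ArgoCD detects the new commit and syncs the previous working state. When the bad change is the latest commit, the Git side is a 30-second operation: git revert HEAD && git push. The cluster follows at the next sync, within about 3 minutes under ArgoCD's default polling, or almost immediately if a Git webhook is configured. No SSH into servers, no remembering which Helm values to change. The rollback is versioned just like the deploy.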