In This Guide
Database migrations are among the riskiest operations in software engineering. A bad migration can take down your application, corrupt data, or create inconsistencies that take weeks to untangle. We've been through enough of these — both smooth and painful — to know what works and what doesn't.
This guide covers every migration scenario: schema changes that don't lock your tables, engine swaps that don't lose data, and rollback plans that actually work when things go wrong. The common thread: never do anything you can't undo.
1. Types of Database Migrations
| Migration Type | Example | Risk Level | Typical Downtime |
|---|---|---|---|
| Schema change | Add column, rename table | Low–Medium | 0 (if done right) |
| Data transformation | Backfill new column, split table | Medium | 0 (batched) |
| Version upgrade | PostgreSQL 15 → 16 | Medium | Minutes (replica promotion) |
| Engine migration | MySQL → PostgreSQL | High | 0 (dual-write), minutes (cutover) |
| Cloud migration | Self-hosted → RDS/Cloud SQL | High | 0 (with DMS/replication) |
2. Schema Migrations — The Expand-Contract Pattern
The expand-contract pattern is the gold standard for zero-downtime schema changes. Instead of modifying in place, you expand (add new), migrate data, then contract (remove old).
Example: Renaming a Column (the safe way)
-- WRONG: This breaks all existing queries instantly
ALTER TABLE users RENAME COLUMN name TO full_name; -- 💥 Downtime
-- RIGHT: Expand-Contract (3 deployments, 0 downtime)
-- Deploy 1: EXPAND — add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Backfill (batched to avoid locks):
UPDATE users SET full_name = name WHERE id BETWEEN 1 AND 10000;
UPDATE users SET full_name = name WHERE id BETWEEN 10001 AND 20000;
-- ...continue in batches
-- Deploy 2: MIGRATE — app writes to both, reads from new
-- Application code:
-- INSERT INTO users (name, full_name, ...) VALUES ($1, $1, ...);
-- SELECT full_name FROM users WHERE ...;
-- Deploy 3: CONTRACT — remove old column (after verifying all reads use new)
ALTER TABLE users DROP COLUMN name;
Dangerous Schema Operations (and Safe Alternatives)
| Operation | Risk | Safe Alternative |
|---|---|---|
| ADD COLUMN with DEFAULT | Table lock (MySQL < 8.0) | ADD COLUMN, then backfill in batches |
| ADD NOT NULL constraint | Full table scan to validate | Add CHECK constraint with NOT VALID, then VALIDATE separately |
| CREATE INDEX | Table lock during build | CREATE INDEX CONCURRENTLY (PostgreSQL) |
| Change column type | Table rewrite | Add new column, backfill, swap reads, drop old |
| DROP COLUMN | Breaks code still reading it | Remove all reads first, then drop in separate deploy |
| RENAME TABLE | Breaks all queries | Create new table, dual-write, migrate reads, drop old |
3. Engine Migrations — Changing Databases Entirely
Switching from MySQL to PostgreSQL, or from PostgreSQL to a NoSQL database, is a major undertaking. Here's the proven four-phase approach.
The Four Phases
Phase 1: PREPARE (2-4 weeks)
├── Audit all queries — identify incompatible syntax
├── Build abstraction layer (Repository pattern)
├── Set up new database with equivalent schema
├── Build data validation scripts (row counts, checksums)
└── Create rollback plan
Phase 2: DUAL-WRITE (1-2 weeks)
├── Write to BOTH old and new databases
├── Read from OLD database only
├── Verify new database has consistent data
└── Fix any sync discrepancies
Phase 3: SHADOW-READ (1-2 weeks)
├── Read from BOTH databases
├── Compare results, log mismatches
├── Primary reads still from old database
└── Fix all mismatches before proceeding
Phase 4: CUTOVER (1 day)
├── Switch reads to new database
├── Keep dual-writes for rollback window (24-48 hours)
├── Monitor error rates, latency, data consistency
└── Disable old database writes after rollback window
4. The Dual-Write Pattern
Dual-write means every write goes to both the old and new database. It's the foundation of zero-downtime engine migrations, but it has subtle pitfalls.
Dual-Write Implementation
class OrderRepository {
constructor(
private oldDb: MySQLConnection,
private newDb: PostgreSQLConnection,
private migrationPhase: 'dual-write' | 'shadow-read' | 'cutover'
) {}
async createOrder(order: Order): Promise<Order> {
// Always write to old (source of truth during migration)
const result = await this.oldDb.insert('orders', order);
// Write to new database — async, don't block on failure
try {
await this.newDb.insert('orders', this.transformForNewDb(order));
} catch (error) {
// Log for reconciliation, DON'T fail the request
logger.error('Dual-write to new DB failed', { orderId: order.id, error });
await this.reconciliationQueue.add({ table: 'orders', id: order.id });
}
return result;
}
async getOrder(id: string): Promise<Order> {
if (this.migrationPhase === 'cutover') {
return this.newDb.query('SELECT * FROM orders WHERE id = $1', [id]);
}
const oldResult = await this.oldDb.query('SELECT * FROM orders WHERE id = ?', [id]);
if (this.migrationPhase === 'shadow-read') {
// Read from new DB too, compare results, log mismatches
const newResult = await this.newDb.query('SELECT * FROM orders WHERE id = $1', [id]);
if (!this.deepEqual(oldResult, newResult)) {
logger.warn('Shadow read mismatch', { id, old: oldResult, new: newResult });
}
}
return oldResult; // Always return from old during migration
}
}
- Ordering issues — if write to old succeeds but new fails, then retries, you can get out-of-order writes. Use idempotent writes.
- Increased latency — writing to two databases doubles write time. Make the secondary write async.
- Transaction consistency — you can't have ACID transactions across two databases. Accept eventual consistency during migration.
5. Shadow Database Strategy
An alternative to dual-write: use Change Data Capture (CDC) to replicate from old to new. This is less invasive — no application code changes for the write path.
Application ──→ Old Database (MySQL)
│
│ CDC (Debezium / DMS)
▼
Shadow Database (PostgreSQL)
│
│ Validation job (compares row counts, checksums)
▼
Dashboard: "99.97% consistent, 12 rows divergent"
When consistency reaches 100% and stays there for 48 hours → safe to cut over
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Dual-Write | Full control, works with any DB | App code changes, consistency risk | Different DB engines |
| CDC (Debezium) | No app changes, real-time sync | Setup complexity, transform limitations | Same schema, different engine |
| AWS DMS | Managed, handles schema conversion | AWS lock-in, limited customization | Cloud migrations |
6. Large-Scale Data Migration
Moving terabytes of data without locking tables or consuming all your database's resources requires careful batching.
Batched Data Backfill
#!/bin/bash
# Batched migration — 1000 rows at a time with throttling
BATCH_SIZE=1000
LAST_ID=0
TOTAL=0
while true; do
# Migrate next batch
COUNT=$(psql -t -A -c "
INSERT INTO new_orders (id, customer_id, amount, status, created_at)
SELECT id, customer_id, amount, status, created_at
FROM old_orders
WHERE id > $LAST_ID
ORDER BY id
LIMIT $BATCH_SIZE
ON CONFLICT (id) DO NOTHING
RETURNING 1
" | wc -l)
# Update cursor
LAST_ID=$(psql -t -A -c "
SELECT COALESCE(MAX(id), $LAST_ID) FROM new_orders WHERE id > $LAST_ID
")
TOTAL=$((TOTAL + COUNT))
echo "Migrated $TOTAL rows (last_id: $LAST_ID)"
# No more rows — done
if [ "$COUNT" -lt "$BATCH_SIZE" ]; then
echo "Migration complete: $TOTAL rows"
break
fi
# Throttle — sleep between batches to avoid overwhelming the database
sleep 0.1
done
| Data Volume | Recommended Approach | Estimated Time |
|---|---|---|
| < 1 GB | pg_dump / mysqldump | Minutes |
| 1–100 GB | Batched INSERT with throttling | Hours |
| 100 GB – 1 TB | Parallel workers + CDC for ongoing changes | Hours to days |
| > 1 TB | Initial bulk load (COPY) + CDC for delta sync | Days |
7. Rollback Strategies
Every migration needs a tested rollback plan. The best rollback is one you never need — but the second-best is one that's been practiced.
| Migration Type | Rollback Strategy | Rollback Time | Data Loss Risk |
|---|---|---|---|
| Add column | Drop column (safe if not read yet) | Seconds | None |
| Engine migration | Keep dual-writes active, switch reads back | Minutes (config change) | None (both DBs have data) |
| Version upgrade | Promote old replica back to primary | Minutes | Writes since cutover |
| Data transformation | Keep old column until verified, revert reads | Seconds (deploy revert) | None (old data preserved) |
8. Migration Checklist
Use this checklist for any non-trivial database migration. Skip items at your own risk.
Pre-Migration
- Document rollback plan — specific commands, not "revert if needed"
- Test migration on staging with production-size data
- Measure migration duration on staging (multiply by 1.5x for production)
- Notify on-call team and stakeholders of migration window
- Take a backup (and verify you can restore from it)
- Set up monitoring dashboards for error rates and latency
During Migration
- Monitor database CPU, connections, replication lag
- Watch application error rates in real-time
- Run data validation queries continuously
- Have rollback command ready to execute
Post-Migration
- Verify row counts match (old vs new)
- Run checksums on critical tables
- Test all critical application flows end-to-end
- Keep old database running for rollback window (minimum 48 hours)
- Schedule cleanup of old resources after rollback window
Frequently Asked Questions
Is zero-downtime migration really possible?
Yes, for most migration types. Schema changes use expand-contract. Engine migrations use dual-write + shadow reads. The tradeoff is increased complexity and migration duration — what could be a 5-minute maintenance window becomes a multi-week gradual rollout. For most production systems, the extra complexity is worth it.
How do I migrate a 500GB database?
Initial bulk load with COPY or pg_dump (hours, can run during low traffic), then CDC (Debezium or AWS DMS) for ongoing changes. Once CDC catches up and consistency is verified, cut over. The initial load is the slow part — CDC keeps it in sync after that.
What's the biggest migration mistake teams make?
Not testing with production-volume data. A migration that takes 2 seconds on a dev database with 1,000 rows can lock your table for 45 minutes on a production table with 100 million rows. Always test on a staging environment with realistic data volumes before touching production.
Should I use an ORM migration tool or raw SQL?
ORM migration tools (Django migrations, Alembic, Rails migrations) are fine for simple schema changes. For complex migrations — data transformations, engine switches, performance-sensitive operations — write raw SQL. ORM-generated SQL often includes unnecessary locks or suboptimal patterns. Always review the generated SQL before running in production.
How do I handle foreign keys during migration?
Migrate parent tables first, then child tables. If using expand-contract, add the new foreign key as NOT VALID first (PostgreSQL), backfill data, then VALIDATE CONSTRAINT separately. In MySQL, temporarily disable foreign key checks during bulk loads (SET FOREIGN_KEY_CHECKS=0), but re-enable and validate immediately after.
Pillai Infotech LLP
We plan and execute database migrations for production systems — from schema changes to full engine swaps. Let's plan your migration.