Disaster Recovery Runbook
Step-by-step procedures for complete system recovery
RTO (Recovery Time Objective): 2 hours
RPO (Recovery Point Objective): 24 hours
Last Tested: [Update after each DR drill]
Quick Reference
Emergency Contacts
| Role | Name | Phone | Email | Slack |
|---|---|---|---|---|
| Primary On-Call | [Your Name] | +1-XXX-XXX-XXXX | oncall@your-domain.com | @oncall |
| Secondary On-Call | [Backup] | +1-XXX-XXX-XXXX | backup@your-domain.com | @backup |
| DevOps Lead | [Lead Name] | +1-XXX-XXX-XXXX | devops@your-domain.com | @devops-lead |
| CTO/Technical Lead | [CTO Name] | +1-XXX-XXX-XXXX | cto@your-domain.com | @cto |
Critical Credentials
Location: 1Password vault "Infrastructure-Production"
- Hetzner Cloud Console: [Share login link]
- GitHub Repository: [Owner access required]
- S3 Backup Credentials: [Access/Secret keys in 1Password]
- Database Passwords: [Stored in 1Password]
- Dokploy Admin: [Credentials in 1Password]
Scenario 1: Complete Server Failure
Symptoms:
- Server unreachable via SSH
- All services showing as down in monitoring
- Hetzner Cloud Console shows server offline/crashed
- DNS resolving but connection timeout
Recovery Time: 2 hours
Step 1: Assess Situation (5 minutes)
Checklist:
- Confirm the server is actually down (not a network issue):

  ```bash
  # From the local machine
  ping app.your-domain.com
  ssh root@app.your-domain.com

  # Check the Hetzner Cloud Console:
  # https://console.hetzner.cloud → Servers → ripplecore-app-prod
  ```

- Identify which server is down:
  - Production App Server (10.0.1.2)
  - Production DB Server (10.0.1.3)
  - CI/CD Server (10.0.1.4)
  - Staging Server (10.0.2.2)
- Check the Hetzner status page for datacenter-wide issues: https://status.hetzner.com
- Notify the team via Slack (#incidents):

  ```
  🚨 INCIDENT: Production [server-name] is down
  Starting DR procedure - ETA 2 hours
  Incident Commander: [Your Name]
  ```
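The triage above can be sketched as a small decision helper (a hypothetical script, not part of the existing tooling; hostnames and probe commands are placeholders to adapt):

```shell
# Sketch of the Step 1 triage decision: distinguish a real server failure
# from a network or SSH-only problem before starting full DR.
classify_outage() {
  local ping_ok=$1 ssh_ok=$2   # "yes"/"no" results of the probes below
  if [ "$ping_ok" = yes ] && [ "$ssh_ok" = yes ]; then
    echo "host reachable - investigate at the service level"
  elif [ "$ping_ok" = yes ]; then
    echo "ssh down - host up, check sshd and firewall rules"
  else
    echo "server down - start DR procedure"
  fi
}

# Real probes (run from your workstation; commented out in this sketch):
# ping -c1 -W2 app.your-domain.com >/dev/null 2>&1 && ping_ok=yes || ping_ok=no
# ssh -o ConnectTimeout=5 root@app.your-domain.com true && ssh_ok=yes || ssh_ok=no

classify_outage no no
```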
Step 2: Provision Replacement Server (15 minutes)
Via Hetzner Cloud Console:
- Create New Server:

  ```
  Navigate to: Cloud → Servers → Create Server
  Location: Falkenstein (fsn1) - same as original
  Image: Ubuntu 24.04 LTS
  Type: [Match original - CPX32 for app, CPX22 for DB]
  SSH Keys: [Select your team's SSH key]
  Name: ripplecore-[service]-prod-new
  ```

- Network Configuration:

  ```
  Private Network: ripplecore-prod-network
  Subnet: 10.0.1.0/24
  IP Assignment: Automatic (or assign original IP if available)
  ```

- Firewall:

  ```
  Apply Firewall: ripplecore-prod-firewall
  ```

- Reassign Floating IP (instant failover):

  ```
  Navigate to: Cloud → Floating IPs → [production-floating-ip]
  Click: Reassign
  Select: ripplecore-[service]-prod-new
  Confirm: Reassign
  # DNS now points to the new server (no propagation delay)
  ```

- Note New Server IP:

  ```bash
  NEW_SERVER_IP="[IP from console]"
  echo $NEW_SERVER_IP > /tmp/new_server_ip.txt
  ```
Expected Time: 5-10 minutes for server provisioning
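If the `hcloud` CLI is installed and authenticated, the console steps above can also be scripted. This sketch only assembles the command string (flag names per the hcloud CLI; resource names are this runbook's), so verify it against your installed version before running anything:

```shell
# Build the hcloud command equivalent to the console steps (sketch only).
build_create_cmd() {
  local name=$1 type=$2
  echo "hcloud server create --name $name --type $type" \
       "--image ubuntu-24.04 --location fsn1" \
       "--network ripplecore-prod-network --firewall ripplecore-prod-firewall"
}
build_create_cmd ripplecore-app-prod-new cpx32

# The floating IP failover would then be:
#   hcloud floating-ip assign [production-floating-ip] ripplecore-app-prod-new
```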
Step 3: Install Base Software (30 minutes)
SSH into the new server:

```bash
NEW_SERVER_IP=$(cat /tmp/new_server_ip.txt)
ssh root@$NEW_SERVER_IP
```

Install Docker:

```bash
# Update system
apt update && apt upgrade -y

# Install Docker
curl -fsSL https://get.docker.com | sh

# Start Docker service
systemctl start docker
systemctl enable docker

# Verify
docker --version
```

Install s5cmd (for S3 backup access):

```bash
curl -L https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz \
  | tar xz -C /usr/local/bin

# Verify
s5cmd version
```

Configure S3 credentials:

```bash
mkdir -p ~/.aws

# Copy from 1Password: "Hetzner S3 Backup Credentials"
cat > ~/.aws/credentials <<EOF
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
EOF

cat > ~/.aws/config <<EOF
[default]
region = eu-central
endpoint_url = https://fsn1.your-objectstorage.com
EOF
```

Install Netdata (monitoring):

```bash
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Answer prompts:
# - Install? Yes
# - Telemetry? No
# - Claim to Netdata Cloud? Optional
```

Expected Time: 15-20 minutes
Step 4A: Restore Database Server (if DB server failed)
Start PostgreSQL Container:

```bash
docker run -d \
  --name ripplecore-postgres \
  --restart unless-stopped \
  -e POSTGRES_USER=ripplecore \
  -e POSTGRES_PASSWORD='COPY_FROM_1PASSWORD' \
  -e POSTGRES_DB=ripplecore \
  -p 5432:5432 \
  -v postgres-data:/var/lib/postgresql/data \
  postgres:18-alpine

# Verify container started
docker ps | grep ripplecore-postgres
```

Restore from Latest Backup:

```bash
# Find latest backup
s5cmd ls s3://ripplecore-backups/postgres/daily/ | tail -5

# Download latest (s5cmd ls prints bare object names, so re-add the bucket prefix for cp)
LATEST_BACKUP=$(s5cmd ls s3://ripplecore-backups/postgres/daily/ | grep '\.dump\.gz' | tail -1 | awk '{print $NF}')
s5cmd cp "s3://ripplecore-backups/postgres/daily/${LATEST_BACKUP}" /tmp/latest_backup.dump.gz

# Decompress
gunzip /tmp/latest_backup.dump.gz

# Restore
docker exec -i ripplecore-postgres pg_restore \
  -U ripplecore \
  -d ripplecore \
  --clean \
  --if-exists \
  --verbose \
  < /tmp/latest_backup.dump

# Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\dt"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM users;"
```

Start Redis Container:

```bash
docker run -d \
  --name ripplecore-redis \
  --restart unless-stopped \
  -p 6379:6379 \
  -v redis-data:/data \
  redis:7-alpine redis-server --appendonly yes

# Verify
docker exec ripplecore-redis redis-cli ping
# Expected: PONG
```

Expected Time: 20-30 minutes
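The count checks above can be extended into a small gate that fails if any critical table came back empty. This is a sketch: the table list and the commented psql invocation are assumptions to adapt to your schema.

```shell
# Read "table count" pairs on stdin and fail if any table is empty.
check_counts() {
  local fail=0 table count
  while read -r table count; do
    if [ "${count:-0}" -eq 0 ]; then
      echo "FAIL: $table is empty after restore"; fail=1
    else
      echo "OK: $table has $count rows"
    fi
  done
  return $fail
}

# Real usage against the restored database (table names are examples):
# for t in users kindness; do
#   c=$(docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -tAc "SELECT COUNT(*) FROM $t")
#   echo "$t $c"
# done | check_counts

printf 'users 42\nkindness 0\n' | check_counts
```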
Step 4B: Restore Application Server (if app server failed)
Install Dokploy:

```bash
curl -sSL https://dokploy.com/install.sh | sh

# Access Dokploy UI
# https://[new-server-ip]:3000
# Create admin account (use same credentials as before from 1Password)
```

Restore Dokploy Configuration (if backed up):

```bash
# Download latest Dokploy backup
s5cmd ls s3://ripplecore-backups/config/
s5cmd cp s3://ripplecore-backups/config/dokploy_config_latest.tar.gz /tmp/

# Extract
tar xzf /tmp/dokploy_config_latest.tar.gz -C /tmp/

# Restore Dokploy database (contains project configurations)
# Note: This assumes a Dokploy backup exists. Otherwise, reconfigure manually via the UI.
docker exec dokploy-db sqlite3 /app/data/dokploy.db ".restore /tmp/dokploy_backup.db"

# Restart Dokploy
docker restart dokploy-app
```

Deploy Applications (via Dokploy UI):

Navigate to: https://[new-server-ip]:3000

- For each application (app, api, web):
  - Create a new application
  - Source: GitHub repository
  - Branch: main
  - Build: Dockerfile (apps/app/Dockerfile)
  - Environment variables: Copy from 1Password "Production Environment Variables"
  - Domain: app.your-domain.com (Traefik will handle SSL)
- Or use the GitHub webhook to trigger a deployment: if the webhook is configured, a push to the main branch auto-deploys; otherwise, trigger the deployment manually in the Dokploy UI.

Verify Applications:

```bash
# Wait for deployments to complete (~5 minutes)

# Check container status
docker ps | grep ripplecore

# Test health endpoints
curl https://app.your-domain.com/api/health
curl https://api.your-domain.com/api/health
curl https://www.your-domain.com/api/health
```

Expected Time: 30-45 minutes
Step 5: Verify System Health (15 minutes)
Database Connectivity:

```bash
# From the app server, check that the DB port is reachable
# (curl speaks HTTP, so use a plain TCP probe instead)
timeout 3 bash -c '</dev/tcp/10.0.1.3/5432' && echo "DB port reachable"

# Or test via the health endpoint
curl https://app.your-domain.com/api/health | jq '.checks.database'
```

Application Functionality:

```bash
# Test authentication
curl -X POST https://app.your-domain.com/api/auth/sign-in \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"testpass"}'

# Test API endpoints
curl https://api.your-domain.com/api/kindness \
  -H "Authorization: Bearer [test-token]"
```

Monitoring:

```bash
# Verify Netdata is running
curl http://localhost:19999/api/v1/info

# Check UptimeRobot shows services as up
# https://uptimerobot.com/dashboard
```

Performance Check:

```bash
# Run basic load test
ab -n 100 -c 10 https://app.your-domain.com/api/health
# Expected: all requests successful, <200ms response time
```

User Acceptance Test:
- Create a test user account
- Log in to the application
- Navigate through the main pages
- Create sample evidence (kindness, volunteer, etc.)
- Verify data appears correctly

Expected Time: 10-15 minutes
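The individual health-endpoint checks above can be looped into one pass/fail summary. A sketch, with the endpoint URLs being this runbook's placeholders:

```shell
# Probe each endpoint and report its HTTP status; non-200 counts as a failure.
smoke_test() {
  local fail=0 url code
  for url in "$@"; do
    code=$(curl -fsS -o /dev/null -w '%{http_code}' "$url" 2>/dev/null) || code=000
    if [ "$code" = 200 ]; then
      echo "PASS $url"
    else
      echo "FAIL $url ($code)"; fail=1
    fi
  done
  return $fail
}

# Real usage:
# smoke_test https://app.your-domain.com/api/health \
#            https://api.your-domain.com/api/health \
#            https://www.your-domain.com/api/health
```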
Step 6: Update Documentation & Notify (10 minutes)
Update Internal Documentation:

```
# Update server inventory (in infrastructure docs)
New Production App Server IP: [IP]
Replaced Date: [Date]
Reason: Complete server failure
```

Notify Stakeholders:

```
✅ INCIDENT RESOLVED

Production system has been fully restored after server failure.

Recovery Details:
• Incident Start: [Time]
• Recovery Complete: [Time]
• Total Downtime: [Duration]
• Data Loss: None (restored from backup)
• Root Cause: [Server failure / Hardware issue / etc.]

All services are now operational and monitoring is green.
```

Post-Incident Tasks:
- Schedule post-mortem meeting (within 24 hours)
- Document root cause analysis
- Update runbook with lessons learned
- Test old server (if accessible) to determine failure reason
- Delete old server from Hetzner Cloud Console (after a 7-day grace period)
Scenario 2: Database Corruption
Symptoms:
- PostgreSQL errors in logs: `corrupted page detected`
- Application errors: `relation does not exist`
- Data integrity check failures
Recovery Time: 1 hour
Step 1: Assess Corruption Severity (10 minutes)
```bash
# SSH into database server
ssh root@10.0.1.3

# Check PostgreSQL logs
docker logs ripplecore-postgres --tail 200 | grep -i corrupt

# Attempt database connection
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT 1"

# Check affected tables
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\dt"

# Attempt to dump database (will fail if severely corrupted)
docker exec ripplecore-postgres pg_dump -U ripplecore -d ripplecore > /tmp/corruption_test.sql
```

Decision Point:
- Minor Corruption (specific table): restore that table from backup → go to Step 2A
- Severe Corruption (multiple tables / entire database): full restore → go to Step 2B
Step 2A: Restore Specific Table (30 minutes)
```bash
# Download latest backup
s5cmd cp s3://ripplecore-backups/postgres/daily/db_latest.dump.gz /tmp/
gunzip /tmp/db_latest.dump.gz

# Restore to temporary database
docker exec ripplecore-postgres psql -U ripplecore -c "CREATE DATABASE ripplecore_temp;"
docker exec -i ripplecore-postgres pg_restore -U ripplecore -d ripplecore_temp < /tmp/db_latest.dump

# Export corrupted table from backup
CORRUPTED_TABLE="kindness"
docker exec ripplecore-postgres pg_dump -U ripplecore -d ripplecore_temp -t $CORRUPTED_TABLE \
  > /tmp/${CORRUPTED_TABLE}_restore.sql

# Stop applications (prevent writes during restore)
docker stop ripplecore-app ripplecore-api ripplecore-web

# Drop and restore corrupted table
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "DROP TABLE IF EXISTS $CORRUPTED_TABLE CASCADE;"
docker exec -i ripplecore-postgres psql -U ripplecore -d ripplecore < /tmp/${CORRUPTED_TABLE}_restore.sql

# Restart applications
docker start ripplecore-app ripplecore-api ripplecore-web

# Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM $CORRUPTED_TABLE;"

# Cleanup
docker exec ripplecore-postgres psql -U ripplecore -c "DROP DATABASE ripplecore_temp;"
```

Step 2B: Full Database Restore (45 minutes)
Follow the "Step 4A: Restore Database Server" procedure above.

Key differences:
- Keep the existing PostgreSQL container running
- Drop and recreate the database instead of replacing the full server

```bash
# Stop applications
docker stop ripplecore-app ripplecore-api ripplecore-web

# Download latest backup
s5cmd cp s3://ripplecore-backups/postgres/daily/db_latest.dump.gz /tmp/
gunzip /tmp/db_latest.dump.gz

# Drop corrupted database
docker exec ripplecore-postgres psql -U ripplecore -c "DROP DATABASE ripplecore;"

# Create fresh database
docker exec ripplecore-postgres psql -U ripplecore -c "CREATE DATABASE ripplecore;"

# Restore from backup
docker exec -i ripplecore-postgres pg_restore \
  -U ripplecore \
  -d ripplecore \
  --clean \
  --if-exists \
  --verbose \
  < /tmp/db_latest.dump

# Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\dt"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM users;"

# Restart applications
docker start ripplecore-app ripplecore-api ripplecore-web

# Verify health
curl https://app.your-domain.com/api/health
```

Step 3: Investigate Root Cause (20 minutes)
Common Causes:
- Disk corruption (check SMART status)
- Out-of-memory killer (OOM) during writes
- Sudden power loss (unlikely in cloud environment)
- PostgreSQL bug (rare)
Investigation:

```bash
# Check disk health
smartctl -a /dev/sda

# Check for OOM events
dmesg | grep -i "out of memory"
grep -i "oom" /var/log/syslog

# Check PostgreSQL settings
docker exec ripplecore-postgres psql -U ripplecore -c "SHOW all;"

# Review recent PostgreSQL logs
docker logs ripplecore-postgres --since 24h | grep -E "ERROR|FATAL|PANIC"
```

Prevention:
- Enable PostgreSQL checksums (detect corruption early)
- Increase `shared_buffers` and `effective_cache_size` (reduce disk I/O)
- Enable `full_page_writes` (prevent corruption after crashes)
- Run regular VACUUM operations (prevent bloat)
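The prevention items above map to configuration along these lines (values are illustrative; size the memory settings to your server). Note that data checksums must be enabled at cluster creation time, e.g. by passing `POSTGRES_INITDB_ARGS="--data-checksums"` to the postgres Docker image before the first start:

```
# postgresql.conf (illustrative values for an 8 GB RAM DB server)
shared_buffers = 2GB
effective_cache_size = 6GB
full_page_writes = on      # default is on; do not disable
autovacuum = on
```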
Scenario 3: Accidental Data Deletion
Symptoms:
- User reports missing data
- Audit logs show DELETE operations
- Table has fewer records than expected
Recovery Time: 30 minutes
Procedure
See "Selective Table Restore" in BACKUP_RECOVERY.md
Key Steps:
- Identify deletion timestamp from user
- Find backup before deletion
- Restore to temporary database
- Extract deleted records (compare with production)
- Import recovered records back to production
- Verify with user
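Step 4 above (extract deleted records) can be sketched by comparing sorted id lists exported from production and from the temporary restore. Table and column names here are examples, and the psql export is shown only as a comment:

```shell
# Ids present in the backup restore but missing from production are the
# candidates for recovery (comm -13 prints lines unique to the second file).
missing_ids() {  # $1 = production id list, $2 = backup id list
  sort "$1" > /tmp/_prod_sorted
  sort "$2" > /tmp/_backup_sorted
  comm -13 /tmp/_prod_sorted /tmp/_backup_sorted
}

# Real export (example table/column):
# docker exec ripplecore-postgres psql -U ripplecore -d ripplecore \
#   -tAc "SELECT id FROM kindness" > /tmp/prod_ids
# docker exec ripplecore-postgres psql -U ripplecore -d ripplecore_temp \
#   -tAc "SELECT id FROM kindness" > /tmp/backup_ids

printf '1\n2\n' > /tmp/prod_ids
printf '1\n2\n3\n' > /tmp/backup_ids
missing_ids /tmp/prod_ids /tmp/backup_ids
```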
Post-Recovery Checklist
After any disaster recovery:
- All services are operational
- Monitoring shows green status
- Users can access application
- Data integrity verified (sample checks)
- Backup process tested after recovery
- Team notified of resolution
- Incident documentation created
- Post-mortem scheduled (within 24h)
- Runbook updated with lessons learned
- Infrastructure documentation updated
- Old/failed resources cleaned up (after grace period)
Testing & Drills
Quarterly DR Drill Schedule:
- Q1: Database server failure simulation
- Q2: Application server failure simulation
- Q3: Complete datacenter failure simulation (switch to backup region if multi-region)
- Q4: Ransomware attack simulation
Drill Checklist:
- Schedule 2-hour maintenance window
- Notify team of drill in advance
- Execute DR procedure
- Measure actual RTO vs. target
- Document issues encountered
- Update runbook with improvements
- Share lessons learned with team
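Measuring actual RTO against the 2-hour target (per the drill checklist above) can be done directly from the incident timestamps. A sketch assuming GNU `date`, as on the Ubuntu servers in this runbook:

```shell
# Minutes elapsed between incident start and recovery complete (GNU date).
rto_minutes() {
  local start end
  start=$(date -d "$1" +%s)
  end=$(date -d "$2" +%s)
  echo $(( (end - start) / 60 ))
}

actual=$(rto_minutes "2025-01-01 10:00" "2025-01-01 11:45")
if [ "$actual" -le 120 ]; then
  echo "RTO met: ${actual}m <= 120m target"
else
  echo "RTO missed: ${actual}m > 120m target"
fi
```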
Last Updated: [Date]
Last Tested: [Date]
Next DR Drill: [Date]
Document Owner: [Your Name]