RippleCore

Disaster Recovery Runbook

Step-by-step procedures for complete system recovery

RTO (Recovery Time Objective): 2 hours
RPO (Recovery Point Objective): 24 hours
Last Tested: [Update after each DR drill]

Quick Reference

Emergency Contacts

Role | Name | Phone | Email | Slack
Primary On-Call | [Your Name] | +1-XXX-XXX-XXXX | oncall@your-domain.com | @oncall
Secondary On-Call | [Backup] | +1-XXX-XXX-XXXX | backup@your-domain.com | @backup
DevOps Lead | [Lead Name] | +1-XXX-XXX-XXXX | devops@your-domain.com | @devops-lead
CTO/Technical Lead | [CTO Name] | +1-XXX-XXX-XXXX | cto@your-domain.com | @cto

Critical Credentials

Location: 1Password vault "Infrastructure-Production"

  • Hetzner Cloud Console: [Share login link]
  • GitHub Repository: [Owner access required]
  • S3 Backup Credentials: [Access/Secret keys in 1Password]
  • Database Passwords: [Stored in 1Password]
  • Dokploy Admin: [Credentials in 1Password]

Scenario 1: Complete Server Failure

Symptoms:

  • Server unreachable via SSH
  • All services showing as down in monitoring
  • Hetzner Cloud Console shows server offline/crashed
  • DNS resolving but connection timeout

Recovery Time: 2 hours


Step 1: Assess Situation (5 minutes)

Checklist:

  • Confirm server is actually down (not network issue)

    # From local machine
    ping app.your-domain.com
    ssh root@app.your-domain.com
    
    # Check Hetzner Cloud Console
    https://console.hetzner.cloud → Servers → ripplecore-app-prod
  • Identify which server is down:

    • Production App Server (10.0.1.2)
    • Production DB Server (10.0.1.3)
    • CI/CD Server (10.0.1.4)
    • Staging Server (10.0.2.2)
  • Check Hetzner status page for datacenter issues

    https://status.hetzner.com
  • Notify team via Slack #incidents

    🚨 INCIDENT: Production [server-name] is down
    Starting DR procedure - ETA 2 hours
    Incident Commander: [Your Name]
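
If another host on the private network (for example the CI/CD server) is still reachable, a quick loop over the private IPs listed above helps narrow down which machine is affected. A minimal sketch; adjust the IP list if the network layout has changed, and note that the staging subnet may not be routable from production:

# Run from any surviving host on the private network
for ip in 10.0.1.2 10.0.1.3 10.0.1.4 10.0.2.2; do
  if ping -c 1 -W 2 "$ip" > /dev/null 2>&1; then
    echo "$ip reachable"
  else
    echo "$ip DOWN"
  fi
done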

Step 2: Provision Replacement Server (15 minutes)

Via Hetzner Cloud Console:

  1. Create New Server

    Navigate to: Cloud → Servers → Create Server
    
    Location: Falkenstein (fsn1) - same as original
    Image: Ubuntu 24.04 LTS
    Type: [Match original - CPX32 for app, CPX22 for DB]
    SSH Keys: [Select your team's SSH key]
    Name: ripplecore-[service]-prod-new
  2. Network Configuration

    Private Network: ripplecore-prod-network
    Subnet: 10.0.1.0/24
    IP Assignment: Automatic (or assign original IP if available)
  3. Firewall

    Apply Firewall: ripplecore-prod-firewall
  4. Reassign Floating IP (instant failover)

    Navigate to: Cloud → Floating IPs → [production-floating-ip]
    Click: Reassign
    Select: ripplecore-[service]-prod-new
    Confirm: Reassign
    # Traffic reaches the new server immediately (DNS still points at the floating IP, so there is no propagation delay)
  5. Note New Server IP

    NEW_SERVER_IP="[IP from console]"
    echo $NEW_SERVER_IP > /tmp/new_server_ip.txt

Expected Time: 5-10 minutes for server provisioning
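
If the hcloud CLI is installed and authenticated against the project, the same provisioning can be scripted. A minimal sketch only: the server type, SSH key name, and floating IP name below are placeholders and must be adjusted to match the failed server and the console values above.

# Create the replacement server in the same location, attached to the
# private network and firewall. Adjust --type to the original server type
# and --ssh-key to the team's key name.
hcloud server create \
  --name ripplecore-app-prod-new \
  --type cx32 \
  --image ubuntu-24.04 \
  --location fsn1 \
  --ssh-key "team-ssh-key" \
  --network ripplecore-prod-network \
  --firewall ripplecore-prod-firewall

# Reassign the production floating IP to the new server (instant failover)
hcloud floating-ip assign production-floating-ip ripplecore-app-prod-new

# Record the new server's public IP for the following steps
hcloud server ip ripplecore-app-prod-new > /tmp/new_server_ip.txt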


Step 3: Install Base Software (30 minutes)

SSH into new server:

NEW_SERVER_IP=$(cat /tmp/new_server_ip.txt)
ssh root@$NEW_SERVER_IP

Install Docker:

# Update system
apt update && apt upgrade -y

# Install Docker
curl -fsSL https://get.docker.com | sh

# Start Docker service
systemctl start docker
systemctl enable docker

# Verify
docker --version

Install s5cmd (for S3 backup access):

curl -L https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz \
  | tar xz -C /usr/local/bin

# Verify
s5cmd version

Configure S3 credentials:

mkdir -p ~/.aws

# Copy from 1Password: "Hetzner S3 Backup Credentials"
cat > ~/.aws/credentials <<EOF
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
EOF

cat > ~/.aws/config <<EOF
[default]
region = eu-central
endpoint_url = https://fsn1.your-objectstorage.com
EOF
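
A quick way to confirm the credentials work before they are needed in Step 4 (s5cmd may not pick up endpoint_url from ~/.aws/config, so the endpoint is passed explicitly here; the bucket name matches the one used later in this runbook):

# List the backup bucket to verify S3 access
s5cmd --endpoint-url https://fsn1.your-objectstorage.com ls s3://ripplecore-backups/ | head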

Install Netdata (monitoring):

bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Answer prompts:
# - Install? Yes
# - Telemetry? No
# - Claim to Netdata Cloud? Optional

Expected Time: 15-20 minutes


Step 4A: Restore Database Server (if DB server failed)

Start PostgreSQL Container:

docker run -d \
  --name ripplecore-postgres \
  --restart unless-stopped \
  -e POSTGRES_USER=ripplecore \
  -e POSTGRES_PASSWORD='COPY_FROM_1PASSWORD' \
  -e POSTGRES_DB=ripplecore \
  -p 5432:5432 \
  -v postgres-data:/var/lib/postgresql/data \
  postgres:18-alpine

# Verify container started
docker ps | grep ripplecore-postgres

Restore from Latest Backup:

# Find latest backup
s5cmd ls s3://ripplecore-backups/postgres/daily/ | tail -5

# Download latest
# (s5cmd ls prints object names without the bucket prefix, so the prefix is re-added for the copy)
LATEST_BACKUP=$(s5cmd ls s3://ripplecore-backups/postgres/daily/ | grep '\.dump\.gz' | tail -1 | awk '{print $NF}')
s5cmd cp "s3://ripplecore-backups/postgres/daily/${LATEST_BACKUP}" /tmp/latest_backup.dump.gz

# Decompress
gunzip /tmp/latest_backup.dump.gz
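
# Optional sanity check before restoring (assumes a custom-format dump, as produced by pg_dump -Fc):
# listing the archive's table of contents confirms the file is readable
docker exec -i ripplecore-postgres pg_restore --list < /tmp/latest_backup.dump | head -20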

# Restore
docker exec -i ripplecore-postgres pg_restore \
  -U ripplecore \
  -d ripplecore \
  --clean \
  --if-exists \
  --verbose \
  < /tmp/latest_backup.dump

# Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\dt"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM users;"

Start Redis Container:

docker run -d \
  --name ripplecore-redis \
  --restart unless-stopped \
  -p 6379:6379 \
  -v redis-data:/data \
  redis:7-alpine redis-server --appendonly yes

# Verify
docker exec ripplecore-redis redis-cli ping
# Expected: PONG

Expected Time: 20-30 minutes


Step 4B: Restore Application Server (if app server failed)

Install Dokploy:

curl -sSL https://dokploy.com/install.sh | sh

# Access Dokploy UI
# https://[new-server-ip]:3000

# Create admin account (use same credentials as before from 1Password)

Restore Dokploy Configuration (if backed up):

# Download latest Dokploy backup
s5cmd ls s3://ripplecore-backups/config/
s5cmd cp s3://ripplecore-backups/config/dokploy_config_latest.tar.gz /tmp/

# Extract
tar xzf /tmp/dokploy_config_latest.tar.gz -C /tmp/

# Restore Dokploy database (contains project configurations)
# Note: This assumes Dokploy backup exists. Otherwise, reconfigure manually via UI.
docker exec dokploy-db sqlite3 /app/data/dokploy.db ".restore /tmp/dokploy_backup.db"

# Restart Dokploy
docker restart dokploy-app

Deploy Applications (via Dokploy UI):

Navigate to: https://[new-server-ip]:3000

  1. For each application (app, api, web):

    • Create new application
    • Source: GitHub repository
    • Branch: main
    • Build: Dockerfile (apps/app/Dockerfile)
    • Environment variables: Copy from 1Password "Production Environment Variables"
    • Domain: app.your-domain.com (Traefik will handle SSL)
  2. Or use GitHub webhook to trigger deployment:

    # If webhook configured, push to main branch will auto-deploy
    # Otherwise, manually trigger deployment in Dokploy UI

Verify Applications:

# Wait for deployments to complete (~5 minutes)

# Check container status
docker ps | grep ripplecore

# Test health endpoints
curl https://app.your-domain.com/api/health
curl https://api.your-domain.com/api/health
curl https://www.your-domain.com/api/health
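
If the deployments are still rolling out, a small polling loop saves re-running the checks by hand. A sketch only: hostnames as above, assuming each health endpoint returns a success status once the service is up.

# Poll each health endpoint until it responds successfully
for host in app api www; do
  until curl -sf --max-time 5 "https://${host}.your-domain.com/api/health" > /dev/null; do
    echo "waiting for ${host}..."
    sleep 10
  done
  echo "${host} is healthy"
done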

Expected Time: 30-45 minutes


Step 5: Verify System Health (15 minutes)

Database Connectivity:

# From the app server, check that the DB port is reachable
# (curl will report "Empty reply from server" on success, since PostgreSQL does not speak HTTP;
#  "Connection refused" or a timeout points to a network/firewall problem)
docker exec ripplecore-app curl http://10.0.1.3:5432
# Or test via health endpoint
curl https://app.your-domain.com/api/health | jq '.checks.database'

Application Functionality:

# Test authentication
curl -X POST https://app.your-domain.com/api/auth/sign-in \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"testpass"}'

# Test API endpoints
curl https://api.your-domain.com/api/kindness \
  -H "Authorization: Bearer [test-token]"

Monitoring:

# Verify Netdata is running
curl http://localhost:19999/api/v1/info

# Check UptimeRobot shows services as up
# https://uptimerobot.com/dashboard

Performance Check:

# Run basic load test
ab -n 100 -c 10 https://app.your-domain.com/api/health

# Expected: All requests successful, <200ms response time

User Acceptance Test:

  • Create test user account
  • Log in to application
  • Navigate through main pages
  • Create sample evidence (kindness, volunteer, etc.)
  • Verify data appears correctly

Expected Time: 10-15 minutes


Step 6: Update Documentation & Notify (10 minutes)

Update Internal Documentation:

# Update server inventory (in infrastructure docs)
New Production App Server IP: [IP]
Replaced Date: [Date]
Reason: Complete server failure

Notify Stakeholders:

✅ INCIDENT RESOLVED

Production system has been fully restored after server failure.

Recovery Details:
  • Incident Start: [Time]
  • Recovery Complete: [Time]
  • Total Downtime: [Duration]
  • Data Loss: None (restored from backup)
  • Root Cause: [Server failure / Hardware issue / etc.]

All services are now operational and monitoring is green.

Post-Incident Tasks:

  • Schedule post-mortem meeting (within 24 hours)
  • Document root cause analysis
  • Update runbook with lessons learned
  • Test old server (if accessible) to determine failure reason
  • Delete old server from Hetzner Cloud Console (after 7 days grace period)

Scenario 2: Database Corruption

Symptoms:

  • PostgreSQL errors in logs: corrupted page detected
  • Application errors: relation does not exist
  • Data integrity check failures

Recovery Time: 1 hour


Step 1: Assess Corruption Severity (10 minutes)

# SSH into database server
ssh root@10.0.1.3

# Check PostgreSQL logs
docker logs ripplecore-postgres --tail 200 | grep -i corrupt

# Attempt database connection
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT 1"

# Check affected tables
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\dt"

# Attempt to dump database (will fail if severely corrupted)
docker exec ripplecore-postgres pg_dump -U ripplecore -d ripplecore > /tmp/corruption_test.sql
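
If the basic checks above are inconclusive, PostgreSQL's bundled amcheck tooling (available in PostgreSQL 14+) can scan for corruption more systematically. A sketch; the extension ships with the server but must be created in the database first:

# Install the amcheck extension and scan the database for corrupted relations
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "CREATE EXTENSION IF NOT EXISTS amcheck;"
docker exec ripplecore-postgres pg_amcheck -U ripplecore -d ripplecore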

Decision Point:

  • Minor Corruption (specific table): Restore that table from backup → Go to Step 2A
  • Severe Corruption (multiple tables/entire database): Full restore → Go to Step 2B

Step 2A: Restore Specific Table (30 minutes)

# Download latest backup
s5cmd cp s3://ripplecore-backups/postgres/daily/db_latest.dump.gz /tmp/
gunzip /tmp/db_latest.dump.gz

# Restore to temporary database
docker exec ripplecore-postgres psql -U ripplecore -c "CREATE DATABASE ripplecore_temp;"
docker exec -i ripplecore-postgres pg_restore -U ripplecore -d ripplecore_temp < /tmp/db_latest.dump

# Export corrupted table from backup
CORRUPTED_TABLE="kindness"
docker exec ripplecore-postgres pg_dump -U ripplecore -d ripplecore_temp -t $CORRUPTED_TABLE \
  > /tmp/${CORRUPTED_TABLE}_restore.sql

# Stop applications (prevent writes during restore)
docker stop ripplecore-app ripplecore-api ripplecore-web

# Drop and restore corrupted table
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "DROP TABLE IF EXISTS $CORRUPTED_TABLE CASCADE;"
docker exec -i ripplecore-postgres psql -U ripplecore -d ripplecore < /tmp/${CORRUPTED_TABLE}_restore.sql

# Restart applications
docker start ripplecore-app ripplecore-api ripplecore-web

# Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM $CORRUPTED_TABLE;"

# Cleanup
docker exec ripplecore-postgres psql -U ripplecore -c "DROP DATABASE ripplecore_temp;"

Step 2B: Full Database Restore (45 minutes)

Follow "Step 4A: Restore Database Server" procedure above

Key differences:

  • Keep existing PostgreSQL container running
  • Drop and recreate database instead of full server replacement

# Stop applications
docker stop ripplecore-app ripplecore-api ripplecore-web

# Download latest backup
s5cmd cp s3://ripplecore-backups/postgres/daily/db_latest.dump.gz /tmp/
gunzip /tmp/db_latest.dump.gz

# Drop corrupted database (connect to the postgres maintenance database,
# since a database cannot be dropped while it is the one you are connected to)
docker exec ripplecore-postgres psql -U ripplecore -d postgres -c "DROP DATABASE ripplecore;"

# Create fresh database
docker exec ripplecore-postgres psql -U ripplecore -d postgres -c "CREATE DATABASE ripplecore;"

# Restore from backup
docker exec -i ripplecore-postgres pg_restore \
  -U ripplecore \
  -d ripplecore \
  --clean \
  --if-exists \
  --verbose \
  < /tmp/db_latest.dump

# Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\dt"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM users;"

# Restart applications
docker start ripplecore-app ripplecore-api ripplecore-web

# Verify health
curl https://app.your-domain.com/api/health

Step 3: Investigate Root Cause (20 minutes)

Common Causes:

  • Disk corruption (check SMART status)
  • Out-of-memory killer (OOM) during writes
  • Sudden power loss (unlikely in cloud environment)
  • PostgreSQL bug (rare)

Investigation:

# Check disk health
smartctl -a /dev/sda

# Check for OOM events
dmesg | grep -i "out of memory"
grep -i "oom" /var/log/syslog

# Check PostgreSQL settings
docker exec ripplecore-postgres psql -U ripplecore -c "SHOW all;"

# Review recent PostgreSQL logs
docker logs ripplecore-postgres --since 24h | grep -E "ERROR|FATAL|PANIC"

Prevention:

  • Enable PostgreSQL checksums (detect corruption early)
  • Increase shared_buffers and effective_cache_size (reduce disk I/O)
  • Enable full_page_writes (prevent corruption after crashes)
  • Regular VACUUM operations (prevent bloat)
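
To see where the current instance stands on these settings, a read-only check like the one below can help (note that data checksums can only be enabled at initdb time or via pg_checksums with the server stopped):

# Inspect the settings mentioned above on the running instance
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SHOW data_checksums;"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SHOW shared_buffers;"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SHOW effective_cache_size;"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SHOW full_page_writes;"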

Scenario 3: Accidental Data Deletion

Symptoms:

  • User reports missing data
  • Audit logs show DELETE operations
  • Table has fewer records than expected

Recovery Time: 30 minutes


Procedure

See "Selective Table Restore" in BACKUP_RECOVERY.md

Key Steps:

  1. Identify deletion timestamp from user
  2. Find backup before deletion
  3. Restore to temporary database
  4. Extract deleted records (compare with production)
  5. Import recovered records back to production
  6. Verify with user
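
As a rough sketch of steps 4-5, assuming the backup has been restored to a temporary database named ripplecore_temp and the affected table is kindness (both names are illustrative; BACKUP_RECOVERY.md remains the authoritative procedure):

# Export the affected table from the temporary restore as plain INSERT statements
docker exec ripplecore-postgres pg_dump -U ripplecore -d ripplecore_temp \
  -t kindness --data-only --column-inserts > /tmp/kindness_recovered.sql

# Re-import into production: rows that still exist fail on their primary key and are
# skipped (psql continues past errors by default), so only the deleted rows come back
docker exec -i ripplecore-postgres psql -U ripplecore -d ripplecore < /tmp/kindness_recovered.sql

# Verify the record count with the user afterwards
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM kindness;"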

Post-Recovery Checklist

After any disaster recovery:

  • All services are operational
  • Monitoring shows green status
  • Users can access application
  • Data integrity verified (sample checks)
  • Backup process tested after recovery (see the round-trip sketch below)
  • Team notified of resolution
  • Incident documentation created
  • Post-mortem scheduled (within 24h)
  • Runbook updated with lessons learned
  • Infrastructure documentation updated
  • Old/failed resources cleaned up (after grace period)
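
For the "backup process tested after recovery" item, a minimal manual round trip confirms the pipeline works end to end (the manual/ prefix below is illustrative; adjust to the real backup layout and the existing backup scripts):

# Take an ad-hoc backup, push it to S3, then list it back
docker exec ripplecore-postgres pg_dump -U ripplecore -d ripplecore -Fc | gzip > /tmp/post_recovery_check.dump.gz
s5cmd cp /tmp/post_recovery_check.dump.gz s3://ripplecore-backups/postgres/manual/post_recovery_check.dump.gz
s5cmd ls s3://ripplecore-backups/postgres/manual/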

Testing & Drills

Quarterly DR Drill Schedule:

  • Q1: Database server failure simulation
  • Q2: Application server failure simulation
  • Q3: Complete datacenter failure simulation (switch to backup region if multi-region)
  • Q4: Ransomware attack simulation

Drill Checklist:

  • Schedule 2-hour maintenance window
  • Notify team of drill in advance
  • Execute DR procedure
  • Measure actual RTO vs. target
  • Document issues encountered
  • Update runbook with improvements
  • Share lessons learned with team

Last Updated: [Date]
Last Tested: [Date]
Next DR Drill: [Date]
Document Owner: [Your Name]