RippleCore

Disaster Recovery Runbook

Step-by-step procedures for complete system recovery

RTO (Recovery Time Objective): 2 hours
RPO (Recovery Point Objective): 24 hours
Last Tested: [Update after each DR drill]

Quick Reference

Emergency Contacts

Role | Name | Phone | Email | Slack
Primary On-Call | [Your Name] | +1-XXX-XXX-XXXX | oncall@your-domain.com | @oncall
Secondary On-Call | [Backup] | +1-XXX-XXX-XXXX | backup@your-domain.com | @backup
DevOps Lead | [Lead Name] | +1-XXX-XXX-XXXX | devops@your-domain.com | @devops-lead
CTO/Technical Lead | [CTO Name] | +1-XXX-XXX-XXXX | cto@your-domain.com | @cto

Critical Credentials

Location: 1Password vault "Infrastructure-Production"

  • Hetzner Cloud Console: [Share login link]
  • GitHub Repository: [Owner access required]
  • S3 Backup Credentials: [Access/Secret keys in 1Password]
  • Database Passwords: [Stored in 1Password]
  • Dokploy Admin: [Credentials in 1Password]

Scenario 1: Complete Server Failure

Symptoms:

  • Server unreachable via SSH
  • All services showing as down in monitoring
  • Hetzner Cloud Console shows server offline/crashed
  • DNS resolving but connection timeout

Recovery Time: 2 hours


Step 1: Assess Situation (5 minutes)

Checklist:

  • Confirm server is actually down (not network issue)

    # From local machine
    ping app.your-domain.com
    ssh root@app.your-domain.com
    
    # Check Hetzner Cloud Console
    https://console.hetzner.cloud → Servers → ripplecore-app-prod
  • Identify which server is down:

    • Production App Server (10.0.1.2)
    • Production DB Server (10.0.1.3)
    • CI/CD Server (10.0.1.4)
    • Staging Server (10.0.2.2)
  • Check Hetzner status page for datacenter issues

    https://status.hetzner.com
  • Notify team via Slack #incidents

    🚨 INCIDENT: Production [server-name] is down
    Starting DR procedure - ETA 2 hours
    Incident Commander: [Your Name]
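
If another host on the private network (for example the CI/CD server) is still reachable, a quick loop over the private IPs listed above helps narrow down which machine is affected. A minimal sketch; adjust the IP list if the network layout has changed, and note that the staging subnet may not be routable from production:

# Run from any surviving host on the private network
for ip in 10.0.1.2 10.0.1.3 10.0.1.4 10.0.2.2; do
  if ping -c 1 -W 2 "$ip" > /dev/null 2>&1; then
    echo "$ip reachable"
  else
    echo "$ip DOWN"
  fi
done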

Step 2: Provision Replacement Server (15 minutes)

Via Hetzner Cloud Console:

  1. Create New Server

    Navigate to: Cloud → Servers → Create Server
    
    Location: Falkenstein (fsn1) - same as original
    Image: Ubuntu 24.04 LTS
    Type: [Match original - CPX32 for app, CPX22 for DB]
    SSH Keys: [Select your team's SSH key]
    Name: ripplecore-[service]-prod-new
  2. Network Configuration

    Private Network: ripplecore-prod-network
    Subnet: 10.0.1.0/24
    IP Assignment: Automatic (or assign original IP if available)
  3. Firewall

    Apply Firewall: ripplecore-prod-firewall
  4. Reassign Floating IP (instant failover)

    Navigate to: Cloud → Floating IPs → [production-floating-ip]
    Click: Reassign
    Select: ripplecore-[service]-prod-new
    Confirm: Reassign
    # Traffic reaches the new server immediately (DNS still points at the floating IP, so there is no propagation delay)
  5. Note New Server IP

    NEW_SERVER_IP="[IP from console]"
    echo $NEW_SERVER_IP > /tmp/new_server_ip.txt

Expected Time: 5-10 minutes for server provisioning
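
If the hcloud CLI is installed and authenticated against the project, the same provisioning can be scripted. A minimal sketch only: the server type, SSH key name, and floating IP name below are placeholders and must be adjusted to match the failed server and the console values above.

# Create the replacement server in the same location, attached to the
# private network and firewall. Adjust --type to the original server type
# and --ssh-key to the team's key name.
hcloud server create \
  --name ripplecore-app-prod-new \
  --type cx32 \
  --image ubuntu-24.04 \
  --location fsn1 \
  --ssh-key "team-ssh-key" \
  --network ripplecore-prod-network \
  --firewall ripplecore-prod-firewall

# Reassign the production floating IP to the new server (instant failover)
hcloud floating-ip assign production-floating-ip ripplecore-app-prod-new

# Record the new server's public IP for the following steps
hcloud server ip ripplecore-app-prod-new > /tmp/new_server_ip.txt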


Step 3: Install Base Software (30 minutes)

SSH into new server:

NEW_SERVER_IP=$(cat /tmp/new_server_ip.txt)
ssh root@$NEW_SERVER_IP

Install Docker:

# Update system
apt update && apt upgrade -y

# Install Docker
curl -fsSL https://get.docker.com | sh

# Start Docker service
systemctl start docker
systemctl enable docker

# Verify
docker --version

Install s5cmd (for S3 backup access):

curl -L https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz \
  | tar xz -C /usr/local/bin

# Verify
s5cmd version

Configure S3 credentials:

mkdir -p ~/.aws

# Copy from 1Password: "Hetzner S3 Backup Credentials"
cat > ~/.aws/credentials <<EOF
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
EOF

cat > ~/.aws/config <<EOF
[default]
region = eu-central
endpoint_url = https://fsn1.your-objectstorage.com
EOF
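
A quick way to confirm the credentials work before they are needed in Step 4 (s5cmd may not pick up endpoint_url from ~/.aws/config, so the endpoint is passed explicitly here; the bucket name matches the one used later in this runbook):

# List the backup bucket to verify S3 access
s5cmd --endpoint-url https://fsn1.your-objectstorage.com ls s3://ripplecore-backups/ | head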

Install Netdata (monitoring):

bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Answer prompts:
# - Install? Yes
# - Telemetry? No
# - Claim to Netdata Cloud? Optional

Expected Time: 15-20 minutes


Step 4A: Restore Database Server (if DB server failed)

Start PostgreSQL Container:

docker run -d \
  --name ripplecore-postgres \
  --restart unless-stopped \
  -e POSTGRES_USER=ripplecore \
  -e POSTGRES_PASSWORD='COPY_FROM_1PASSWORD' \
  -e POSTGRES_DB=ripplecore \
  -p 5432:5432 \
  -v postgres-data:/var/lib/postgresql/data \
  postgres:18-alpine

# Verify container started
docker ps | grep ripplecore-postgres

Restore from Latest Backup:

# Find latest backup
s5cmd ls s3://ripplecore-backups/postgres/daily/ | tail -5

# Download latest
# (s5cmd ls prints object names without the bucket prefix, so the prefix is re-added for the copy)
LATEST_BACKUP=$(s5cmd ls s3://ripplecore-backups/postgres/daily/ | grep '\.dump\.gz' | tail -1 | awk '{print $NF}')
s5cmd cp "s3://ripplecore-backups/postgres/daily/${LATEST_BACKUP}" /tmp/latest_backup.dump.gz

# Decompress
gunzip /tmp/latest_backup.dump.gz
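
# Optional sanity check before restoring (assumes a custom-format dump, as produced by pg_dump -Fc):
# listing the archive's table of contents confirms the file is readable
docker exec -i ripplecore-postgres pg_restore --list < /tmp/latest_backup.dump | head -20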

# Restore
docker exec -i ripplecore-postgres pg_restore \
  -U ripplecore \
  -d ripplecore \
  --clean \
  --if-exists \
  --verbose \
  < /tmp/latest_backup.dump

# Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\dt"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM users;"

Start Redis Container:

docker run -d \
  --name ripplecore-redis \
  --restart unless-stopped \
  -p 6379:6379 \
  -v redis-data:/data \
  redis:7-alpine redis-server --appendonly yes

# Verify
docker exec ripplecore-redis redis-cli ping
# Expected: PONG

Expected Time: 20-30 minutes


Step 4B: Restore Application Server (if app server failed)

Install Dokploy:

curl -sSL https://dokploy.com/install.sh | sh

# Access Dokploy UI
# https://[new-server-ip]:3000

# Create admin account (use same credentials as before from 1Password)

Restore Dokploy Configuration (if backed up):

# Download latest Dokploy backup
s5cmd ls s3://ripplecore-backups/config/
s5cmd cp s3://ripplecore-backups/config/dokploy_config_latest.tar.gz /tmp/

# Extract
tar xzf /tmp/dokploy_config_latest.tar.gz -C /tmp/

# Restore Dokploy database (contains project configurations)
# Note: This assumes Dokploy backup exists. Otherwise, reconfigure manually via UI.
docker exec dokploy-db sqlite3 /app/data/dokploy.db ".restore /tmp/dokploy_backup.db"

# Restart Dokploy
docker restart dokploy-app

Deploy Applications (via Dokploy UI):

Navigate to: https://[new-server-ip]:3000

  1. For each application (app, api, web):

    • Create new application
    • Source: GitHub repository
    • Branch: main
    • Build: Dockerfile (apps/app/Dockerfile)
    • Environment variables: Copy from 1Password "Production Environment Variables"
    • Domain: app.your-domain.com (Traefik will handle SSL)
  2. Or use GitHub webhook to trigger deployment:

    # If webhook configured, push to main branch will auto-deploy
    # Otherwise, manually trigger deployment in Dokploy UI

Verify Applications:

# Wait for deployments to complete (~5 minutes)

# Check container status
docker ps | grep ripplecore

# Test health endpoints
curl https://app.your-domain.com/api/health
curl https://api.your-domain.com/api/health
curl https://www.your-domain.com/api/health
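
If the deployments are still rolling out, a small polling loop saves re-running the checks by hand. A sketch only: hostnames as above, assuming each health endpoint returns a success status once the service is up.

# Poll each health endpoint until it responds successfully
for host in app api www; do
  until curl -sf --max-time 5 "https://${host}.your-domain.com/api/health" > /dev/null; do
    echo "waiting for ${host}..."
    sleep 10
  done
  echo "${host} is healthy"
done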

Expected Time: 30-45 minutes


Step 5: Verify System Health (15 minutes)

Database Connectivity:

# From the app server, check that the DB port is reachable
# (curl will report "Empty reply from server" on success, since PostgreSQL does not speak HTTP;
#  "Connection refused" or a timeout points to a network/firewall problem)
docker exec ripplecore-app curl http://10.0.1.3:5432
# Or test via health endpoint
curl https://app.your-domain.com/api/health | jq '.checks.database'

Application Functionality:

# Test authentication
curl -X POST https://app.your-domain.com/api/auth/sign-in \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"testpass"}'

# Test API endpoints
curl https://api.your-domain.com/api/kindness \
  -H "Authorization: Bearer [test-token]"

Monitoring:

# Verify Netdata is running
curl http://localhost:19999/api/v1/info

# Check UptimeRobot shows services as up
# https://uptimerobot.com/dashboard

Performance Check:

# Run basic load test
ab -n 100 -c 10 https://app.your-domain.com/api/health

# Expected: All requests successful, <200ms response time

User Acceptance Test:

  • Create test user account
  • Log in to application
  • Navigate through main pages
  • Create sample evidence (kindness, volunteer, etc.)
  • Verify data appears correctly

Expected Time: 10-15 minutes


Step 6: Update Documentation & Notify (10 minutes)

Update Internal Documentation:

# Update server inventory (in infrastructure docs)
New Production App Server IP: [IP]
Replaced Date: [Date]
Reason: Complete server failure

Notify Stakeholders:

✅ INCIDENT RESOLVED

Production system has been fully restored after server failure.

Recovery Details:
  • Incident Start: [Time]
  • Recovery Complete: [Time]
  • Total Downtime: [Duration]
  • Data Loss: None (restored from backup)
  • Root Cause: [Server failure / Hardware issue / etc.]

All services are now operational and monitoring is green.

Post-Incident Tasks:

  • Schedule post-mortem meeting (within 24 hours)
  • Document root cause analysis
  • Update runbook with lessons learned
  • Test old server (if accessible) to determine failure reason
  • Delete old server from Hetzner Cloud Console (after 7 days grace period)

Scenario 2: Database Corruption

Symptoms:

  • PostgreSQL errors in logs: corrupted page detected
  • Application errors: relation does not exist
  • Data integrity check failures

Recovery Time: 1 hour


Step 1: Assess Corruption Severity (10 minutes)

# SSH into database server
ssh root@10.0.1.3

# Check PostgreSQL logs
docker logs ripplecore-postgres --tail 200 | grep -i corrupt

# Attempt database connection
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT 1"

# Check affected tables
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\dt"

# Attempt to dump database (will fail if severely corrupted)
docker exec ripplecore-postgres pg_dump -U ripplecore -d ripplecore > /tmp/corruption_test.sql
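
If the basic checks above are inconclusive, PostgreSQL's bundled amcheck tooling (available in PostgreSQL 14+) can scan for corruption more systematically. A sketch; the extension ships with the server but must be created in the database first:

# Install the amcheck extension and scan the database for corrupted relations
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "CREATE EXTENSION IF NOT EXISTS amcheck;"
docker exec ripplecore-postgres pg_amcheck -U ripplecore -d ripplecore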

Decision Point:

  • Minor Corruption (specific table): Restore that table from backup → Go to Step 2A
  • Severe Corruption (multiple tables/entire database): Full restore → Go to Step 2B

Step 2A: Restore Specific Table (30 minutes)

# Download latest backup
s5cmd cp s3://ripplecore-backups/postgres/daily/db_latest.dump.gz /tmp/
gunzip /tmp/db_latest.dump.gz

# Restore to temporary database
docker exec ripplecore-postgres psql -U ripplecore -c "CREATE DATABASE ripplecore_temp;"
docker exec -i ripplecore-postgres pg_restore -U ripplecore -d ripplecore_temp < /tmp/db_latest.dump

# Export corrupted table from backup
CORRUPTED_TABLE="kindness"
docker exec ripplecore-postgres pg_dump -U ripplecore -d ripplecore_temp -t $CORRUPTED_TABLE \
  > /tmp/${CORRUPTED_TABLE}_restore.sql

# Stop applications (prevent writes during restore)
docker stop ripplecore-app ripplecore-api ripplecore-web

# Drop and restore corrupted table
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "DROP TABLE IF EXISTS $CORRUPTED_TABLE CASCADE;"
docker exec -i ripplecore-postgres psql -U ripplecore -d ripplecore < /tmp/${CORRUPTED_TABLE}_restore.sql

# Restart applications
docker start ripplecore-app ripplecore-api ripplecore-web

# Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM $CORRUPTED_TABLE;"

# Cleanup
docker exec ripplecore-postgres psql -U ripplecore -c "DROP DATABASE ripplecore_temp;"

Step 2B: Full Database Restore (45 minutes)

Follow "Step 4A: Restore Database Server" procedure above

Key differences:

  • Keep existing PostgreSQL container running
  • Drop and recreate database instead of full server replacement

# Stop applications
docker stop ripplecore-app ripplecore-api ripplecore-web

# Download latest backup
s5cmd cp s3://ripplecore-backups/postgres/daily/db_latest.dump.gz /tmp/
gunzip /tmp/db_latest.dump.gz

# Drop corrupted database (connect to the postgres maintenance database,
# since a database cannot be dropped while it is the one you are connected to)
docker exec ripplecore-postgres psql -U ripplecore -d postgres -c "DROP DATABASE ripplecore;"

# Create fresh database
docker exec ripplecore-postgres psql -U ripplecore -d postgres -c "CREATE DATABASE ripplecore;"

# Restore from backup
docker exec -i ripplecore-postgres pg_restore \
  -U ripplecore \
  -d ripplecore \
  --clean \
  --if-exists \
  --verbose \
  < /tmp/db_latest.dump

# Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\dt"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM users;"

# Restart applications
docker start ripplecore-app ripplecore-api ripplecore-web

# Verify health
curl https://app.your-domain.com/api/health

Step 3: Investigate Root Cause (20 minutes)

Common Causes:

  • Disk corruption (check SMART status)
  • Out-of-memory killer (OOM) during writes
  • Sudden power loss (unlikely in cloud environment)
  • PostgreSQL bug (rare)

Investigation:

# Check disk health
smartctl -a /dev/sda

# Check for OOM events
dmesg | grep -i "out of memory"
grep -i "oom" /var/log/syslog

# Check PostgreSQL settings
docker exec ripplecore-postgres psql -U ripplecore -c "SHOW all;"

# Review recent PostgreSQL logs
docker logs ripplecore-postgres --since 24h | grep -E "ERROR|FATAL|PANIC"

Prevention:

  • Enable PostgreSQL checksums (detect corruption early)
  • Increase shared_buffers and effective_cache_size (reduce disk I/O)
  • Enable full_page_writes (prevent corruption after crashes)
  • Regular VACUUM operations (prevent bloat)
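
To see where the current instance stands on these settings, a read-only check like the one below can help (note that data checksums can only be enabled at initdb time or via pg_checksums with the server stopped):

# Inspect the settings mentioned above on the running instance
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SHOW data_checksums;"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SHOW shared_buffers;"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SHOW effective_cache_size;"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SHOW full_page_writes;"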

Scenario 3: Accidental Data Deletion

Symptoms:

  • User reports missing data
  • Audit logs show DELETE operations
  • Table has fewer records than expected

Recovery Time: 30 minutes


Procedure

See "Selective Table Restore" in BACKUP_RECOVERY.md

Key Steps:

  1. Identify deletion timestamp from user
  2. Find backup before deletion
  3. Restore to temporary database
  4. Extract deleted records (compare with production)
  5. Import recovered records back to production
  6. Verify with user
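
As a rough sketch of steps 4-5, assuming the backup has been restored to a temporary database named ripplecore_temp and the affected table is kindness (both names are illustrative; BACKUP_RECOVERY.md remains the authoritative procedure):

# Export the affected table from the temporary restore as plain INSERT statements
docker exec ripplecore-postgres pg_dump -U ripplecore -d ripplecore_temp \
  -t kindness --data-only --column-inserts > /tmp/kindness_recovered.sql

# Re-import into production: rows that still exist fail on their primary key and are
# skipped (psql continues past errors by default), so only the deleted rows come back
docker exec -i ripplecore-postgres psql -U ripplecore -d ripplecore < /tmp/kindness_recovered.sql

# Verify the record count with the user afterwards
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM kindness;"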

Post-Recovery Checklist

After any disaster recovery:

  • All services are operational
  • Monitoring shows green status
  • Users can access application
  • Data integrity verified (sample checks)
  • Backup process tested after recovery (see the round-trip sketch below)
  • Team notified of resolution
  • Incident documentation created
  • Post-mortem scheduled (within 24h)
  • Runbook updated with lessons learned
  • Infrastructure documentation updated
  • Old/failed resources cleaned up (after grace period)
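
For the "backup process tested after recovery" item, a minimal manual round trip confirms the pipeline works end to end (the manual/ prefix below is illustrative; adjust to the real backup layout and the existing backup scripts):

# Take an ad-hoc backup, push it to S3, then list it back
docker exec ripplecore-postgres pg_dump -U ripplecore -d ripplecore -Fc | gzip > /tmp/post_recovery_check.dump.gz
s5cmd cp /tmp/post_recovery_check.dump.gz s3://ripplecore-backups/postgres/manual/post_recovery_check.dump.gz
s5cmd ls s3://ripplecore-backups/postgres/manual/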

Testing & Drills

Quarterly DR Drill Schedule:

  • Q1: Database server failure simulation
  • Q2: Application server failure simulation
  • Q3: Complete datacenter failure simulation (switch to backup region if multi-region)
  • Q4: Ransomware attack simulation

Drill Checklist:

  • Schedule 2-hour maintenance window
  • Notify team of drill in advance
  • Execute DR procedure
  • Measure actual RTO vs. target
  • Document issues encountered
  • Update runbook with improvements
  • Share lessons learned with team

Last Updated: [Date]
Last Tested: [Date]
Next DR Drill: [Date]
Document Owner: [Your Name]