Backup & Disaster Recovery
Comprehensive backup automation and disaster recovery procedures
- RTO (Recovery Time Objective): 2 hours
- RPO (Recovery Point Objective): 24 hours (daily backups)
- Backup Storage: Hetzner Object Storage (S3-compatible)
Table of Contents
- Backup Strategy
- Automated Backup System
- Restore Procedures
- Disaster Recovery Scenarios
- Testing & Validation
- Backup Monitoring
Backup Strategy
Backup Scope
What Gets Backed Up:

| Component | Frequency | Retention | Location | Size |
| --- | --- | --- | --- | --- |
| PostgreSQL Database | Daily (3 AM) | 7 days / 4 weeks / 12 months | Hetzner Object Storage | ~500 MB compressed |
| Redis Data | Hourly (RDB) + AOF | 7 days | Local disk + daily S3 copy | ~100 MB |
| Environment Variables | On change | Indefinite (Git) | Private Git repo | <1 MB |
| Dokploy Configuration | Weekly | 4 weeks | Hetzner Object Storage | ~10 MB |
| SSL Certificates | Weekly | 4 weeks | Hetzner Object Storage | <1 MB |
| Application Code | On commit | Indefinite | GitHub (source control) | N/A |
| User Uploads | Daily | 30 days | Vercel Blob (already backed up) | Variable |
What Does NOT Get Backed Up:
- Docker images (rebuilt from source)
- Node modules (rebuilt from package.json)
- Temporary files and caches
- System packages (reinstalled from OS)
Backup Retention Policy
Grandfather-Father-Son Strategy:
Daily Backups (Son):
Frequency: Every day at 3 AM UTC
Retention: 7 days
Location: s3://ripplecore-backups/postgres/daily/
Weekly Backups (Father):
Frequency: Every Sunday at 3 AM UTC
Retention: 4 weeks
Location: s3://ripplecore-backups/postgres/weekly/
Monthly Backups (Grandfather):
Frequency: 1st of month at 3 AM UTC
Retention: 12 months
Location: s3://ripplecore-backups/postgres/monthly/

Storage Cost Estimate:
- Daily: 500MB × 7 = 3.5GB
- Weekly: 500MB × 4 = 2GB
- Monthly: 500MB × 12 = 6GB
- Total: ~12GB × €0.005/GB = €0.06/month
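As an illustration of how a backup script can route each dump to the correct Grandfather-Father-Son prefix, here is a minimal sketch. The helper name gfs_prefix is hypothetical; the paths match the locations listed above.

```bash
#!/bin/bash
# Sketch: pick the GFS prefix for today's backup (cron runs at 3 AM UTC)
gfs_prefix() {
  local base="s3://ripplecore-backups/postgres"
  if [ "$(date -u +%d)" = "01" ]; then
    echo "$base/monthly"   # 1st of the month -> Grandfather
  elif [ "$(date -u +%u)" = "7" ]; then
    echo "$base/weekly"    # Sunday -> Father
  else
    echo "$base/daily"     # any other day -> Son
  fi
}

# Usage example:
# s5cmd cp /backups/db_$(date -u +%Y%m%d_%H%M%S).dump.gz "$(gfs_prefix)/"
```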
Recovery Objectives
RTO (Recovery Time Objective): 2 hours
- Time from disaster to full service restoration
- Breakdown:
- Provision new server: 15 minutes
- Install software: 30 minutes
- Restore database: 30 minutes
- Deploy applications: 30 minutes
- DNS propagation: 15 minutes
RPO (Recovery Point Objective): 24 hours
- Maximum acceptable data loss
- Daily backups at 3 AM = worst case 24 hours of data loss
- Can be reduced to 1 hour with hourly backups (additional cost)
Service Level Agreement:
- Uptime Target: 99.5% (3.6 hours downtime/month acceptable)
- Data Durability: 99.999999999% (11 nines - Hetzner Object Storage)
- Backup Success Rate: 100% (validated weekly)
Automated Backup System
Hetzner Object Storage Setup
Step 1: Create Object Storage Bucket
Via Hetzner Cloud Console:
Navigate to: Cloud → Object Storage → Create Bucket
Bucket Name: ripplecore-backups
Region: eu-central (Falkenstein)
Versioning: Enabled (retain 3 versions)
Lifecycle Rules:
- Delete daily backups older than 7 days
- Delete weekly backups older than 28 days
- Delete monthly backups older than 365 days

Step 2: Generate Access Keys
Object Storage → Credentials → Generate New Key
Key Name: db-backup-production
Permissions: Read/Write
Buckets: ripplecore-backups
# Save credentials securely
Access Key: S3RVER1234567890
Secret Key: <long-secret-string>

Step 3: Install s5cmd (Fast S3 Client)
On database server:
# Download s5cmd (faster than aws-cli)
curl -L https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz \
| tar xz -C /usr/local/bin
# Verify installation
s5cmd version
# Configure credentials
mkdir -p ~/.aws
cat > ~/.aws/credentials <<EOF
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
EOF
cat > ~/.aws/config <<EOF
[default]
region = eu-central
endpoint_url = https://fsn1.your-objectstorage.com
EOF

PostgreSQL Backup Automation
Backup Script: /root/scripts/backup-db.sh
See scripts/backup-db.sh for complete implementation.
Features:
- PostgreSQL full dump (pg_dump with custom format)
- Gzip compression (~70% size reduction)
- Upload to Hetzner Object Storage (S3)
- Grandfather-Father-Son retention
- Backup verification (checksum)
- Slack notification on success/failure
- Automatic cleanup of old backups
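The complete script lives in the repo and is not reproduced here. The following is only a hedged sketch of how those features could fit together, assuming the container name ripplecore-postgres, the bucket layout above, and a SLACK_WEBHOOK_URL environment variable:

```bash
#!/bin/bash
# Sketch of /root/scripts/backup-db.sh — illustrative, not the actual implementation
set -euo pipefail

DATE=$(date -u +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/postgres"
FILE="db_${DATE}.dump"
S3_BASE="s3://ripplecore-backups/postgres"

mkdir -p "$BACKUP_DIR"

# 1. Full dump in custom format, then gzip (~70% size reduction)
docker exec ripplecore-postgres pg_dump -U ripplecore -Fc ripplecore > "$BACKUP_DIR/$FILE"
gzip "$BACKUP_DIR/$FILE"

# 2. Checksum for later verification during restore tests
sha256sum "$BACKUP_DIR/$FILE.gz" > "$BACKUP_DIR/$FILE.gz.sha256"

# 3. Grandfather-Father-Son prefix (same logic as the retention sketch above)
if   [ "$(date -u +%d)" = "01" ]; then PREFIX="monthly"
elif [ "$(date -u +%u)" = "7" ];  then PREFIX="weekly"
else PREFIX="daily"; fi

# 4. Upload dump and checksum to Hetzner Object Storage
s5cmd cp "$BACKUP_DIR/$FILE.gz" "$S3_BASE/$PREFIX/"
s5cmd cp "$BACKUP_DIR/$FILE.gz.sha256" "$S3_BASE/$PREFIX/"

# 5. Local cleanup and Slack notification
find "$BACKUP_DIR" -name "db_*" -mtime +7 -delete
curl -X POST "$SLACK_WEBHOOK_URL" -H 'Content-Type: application/json' \
  -d "{\"text\":\"✅ Database backup complete: $FILE.gz ($PREFIX)\"}"
```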
Schedule (crontab):
# Edit root crontab on database server
crontab -e
# Daily backup at 3 AM UTC
0 3 * * * /root/scripts/backup-db.sh >> /var/log/backup.log 2>&1
# Weekly restore test (Sundays at 4 AM)
0 4 * * 0 /root/scripts/test-restore.sh >> /var/log/backup.log 2>&1

Redis Backup Strategy
Persistence Configuration:
File: /etc/redis/redis.conf (or Docker volume mount)
# RDB Snapshots (point-in-time backups)
save 900 1 # Save if 1 key changed in 15 minutes
save 300 10 # Save if 10 keys changed in 5 minutes
save 60 10000 # Save if 10,000 keys changed in 1 minute
# RDB file location
dir /data/redis
dbfilename dump.rdb
# Enable compression
rdbcompression yes
rdbchecksum yes
# AOF (Append-Only File) for durability
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec # Sync to disk every second
# AOF rewrite (compact log file)
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

Redis Backup Script: /root/scripts/backup-redis.sh
#!/bin/bash
set -e
BACKUP_DIR="/backups/redis"
S3_BUCKET="s3://ripplecore-backups/redis"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p $BACKUP_DIR
# Capture the last save timestamp, then trigger a Redis background save
LAST_SAVE=$(docker exec ripplecore-redis redis-cli LASTSAVE)
docker exec ripplecore-redis redis-cli BGSAVE
# Wait until LASTSAVE advances, i.e. the snapshot has completed
while [ "$(docker exec ripplecore-redis redis-cli LASTSAVE)" -eq "$LAST_SAVE" ]; do
    sleep 1
done
# Copy RDB file
docker cp ripplecore-redis:/data/dump.rdb $BACKUP_DIR/redis_$DATE.rdb
# Compress
gzip $BACKUP_DIR/redis_$DATE.rdb
# Upload to S3
s5cmd cp $BACKUP_DIR/redis_$DATE.rdb.gz $S3_BUCKET/
# Cleanup old local backups (keep last 7 days)
find $BACKUP_DIR -name "*.rdb.gz" -mtime +7 -delete
echo "Redis backup complete: redis_$DATE.rdb.gz"Schedule:
# Daily Redis backup at 3:30 AM (after PostgreSQL)
30 3 * * * /root/scripts/backup-redis.sh >> /var/log/backup.log 2>&1

Configuration Backup
Dokploy Configuration Export:
#!/bin/bash
# /root/scripts/backup-dokploy.sh
BACKUP_DIR="/backups/dokploy"
S3_BUCKET="s3://ripplecore-backups/config"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p $BACKUP_DIR
# Export Dokploy database (SQLite)
docker exec dokploy-db sqlite3 /app/data/dokploy.db ".backup /tmp/dokploy_$DATE.db"
docker cp dokploy-db:/tmp/dokploy_$DATE.db $BACKUP_DIR/
# Backup Traefik configuration
docker cp dokploy-traefik:/etc/traefik $BACKUP_DIR/traefik_$DATE/
# Backup SSL certificates
docker cp dokploy-traefik:/letsencrypt $BACKUP_DIR/letsencrypt_$DATE/
# Compress everything
tar czf $BACKUP_DIR/dokploy_config_$DATE.tar.gz \
$BACKUP_DIR/dokploy_$DATE.db \
$BACKUP_DIR/traefik_$DATE/ \
$BACKUP_DIR/letsencrypt_$DATE/
# Upload to S3
s5cmd cp $BACKUP_DIR/dokploy_config_$DATE.tar.gz $S3_BUCKET/
# Cleanup
rm -rf $BACKUP_DIR/dokploy_$DATE.db $BACKUP_DIR/traefik_$DATE/ $BACKUP_DIR/letsencrypt_$DATE/
find $BACKUP_DIR -name "*.tar.gz" -mtime +28 -delete
echo "Dokploy configuration backup complete"Schedule:
# Weekly on Sundays at 5 AM
0 5 * * 0 /root/scripts/backup-dokploy.sh >> /var/log/backup.log 2>&1

Restore Procedures
PostgreSQL Database Restore
Scenario: Restore production database from backup
Prerequisites:
- Database server is running
- Backup file is available in S3 or local storage
- No active connections to database (stop apps first)
Procedure:
# 1. List available backups
s5cmd ls s3://ripplecore-backups/postgres/daily/
# 2. Download desired backup
BACKUP_FILE="db_20250123_030000.dump.gz"
s5cmd cp s3://ripplecore-backups/postgres/daily/$BACKUP_FILE /tmp/
# 3. Decompress
gunzip /tmp/$BACKUP_FILE
# Result: /tmp/db_20250123_030000.dump
# 4. Stop applications (prevent writes during restore)
docker stop ripplecore-app ripplecore-api ripplecore-web
# 5. Drop existing database (DESTRUCTIVE - confirm first)
# (connect to the postgres maintenance database; a database cannot be dropped while connected to it)
docker exec ripplecore-postgres psql -U ripplecore -d postgres -c "DROP DATABASE ripplecore;"
# 6. Create fresh database
docker exec ripplecore-postgres psql -U ripplecore -d postgres -c "CREATE DATABASE ripplecore;"
# 7. Restore from backup
docker exec -i ripplecore-postgres pg_restore \
-U ripplecore \
-d ripplecore \
--verbose \
--clean \
--if-exists \
< /tmp/db_20250123_030000.dump
# 8. Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\dt"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM users;"
# 9. Restart applications
docker start ripplecore-app ripplecore-api ripplecore-web
# 10. Verify applications are healthy
curl https://app.your-domain.com/api/health
# 11. Cleanup
rm -f /tmp/db_20250123_030000.dump

Estimated Time: 20-30 minutes (depending on database size)
Redis Data Restore
Scenario: Restore Redis cache/sessions from backup
# 1. Download Redis backup
s5cmd cp s3://ripplecore-backups/redis/redis_20250123_033000.rdb.gz /tmp/
# 2. Decompress
gunzip /tmp/redis_20250123_033000.rdb.gz
# 3. Stop Redis container
docker stop ripplecore-redis
# 4. Replace RDB file
docker cp /tmp/redis_20250123_033000.rdb ripplecore-redis:/data/dump.rdb
# 5. Start Redis
docker start ripplecore-redis
# 6. Verify data restored
docker exec ripplecore-redis redis-cli DBSIZE
docker exec ripplecore-redis redis-cli INFO keyspace
# 7. Cleanup
rm -f /tmp/redis_20250123_033000.rdb

Note: Redis restores are non-destructive to applications since Redis is a cache. Sessions will be recreated on next user login.
Selective Table Restore
Scenario: Restore only specific table from backup (e.g., accidentally deleted data)
# 1. Download and decompress backup
s5cmd cp s3://ripplecore-backups/postgres/daily/db_20250123_030000.dump.gz /tmp/
gunzip /tmp/db_20250123_030000.dump.gz
# 2. Restore to temporary database
docker exec ripplecore-postgres psql -U ripplecore -c "CREATE DATABASE ripplecore_temp;"
docker exec -i ripplecore-postgres pg_restore \
-U ripplecore \
-d ripplecore_temp \
< /tmp/db_20250123_030000.dump
# 3. Extract specific table data
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore_temp -c "\COPY kindness TO '/tmp/kindness_restore.csv' CSV HEADER"
# 4. Import to production database
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "\COPY kindness FROM '/tmp/kindness_restore.csv' CSV HEADER"
# 5. Verify restoration
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM kindness;"
# 6. Cleanup temporary database
docker exec ripplecore-postgres psql -U ripplecore -c "DROP DATABASE ripplecore_temp;"
rm -f /tmp/db_20250123_030000.dump /tmp/kindness_restore.csv

Disaster Recovery Scenarios
Scenario 1: Database Corruption
Detection:
- PostgreSQL errors in logs: "corrupted page detected"
- Application errors: "relation does not exist"
- Backup validation failures
Response (RTO: 1 hour):
# 1. Immediately stop applications (prevent further corruption)
docker stop ripplecore-app ripplecore-api ripplecore-web
# 2. Assess corruption extent
docker exec ripplecore-postgres pg_dumpall -U ripplecore --schema-only > /tmp/schema_check.sql
# 3. If corruption is severe, restore from latest backup
# Follow "PostgreSQL Database Restore" procedure above
# 4. If corruption is minor, attempt repair
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "REINDEX DATABASE ripplecore;"
# 5. Verify database integrity
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT * FROM pg_stat_database WHERE datname='ripplecore';"
# 6. Restart applications
docker start ripplecore-app ripplecore-api ripplecore-web
# 7. Monitor for errors
docker logs ripplecore-app --follow | grep ERROR

Prevention:
- Enable PostgreSQL checksums (detect corruption early)
- Regular backup validation (weekly automated tests)
- Monitor disk health (SMART monitoring)
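As a quick sketch of the checksum recommendation (container name and the postgres-data volume match the ones used in Scenario 2 below; checksums can only be enabled while the cluster is offline):

```bash
# Check whether page checksums are enabled (prints "on" or "off")
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SHOW data_checksums;"

# If off, they can be enabled offline with pg_checksums, e.g.:
#   docker stop ripplecore-postgres
#   docker run --rm -v postgres-data:/var/lib/postgresql/data postgres:18-alpine \
#     pg_checksums --enable -D /var/lib/postgresql/data
#   docker start ripplecore-postgres
```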
Scenario 2: Complete Server Failure
Detection:
- Server unreachable via SSH
- Hetzner Cloud Console shows server offline
- All applications down simultaneously
Response (RTO: 2 hours):
Step 1: Provision New Server (15 minutes)
# Via Hetzner Cloud Console
# 1. Create new server (same specs: CPX32 for app, CPX22 for DB)
# 2. Use same SSH key
# 3. Assign to same private network (10.0.1.0/24)
# 4. Reassign floating IP to new server (instant failover)

Step 2: Install Base Software (30 minutes)
# SSH into new server
ssh root@new-server-ip
# Install Docker
curl -fsSL https://get.docker.com | sh
# Install Dokploy (if replacing CI/CD server)
curl -sSL https://dokploy.com/install.sh | sh
# Install monitoring
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

Step 3: Restore Database (30 minutes)
# Start PostgreSQL container
docker run -d \
--name ripplecore-postgres \
-e POSTGRES_USER=ripplecore \
-e POSTGRES_PASSWORD=<secret> \
-p 5432:5432 \
-v postgres-data:/var/lib/postgresql/data \
postgres:18-alpine
# Restore from backup (follow procedure above)
s5cmd cp s3://ripplecore-backups/postgres/daily/db_latest.dump.gz /tmp/
gunzip /tmp/db_latest.dump.gz
docker exec -i ripplecore-postgres pg_restore -U ripplecore -d ripplecore < /tmp/db_latest.dump

Step 4: Deploy Applications (30 minutes)
# Deploy via Dokploy (if configured) or manual Docker
# Follow deployment instructions in DEPLOYMENT.md
# Verify health
curl https://app.your-domain.com/api/health

Step 5: DNS Update (if not using floating IP)
# Update DNS A records to point to new server IP
# Propagation time: 5-60 minutes depending on TTL

Scenario 3: Accidental Data Deletion
Detection:
- User reports missing data
- Database shows unexpected DELETE operations in logs
Response (RTO: 30 minutes):
# 1. Identify deletion timestamp
# Ask user: "When did you last see the data?"
DELETION_TIME="2025-01-23 14:30:00"
# 2. Find backup before deletion
# List backups before that time
s5cmd ls s3://ripplecore-backups/postgres/daily/ | grep "20250123"
# 3. Download backup before deletion
s5cmd cp s3://ripplecore-backups/postgres/daily/db_20250123_030000.dump.gz /tmp/
# 4. Restore to temporary database (follow "Selective Table Restore" above)
# 5. Export records created before the deletion from the temporary database
#    (assumes the kindness table has primary key "id")
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore_temp \
    -c "\COPY (SELECT * FROM kindness WHERE created_at < '$DELETION_TIME') TO '/tmp/deleted_records.csv' CSV HEADER"
# 6. Load into a staging table, then re-insert only the rows missing from production
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore \
    -c "CREATE TABLE kindness_restore (LIKE kindness INCLUDING ALL);"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore \
    -c "\COPY kindness_restore FROM '/tmp/deleted_records.csv' CSV HEADER"
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore \
    -c "INSERT INTO kindness SELECT * FROM kindness_restore ON CONFLICT (id) DO NOTHING; DROP TABLE kindness_restore;"
# 7. Verify with user
# "Please check if your data is restored"

Prevention:
- Implement soft deletes (mark as deleted instead of DROP)
- Database triggers for audit logging
- Point-in-time recovery (requires WAL archiving - see Advanced section)
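A hedged sketch of the soft-delete and audit-trigger ideas follows; the column, table, and trigger names are illustrative, not the project's actual schema:

```bash
# Soft delete: add a deleted_at column and filter on it instead of issuing DELETEs
docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c \
  "ALTER TABLE kindness ADD COLUMN IF NOT EXISTS deleted_at timestamptz;"

# Audit logging: record every hard DELETE on kindness into an audit table via a trigger
docker exec -i ripplecore-postgres psql -U ripplecore -d ripplecore <<'SQL'
CREATE TABLE IF NOT EXISTS kindness_audit (
  id bigserial PRIMARY KEY,
  deleted_row jsonb NOT NULL,
  deleted_at timestamptz NOT NULL DEFAULT now()
);
CREATE OR REPLACE FUNCTION log_kindness_delete() RETURNS trigger AS $$
BEGIN
  INSERT INTO kindness_audit (deleted_row) VALUES (to_jsonb(OLD));
  RETURN OLD;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER kindness_delete_audit
  BEFORE DELETE ON kindness
  FOR EACH ROW EXECUTE FUNCTION log_kindness_delete();
SQL
```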
Scenario 4: Ransomware Attack
Detection:
- Files encrypted with unusual extensions
- Ransom note in file system
- Database access errors
Response (RTO: 3 hours):
DO NOT PAY RANSOM
# 1. Immediately isolate infected server
# Hetzner Console → Server → Network → Disable all interfaces
# 2. Provision completely new infrastructure
# Fresh servers, new IP addresses
# 3. Restore from OFFLINE backups (S3 versioned backups)
# Verify backups were NOT accessed by attacker
s5cmd ls s3://ripplecore-backups/ --show-versions
# 4. Restore to new infrastructure
# Follow "Complete Server Failure" procedure
# 5. Audit attack vector
# Review SSH logs, application logs
# Implement additional security (fail2ban, 2FA for SSH)
# 6. Report to authorities and customers (if PII compromised - GDPR)

Prevention:
- Immutable backups (S3 Object Lock)
- Network segmentation (private database network)
- Regular security updates
- fail2ban for brute-force protection
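If the S3 endpoint supports Object Lock for your bucket (verify Hetzner's current S3 feature coverage before relying on it), a default retention can be set with the standard AWS CLI; this is only a sketch, and the bucket must have been created with Object Lock enabled:

```bash
# Sketch: enforce a 30-day compliance-mode retention on newly written backup objects
# (assumption: the provider's S3 API supports Object Lock and it was enabled at bucket creation)
aws s3api put-object-lock-configuration \
  --endpoint-url https://fsn1.your-objectstorage.com \
  --bucket ripplecore-backups \
  --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":30}}}'
```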
Testing & Validation
Weekly Automated Restore Test
Script: /root/scripts/test-restore.sh
See scripts/test-restore.sh for complete implementation.
Purpose:
- Verify backups are restorable (not corrupted)
- Validate backup automation is working
- Practice disaster recovery procedures
- Meet compliance requirements
Features:
- Download latest backup from S3
- Restore to temporary database
- Run data integrity checks
- Compare record counts with production
- Send Slack notification with results
- Cleanup temporary resources
Success Criteria:
- Backup file downloads successfully
- Restore completes without errors
- Record counts match production (within 24h delta)
- All critical tables present
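The real script is in scripts/test-restore.sh; this is only a sketch of the flow under the same assumptions as the backup script (container name, bucket layout, SLACK_WEBHOOK_URL):

```bash
#!/bin/bash
# Sketch of /root/scripts/test-restore.sh — restore the latest backup into a scratch database
set -euo pipefail

# 1. Find and download the most recent daily backup
LATEST=$(s5cmd ls s3://ripplecore-backups/postgres/daily/ | sort | tail -n 1 | awk '{print $NF}')
s5cmd cp "s3://ripplecore-backups/postgres/daily/$LATEST" /tmp/
gunzip -f "/tmp/$LATEST"
DUMP="/tmp/${LATEST%.gz}"

# 2. Restore into a temporary database
docker exec ripplecore-postgres psql -U ripplecore -c "DROP DATABASE IF EXISTS restore_test;"
docker exec ripplecore-postgres psql -U ripplecore -c "CREATE DATABASE restore_test;"
docker exec -i ripplecore-postgres pg_restore -U ripplecore -d restore_test < "$DUMP"

# 3. Basic integrity check: a critical table exists and contains rows
USERS=$(docker exec ripplecore-postgres psql -U ripplecore -d restore_test -tAc "SELECT COUNT(*) FROM users;")

# 4. Report and clean up
curl -X POST "$SLACK_WEBHOOK_URL" -H 'Content-Type: application/json' \
  -d "{\"text\":\"✅ Weekly restore test passed: $LATEST ($USERS users)\"}"
docker exec ripplecore-postgres psql -U ripplecore -c "DROP DATABASE restore_test;"
rm -f "$DUMP"
```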
Schedule:
# Every Sunday at 4 AM (after weekly backup)
0 4 * * 0 /root/scripts/test-restore.sh >> /var/log/backup.log 2>&1

Manual DR Drill (Quarterly)
Purpose: Practice complete disaster recovery
Procedure (2-3 hour exercise):
1. Simulate Disaster (10 minutes)
   - Pretend the production server is completely destroyed
   - Document start time
2. Execute DR Plan (2 hours)
   - Provision new server
   - Restore from backups
   - Deploy applications
   - Verify functionality
3. Measure Actual RTO (5 minutes)
   - Document end time
   - Calculate actual recovery time
   - Compare to 2-hour RTO objective
4. Document Lessons Learned (30 minutes)
   - What went well?
   - What took longer than expected?
   - Update DR procedures accordingly
5. Update Runbooks (15 minutes)
   - Incorporate improvements
   - Update time estimates
   - Add missing steps
Next Drill: Schedule 3 months from now
Backup Monitoring
Backup Success Verification
Daily Health Check:
#!/bin/bash
# /root/scripts/verify-backups.sh
# Run daily at 6 AM (3 hours after backup)
DATE=$(date +%Y%m%d)
# Check if today's backup exists in S3
if s5cmd ls s3://ripplecore-backups/postgres/daily/ | grep -q "db_${DATE}"; then
echo "✅ Today's backup exists: db_${DATE}"
else
echo "❌ Today's backup missing: db_${DATE}"
curl -X POST $SLACK_WEBHOOK_URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"🚨 Database backup failed for $DATE\"}"
exit 1
fi
# Verify backup size is reasonable (>100MB compressed)
BACKUP_SIZE=$(s5cmd ls s3://ripplecore-backups/postgres/daily/db_${DATE}_*.dump.gz | awk '{print $3}')
if [ $BACKUP_SIZE -gt 100000000 ]; then
echo "✅ Backup size is healthy: $BACKUP_SIZE bytes"
else
echo "⚠️ Backup size is suspiciously small: $BACKUP_SIZE bytes"
curl -X POST $SLACK_WEBHOOK_URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"⚠️ Database backup for $DATE is smaller than expected: $BACKUP_SIZE bytes\"}"
fi

Schedule:
0 6 * * * /root/scripts/verify-backups.sh >> /var/log/backup.log 2>&1

Slack Notifications
Backup Success:
{
"text": "✅ Database Backup Successful",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Database Backup Completed*"
}
},
{
"type": "section",
"fields": [
{
"type": "mrkdwn",
"text": "*Date:*\n2025-01-23 03:00 UTC"
},
{
"type": "mrkdwn",
"text": "*Size:*\n456 MB (compressed)"
},
{
"type": "mrkdwn",
"text": "*Location:*\ns3://ripplecore-backups/postgres/daily/"
},
{
"type": "mrkdwn",
"text": "*Status:*\n✅ Success"
}
]
}
]
}

Backup Failure:
{
"text": "🚨 Database Backup Failed",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Database Backup FAILED* 🚨"
}
},
{
"type": "section",
"fields": [
{
"type": "mrkdwn",
"text": "*Date:*\n2025-01-23 03:00 UTC"
},
{
"type": "mrkdwn",
"text": "*Error:*\nConnection to database failed"
},
{
"type": "mrkdwn",
"text": "*Action Required:*\nInvestigate immediately"
}
]
}
]
}

Backup Dashboard (Optional - Grafana)
Metrics to Track:
- Backup success rate (target: 100%)
- Backup duration (trend over time)
- Backup size growth (capacity planning)
- Last successful backup timestamp
- Restore test success rate
Grafana Panel Queries (if using Prometheus):
# Backup success rate (last 7 days)
rate(backup_success_total[7d]) / rate(backup_attempts_total[7d]) * 100
# Time since last successful backup
time() - backup_last_success_timestamp
# Backup size growth (30-day trend)
increase(backup_size_bytes[30d])

Advanced Topics
Point-in-Time Recovery (PITR)
Use Case: Restore database to exact moment before corruption/deletion
Requirements:
- PostgreSQL WAL (Write-Ahead Logging) archiving
- Continuous archiving to S3
- Base backup + WAL files
Setup (add to future roadmap):
-- Enable WAL archiving in PostgreSQL
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 's5cmd cp %p s3://ripplecore-backups/postgres/wal/%f';
-- wal_level and archive_mode changes require a full PostgreSQL restart, not just a reload
-- (e.g. docker restart ripplecore-postgres)

Recovery Command:
# Point-in-time recovery is driven by recovery settings, not a pg_restore flag:
# 1. Restore the most recent base backup
# 2. In postgresql.conf set:
#      restore_command = 's5cmd cp s3://ripplecore-backups/postgres/wal/%f %p'
#      recovery_target_time = '2025-01-23 14:29:00'
# 3. Create recovery.signal in the data directory and start PostgreSQL

Cost: ~€5-10/month for WAL storage (recommend only for production)
Encrypted Backups
Use Case: Comply with GDPR/regulations for sensitive data
Implementation:
# Encrypt backup before upload
gpg --symmetric --cipher-algo AES256 /tmp/db_backup.dump
s5cmd cp /tmp/db_backup.dump.gpg s3://ripplecore-backups/postgres/daily/
# Decrypt during restore
s5cmd cp s3://ripplecore-backups/postgres/daily/db_backup.dump.gpg /tmp/
gpg --decrypt /tmp/db_backup.dump.gpg > /tmp/db_backup.dump

Key Management: Store GPG passphrase in 1Password/Bitwarden
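Because the backup runs from cron, gpg must not prompt for the passphrase; one hedged option is a root-only passphrase file (the path below is an assumption):

```bash
# Encrypt without prompting (passphrase stored in /root/.backup-passphrase, chmod 600)
gpg --batch --yes --pinentry-mode loopback --passphrase-file /root/.backup-passphrase \
    --symmetric --cipher-algo AES256 /tmp/db_backup.dump

# Decrypt the same way during a restore
gpg --batch --yes --pinentry-mode loopback --passphrase-file /root/.backup-passphrase \
    --decrypt /tmp/db_backup.dump.gpg > /tmp/db_backup.dump
```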
Related Documentation
- Infrastructure Overview: See ARCHITECTURE.md
- CI/CD Pipeline: See CI_CD_PIPELINE.md
- Monitoring Setup: See MONITORING.md
- Disaster Recovery Runbook: See ../runbooks/disaster-recovery.mdx
Document Version: 1.0
Last Updated: 2025-01-23
Review Cycle: After each DR drill or incident