Deployment Checklist

Pre-launch verification checklist for production deployment

Purpose: Ensure all critical systems are configured and tested before launch
Estimated Time: 4-6 hours (first-time setup)
Review Cycle: Before each major production deployment


Phase 1: Infrastructure Setup

Hetzner Cloud Servers

  • Production App Server (CPX32)

    • Server created in Falkenstein datacenter (fsn1)
    • Ubuntu 24.04 LTS installed
    • SSH key configured
    • Private network assigned (10.0.1.0/24)
    • Floating IP assigned
    • Cloud firewall applied
    • Hostname set: ripplecore-app-prod
  • Production DB Server (CPX22)

    • Server created in Falkenstein datacenter (fsn1)
    • Ubuntu 24.04 LTS installed
    • SSH key configured
    • Private network assigned (10.0.1.0/24)
    • Cloud firewall applied (database ports private only)
    • Hostname set: ripplecore-db-prod
  • CI/CD Server (CPX11)

    • Server created in Falkenstein datacenter (fsn1)
    • Ubuntu 24.04 LTS installed
    • SSH key configured
    • Private network assigned (10.0.1.0/24)
    • Floating IP assigned (optional)
    • Hostname set: ripplecore-cicd-prod
  • Staging Server (CPX22)

    • Server created in Falkenstein datacenter (fsn1)
    • Ubuntu 24.04 LTS installed
    • Private network assigned (10.0.2.0/24)
    • Hostname set: ripplecore-staging

Network Configuration

  • Private Network

    • Network created: ripplecore-prod-network
    • Subnet: 10.0.1.0/24 (production)
    • Subnet: 10.0.2.0/24 (staging)
    • All servers attached to appropriate subnets
  • Firewall Rules

    • Firewall created: ripplecore-prod-firewall
    • Inbound rules configured (80 and 443 open; 22 restricted to authorized IPs)
    • Outbound rules configured (allow all for updates)
    • Applied to all production servers
  • Floating IPs

    • Production app floating IP assigned
    • Floating IP DNS records configured

DNS Configuration

  • Domain Records Configured

    • app.your-domain.com → App server floating IP
    • api.your-domain.com → App server floating IP
    • www.your-domain.com → App server floating IP
    • staging.your-domain.com → Staging server IP
    • dokploy.your-domain.com → CI/CD server IP
  • DNS Propagation

    • All records propagated (check with dig +short)
    • TTL set appropriately (300s for production, 60s for staging)
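`dig +short` only prints addresses, so a record that points at the wrong server is easy to miss. The sketch below compares a hostname against the floating IP you expect. `check_dns` is a hypothetical helper, and `203.0.113.10` is a documentation placeholder address.

```bash
# Hypothetical helper: succeed only if HOST resolves to EXPECTED_IP.
# Usage: check_dns app.your-domain.com 203.0.113.10
check_dns() {
  local host="$1" expected="$2" resolved
  # dig +short prints one answer per line; CNAME targets (ending in ".")
  # may precede the A records, so match whole lines only
  resolved="$(dig +short "$host" A)"
  if grep -qxF -- "$expected" <<<"$resolved"; then
    echo "OK   $host -> $expected"
  else
    echo "FAIL $host resolves to: $(tr '\n' ' ' <<<"$resolved")(expected $expected)"
    return 1
  fi
}
```

Run it once per record, for example `check_dns staging.your-domain.com <staging-ip>`.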

Object Storage

  • Hetzner Object Storage
    • Bucket created: ripplecore-backups
    • Access key generated and saved securely
    • Lifecycle policies configured (7/28/365 day retention)
    • Versioning enabled
    • s5cmd installed and configured on DB server

Phase 2: Application Configuration

Database Setup

  • PostgreSQL 18

    • Docker container running (ripplecore-postgres)
    • Database created: ripplecore
    • User created: ripplecore with secure password
    • Accessible only via private network (10.0.1.3:5432)
    • Migrations applied (Drizzle schema pushed)
    • Test query successful
    docker exec ripplecore-postgres psql -U ripplecore -d ripplecore -c "SELECT COUNT(*) FROM users;"
  • Redis 7

    • Docker container running (ripplecore-redis)
    • AOF persistence enabled
    • RDB snapshots configured (hourly)
    • Accessible only via private network (10.0.1.3:6379)
    • Test connection successful
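    # Redis is password-protected (see REDIS_URL), so pass the password
    docker exec ripplecore-redis redis-cli -a '<secret>' --no-auth-warning ping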
    # Expected: PONG

Application Deployment

  • Dokploy Installation

    • Dokploy installed on CI/CD server
    • Traefik reverse proxy running
    • Admin account created
    • HTTPS certificates configured (Let's Encrypt)
  • Main Application (Next.js)

    • Deployed via Dokploy
    • Environment variables configured
    • Health endpoint accessible: /api/health
    • Domain SSL working: https://app.your-domain.com
    • 2 replicas running (zero-downtime deployments)
  • API Server (Next.js)

    • Deployed via Dokploy
    • Environment variables configured
    • Health endpoint accessible: /api/health
    • Domain SSL working: https://api.your-domain.com
  • Marketing Website (Next.js)

    • Deployed via Dokploy
    • Environment variables configured
    • Health endpoint accessible: /api/health
    • Domain SSL working: https://www.your-domain.com

Environment Variables

  • Production Environment Variables Set

    Critical Variables (verify in Dokploy for each app):

    DATABASE_URL=postgresql://ripplecore:<secret>@10.0.1.3:5432/ripplecore
    REDIS_URL=redis://:<secret>@10.0.1.3:6379
    BETTER_AUTH_SECRET=<32-char-secret>  # Generated with: npx @better-auth/cli secret
    BETTER_AUTH_URL=https://app.your-domain.com
    BETTER_AUTH_TRUST_HOST=true
    NODE_ENV=production
    NEXT_PUBLIC_APP_URL=https://app.your-domain.com
    STRICT_HEALTH_CHECK=true
    ADMIN_USER_IDS=<comma-separated-user-ids>

    Optional Services (if configured):

    SENTRY_DSN=<sentry-dsn>
    ARCJET_KEY=<arcjet-api-key>
    SLACK_WEBHOOK_URL=<slack-webhook>
  • Secrets Validation

    • All secrets generated (not default values)
    • Secrets stored in 1Password vault
    • No secrets committed to Git
    • .env.local files gitignored
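One quick way to catch a variable that was never set, or that still holds a template value, is to export each app's environment from Dokploy into a file and scan it. This is a minimal sketch that assumes unquoted `KEY=VALUE` lines. `check_env_file` is a hypothetical helper, the variable list mirrors the critical variables above, and the 32-character minimum mirrors the `<32-char-secret>` placeholder.

```bash
# Hypothetical helper: validate an env file of unquoted KEY=VALUE lines.
# Variable list taken from the "Critical Variables" above.
REQUIRED_VARS=(DATABASE_URL REDIS_URL BETTER_AUTH_SECRET BETTER_AUTH_URL
  BETTER_AUTH_TRUST_HOST NODE_ENV NEXT_PUBLIC_APP_URL STRICT_HEALTH_CHECK ADMIN_USER_IDS)

check_env_file() {
  local file="$1" var line value status=0
  for var in "${REQUIRED_VARS[@]}"; do
    # Use the last assignment if a variable appears more than once
    line="$(grep -E "^${var}=" "$file" | tail -n 1)"
    value="${line#*=}"
    if [[ -z "$value" ]]; then
      printf '%-12s%s\n' MISSING "$var"; status=1
    elif [[ "$value" == *"<"*">"* ]]; then
      # Still contains a template placeholder such as <secret>
      printf '%-12s%s\n' PLACEHOLDER "$var"; status=1
    elif [[ "$var" == BETTER_AUTH_SECRET && ${#value} -lt 32 ]]; then
      printf '%-12s%s (%d chars)\n' WEAK "$var" "${#value}"; status=1
    fi
  done
  return "$status"
}
```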

Phase 3: Security & Compliance

SSH Hardening

  • SSH Configuration (on all servers)

    # /etc/ssh/sshd_config
    PasswordAuthentication no
    PermitRootLogin prohibit-password
    PubkeyAuthentication yes
  • fail2ban Installed

    sudo apt install fail2ban
    sudo systemctl enable fail2ban
    sudo systemctl start fail2ban
  • SSH Keys

    • Only authorized team keys in ~/.ssh/authorized_keys
    • No password-based authentication
    • SSH access tested from authorized IPs only
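Editing `/etc/ssh/sshd_config` is not enough on its own. Ubuntu 24.04 includes `/etc/ssh/sshd_config.d/*.conf`, and cloud-init may write a `PasswordAuthentication` line there. Because sshd keeps the first value it reads for each keyword, a drop-in can override your setting.

The sketch below checks the effective settings reported by `sudo sshd -T` instead. `check_sshd_effective` is a hypothetical helper. `sshd -T` reports `prohibit-password` under its older alias `without-password`, so both are accepted. After any change, run `sudo sshd -t` to validate the syntax before reloading the `ssh` service.

```bash
# Hypothetical helper: check the *effective* sshd settings.
# Usage: sudo sshd -T | check_sshd_effective
check_sshd_effective() {
  awk '
    $1 == "passwordauthentication" { pw = $2 }
    $1 == "pubkeyauthentication"   { pk = $2 }
    $1 == "permitrootlogin"        { root = $2 }
    END {
      bad = 0
      if (pw != "no")  { print "FAIL passwordauthentication=" pw; bad = 1 }
      if (pk != "yes") { print "FAIL pubkeyauthentication=" pk; bad = 1 }
      # sshd -T prints prohibit-password as without-password; "no" is stricter
      if (root != "prohibit-password" && root != "without-password" && root != "no") {
        print "FAIL permitrootlogin=" root; bad = 1
      }
      if (!bad) print "OK sshd effective config"
      exit bad
    }'
}
```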

Application Security

  • Security Headers Verified

    Test with:

    curl -sI https://app.your-domain.com | grep -iE '^(strict-transport-security|x-frame-options|x-content-type-options|content-security-policy|permissions-policy):'

    Expected headers:

    • Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
    • X-Frame-Options: DENY
    • X-Content-Type-Options: nosniff
    • Content-Security-Policy: [configured]
    • Permissions-Policy: [configured]
  • SSL/TLS Configuration

    • Let's Encrypt certificates installed
    • Auto-renewal configured (Traefik handles this)
    • SSL Labs grade: A or A+
    • HTTP → HTTPS redirect working
    • HSTS preload enabled
  • Rate Limiting

    • Arcjet API key configured (if using)
    • Rate limits tested:
      • Public endpoints: 10 req/10s per IP
      • Authenticated: 100 req/min per user
      • Admin: 50 req/min per user
    # Test rate limiting (prints one HTTP status code per request)
    for i in {1..15}; do curl -s -o /dev/null -w "%{http_code}\n" https://app.your-domain.com/api/kindness; done
    # Should return 429 after limit exceeded
  • Authentication Security

    • better-auth configured with secure settings
    • Session expiry: 8 hours (PRD requirement)
    • Secure cookies enabled (useSecureCookies: true)
    • Session rotation enabled (updateAge: 3600)
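The header `grep` and the rate-limit loop above both need a person to read their output. The two hypothetical helpers below turn them into pass/fail checks, as a hedged sketch:

- `check_security_headers` reports which expected headers are missing from a `curl -sI` response. It compares names case-insensitively, because HTTP/2 responses use lowercase header names.
- `count_rate_limited` counts `429` responses in the loop's output and fails if there were none.

```bash
# Expected security headers (lowercase), from the list above
REQUIRED_HEADERS=(strict-transport-security x-frame-options x-content-type-options
  content-security-policy permissions-policy)

# Usage: curl -sI https://app.your-domain.com | check_security_headers
check_security_headers() {
  local headers h missing=0
  # Strip CRs, keep only "Name: value" lines, lowercase the names
  headers="$(tr -d '\r' | awk -F': *' 'NF > 1 { print tolower($1) }')"
  for h in "${REQUIRED_HEADERS[@]}"; do
    if ! grep -qxF "$h" <<<"$headers"; then
      echo "MISSING $h"; missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   for i in {1..15}; do
#     curl -s -o /dev/null -w '%{http_code}\n' https://app.your-domain.com/api/kindness
#   done | count_rate_limited
count_rate_limited() {
  local n
  n="$(grep -cx '429')"      # number of lines that are exactly 429
  echo "$n"
  [[ "$n" -gt 0 ]]           # fail if the limit never triggered
}
```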

Database Security

  • Database Access

    • Database NOT exposed to public internet
    • Accessible only via private network (10.0.1.0/24)
    • Strong password configured (>32 characters)
    • Audit logging enabled (PostgreSQL logs queries)
  • Multi-Tenant Isolation

    • All queries filter by organizationId
    • Cache keys include organization scope
    • Session includes activeOrganizationId
    • Test cross-tenant access (should be blocked)
    # Verify organization isolation: with an org1 token, the response
    # should contain only org1 data, never org2 records
    curl https://app.your-domain.com/api/kindness \
      -H "Authorization: Bearer <org1-token>"
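To confirm that PostgreSQL (5432) and Redis (6379) are not bound to every interface, you can inspect the listening sockets on the DB server. This sketch parses `ss -ltnH` output (the variant with no header line), and `find_public_listeners` is a hypothetical helper.

It assumes Docker's default userland proxy, which makes published ports show up as listeners. With the proxy disabled, ports published only through iptables rules will not appear, so also probe the public IP from an outside host.

```bash
# Hypothetical helper: print listeners on the given ports that are bound to a
# wildcard address (reachable on every interface, including the public one).
# Usage on the DB server: ss -ltnH | find_public_listeners 5432 6379
find_public_listeners() {
  local ports=" $* "
  awk -v ports="$ports" '
    {
      la = $4                                  # Local Address:Port column
      port = la; sub(/.*:/, "", port)          # text after the last colon
      addr = la; sub(/:[^:]*$/, "", addr)      # text before the last colon
      wildcard = (addr == "0.0.0.0" || addr == "*" || addr == "[::]" || addr == "::")
      if (wildcard && index(ports, " " port " ")) { print "PUBLIC " la; found = 1 }
    }
    END { exit found }'
}
```

A non-zero exit status means at least one database port is listening on a wildcard address.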

Compliance

  • GDPR Compliance (if applicable)

    • Data processing agreement with Hetzner signed
    • User data stored in EU (Hetzner Falkenstein, Germany)
    • Privacy policy published
    • Cookie consent implemented
    • Data export functionality available
    • Data deletion functionality available
  • Security Documentation


Phase 4: Monitoring & Alerting

Infrastructure Monitoring

  • Netdata Installed (on all servers)

    # Verify Netdata running
    curl http://localhost:19999/api/v1/info
  • Netdata Alerts Configured

    • Custom alerts created (/etc/netdata/health.d/custom.conf)
    • Slack integration configured
    • Test alert sent successfully
    # Run as the netdata user, as Netdata's documentation recommends
    sudo -u netdata /usr/libexec/netdata/plugins.d/alarm-notify.sh test
  • Netdata Cloud (optional)

    • Servers claimed to Netdata Cloud
    • Team dashboard accessible
    • All servers showing in dashboard

Uptime Monitoring

Error Tracking

  • Sentry Configured

    • Sentry DSN in environment variables
    • Source maps uploaded (for stack traces)
    • Error alerts configured
    • Performance monitoring enabled
    • Test error sent successfully
    # Trigger test error in app
    curl https://app.your-domain.com/api/sentry-test
    # Verify error appears in Sentry dashboard
  • Sentry Alerts

    • High error rate alert (>100/hour)
    • New error type alert
    • Performance degradation alert (>1s endpoints)

Slack Notifications

  • Slack Channels Created

    • #deployments - Deployment notifications
    • #alerts - Critical alerts
    • #ops-alerts - Infrastructure alerts
    • #database-alerts - Database-specific alerts
  • Webhooks Configured

    • Netdata → #ops-alerts
    • UptimeRobot → #alerts
    • Sentry → #alerts
    • GitHub Actions → #deployments
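Slack incoming webhooks accept an HTTP POST with a JSON body containing a `text` field. Below is a minimal sketch of a notification helper that the backup scripts or deploy hooks could share.

- `slack_notify` and `json_escape` are hypothetical names.
- `SLACK_WEBHOOK_URL` comes from the optional variables above.
- The escaping covers only backslashes, double quotes, and newlines.

```bash
# Minimal JSON string escaping: backslash -> \\, quote -> \", newline -> \n
json_escape() {
  printf '%s\n' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Hypothetical helper: post a plain-text message to the configured webhook.
# Usage: slack_notify "Backup completed on $(hostname)"
slack_notify() {
  local text="$1"
  if [[ -z "${SLACK_WEBHOOK_URL:-}" ]]; then
    echo "SLACK_WEBHOOK_URL is not set" >&2
    return 1
  fi
  curl -sS -f -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"$(json_escape "$text")\"}" "$SLACK_WEBHOOK_URL"
}
```

Each channel needs its own webhook URL, so a script that posts to both #alerts and #ops-alerts would call this with a different `SLACK_WEBHOOK_URL` for each.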

Phase 5: Backup & Disaster Recovery

Automated Backups

  • Backup Scripts Deployed

    • /root/scripts/backup-db.sh on DB server
    • /root/scripts/test-restore.sh on DB server
    • Scripts executable (chmod +x)
    • Slack webhook configured in scripts
  • Cron Jobs Configured

    crontab -l
    # Expected:
    # 0 3 * * * /root/scripts/backup-db.sh >> /var/log/backup.log 2>&1
    # 0 4 * * 0 /root/scripts/test-restore.sh >> /var/log/backup.log 2>&1
  • Backup Verification

    • Manual backup test successful

      /root/scripts/backup-db.sh
      # Verify backup in S3
      s5cmd ls s3://ripplecore-backups/postgres/daily/
    • Backup size reasonable (>10MB compressed)

    • Slack notification received
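The `Backup verified in last 24 hours` and `>10MB compressed` checks can be automated against the bucket listing. This is a sketch under several assumptions:

- `s5cmd ls` prints its default (non-`--humanize`) rows as `YYYY/MM/DD HH:MM:SS <bytes> <key>`.
- Listed timestamps are UTC.
- GNU `date` is available.

`check_latest_backup` and the object names in the example are hypothetical.

```bash
# Hypothetical helper: the newest object in an `s5cmd ls` listing must be
# recent and larger than a minimum size.
# Usage: s5cmd ls s3://ripplecore-backups/postgres/daily/ | check_latest_backup
check_latest_backup() {
  local min_bytes="${1:-10485760}"   # 10 MiB
  local max_age="${2:-86400}"        # 24 hours
  local now="${3:-$(date -u +%s)}"
  local latest day clock size key ts

  # Keep object rows (third column numeric), skip DIR rows, newest last
  latest="$(awk '$3 ~ /^[0-9]+$/ && NF >= 4' | LC_ALL=C sort -k1,1 -k2,2 | tail -n 1)"
  if [[ -z "$latest" ]]; then
    echo "FAIL no backups listed"; return 1
  fi
  read -r day clock size key <<<"$latest"
  # s5cmd uses 2025/01/23; GNU date is happiest with 2025-01-23
  ts="$(date -u -d "${day//\//-} $clock" +%s)" || return 1
  if (( now - ts > max_age )); then
    echo "FAIL $key is $(( (now - ts) / 3600 ))h old"; return 1
  fi
  if (( size < min_bytes )); then
    echo "FAIL $key is only $size bytes"; return 1
  fi
  echo "OK $key ($size bytes)"
}
```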

Restore Testing

  • Manual Restore Test

    • Download backup from S3
    • Restore to test database
    • Verify data integrity
    • Document restore time (should be <30 minutes)
  • Automated Weekly Test

    • test-restore.sh executed successfully
    • All integrity checks passed
    • Slack notification received

Disaster Recovery

  • DR Documentation Complete

    • Disaster recovery runbook created
    • Emergency contacts documented
    • Credentials stored in 1Password
    • Team trained on DR procedures
  • DR Drill Scheduled

    • First quarterly drill scheduled
    • Drill checklist prepared
    • Team notified of drill schedule

Phase 6: Performance & Load Testing

Health Endpoints

  • Health Check Verification

    All endpoints should return 200 OK with valid JSON:

    curl https://app.your-domain.com/api/health | jq
    curl https://api.your-domain.com/api/health | jq
    curl https://www.your-domain.com/api/health | jq

    Expected response:

    {
      "status": "ok",
      "timestamp": "2025-01-23T12:00:00.000Z",
      "service": "ripplecore-app",
      "version": "1.0.0",
      "environment": "production",
      "checks": {
        "database": { "status": "ok" },
        "redis": { "status": "ok" }
      }
    }
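A `200 OK` alone does not prove that every dependency is healthy. Below is a rough, jq-free sketch, handy inside minimal containers, that fails unless every `"status"` field in the response equals `"ok"`.

It assumes the response shape shown above and does a text match rather than a real JSON parse. Where jq is available, a `jq -e` expression is the more robust choice. `check_health_json` is a hypothetical name.

```bash
# Hypothetical helper: every "status" field in the JSON must be "ok".
# Usage: curl -fsS https://app.your-domain.com/api/health | check_health_json
check_health_json() {
  local statuses bad
  # Extract all "status": "<value>" pairs (tolerates optional whitespace)
  statuses="$(grep -oE '"status"[[:space:]]*:[[:space:]]*"[^"]*"')"
  if [[ -z "$statuses" ]]; then
    echo "FAIL no status fields found"; return 1
  fi
  bad="$(grep -v '"ok"$' <<<"$statuses")"
  if [[ -n "$bad" ]]; then
    echo "FAIL $bad"; return 1
  fi
  echo "OK all $(wc -l <<<"$statuses" | tr -d ' ') status fields are ok"
}
```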

Response Time Testing

  • API Performance

    • Test major endpoints with curl or Postman
    • Health endpoint: <100ms
    • API endpoints: <200ms
    • Page load times: <1s
    # Create curl-format.txt once (one write-out variable per line)
    cat > curl-format.txt <<'EOF'
    time_namelookup:  %{time_namelookup}\n
    time_connect:  %{time_connect}\n
    time_appconnect:  %{time_appconnect}\n
    time_pretransfer:  %{time_pretransfer}\n
    time_starttransfer:  %{time_starttransfer}\n
    time_total:  %{time_total}\n
    EOF

    # Test response time
    curl -w "@curl-format.txt" -o /dev/null -s https://app.your-domain.com/api/health
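A single timing sample is noisy. The minimal sketch below collects many `time_total` samples and reports the 95th percentile in milliseconds, using the nearest-rank method, for comparison with the <100ms health-endpoint target. `p95_ms` is a hypothetical helper.

```bash
# Hypothetical helper: p95 (nearest rank) of newline-separated seconds, in ms.
# Usage:
#   for i in $(seq 1 50); do
#     curl -o /dev/null -s -w '%{time_total}\n' https://app.your-domain.com/api/health
#   done | p95_ms
p95_ms() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int((95 * NR + 99) / 100)    # integer form of ceil(0.95 * N)
      printf "%.0f\n", v[idx] * 1000
    }'
}
```

Integer arithmetic for the rank avoids floating-point rounding at exact multiples of 20 samples.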

Load Testing

  • Basic Load Test (Apache Bench)

    # Test health endpoint under load
    ab -n 1000 -c 50 https://app.your-domain.com/api/health
    
    # Expected: 100% success rate, <200ms avg response time
  • Realistic Load Test (optional - k6 or JMeter)

    • Simulate 100 concurrent users
    • Test critical user journeys
    • Monitor server resources during test
    • Verify auto-scaling triggers (if configured)
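To check the "100% success rate, <200ms avg" expectation without reading the `ab` report by hand, the sketch below parses its summary lines. It assumes ab's usual `Failed requests:`, `Non-2xx responses:`, and `Time per request: ... [ms] (mean)` lines. `check_ab_output` is a hypothetical helper.

Note that ab also counts responses whose body length differs from the first response as failed. A payload whose length varies can therefore show up under `Failed requests`.

```bash
# Hypothetical helper: fail on any failed/non-2xx request or a high mean latency.
# Usage: ab -n 1000 -c 50 https://app.your-domain.com/api/health | check_ab_output 200
check_ab_output() {
  local max_ms="${1:-200}"
  awk -v max_ms="$max_ms" '
    /^Failed requests:/   { failed = $3 }
    /^Non-2xx responses:/ { non2xx = $3 }   # only printed by ab when non-zero
    /^Time per request:/ && /\(mean\)/ { mean = $4 }   # skips the "(mean, across all ...)" line
    END {
      bad = 0
      if (failed + 0 > 0) { print "FAIL failed requests: " failed; bad = 1 }
      if (non2xx + 0 > 0) { print "FAIL non-2xx responses: " non2xx; bad = 1 }
      if (mean == "" || mean + 0 > max_ms + 0) { print "FAIL mean time per request: " mean " ms"; bad = 1 }
      if (!bad) print "OK mean " mean " ms, no failed or non-2xx requests"
      exit bad
    }'
}
```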

Database Performance

  • Connection Pool Testing

    • Verify max connections not exceeded under load
    • Test connection timeout behavior
    • Monitor query performance
    # Check active connections
    docker exec ripplecore-postgres psql -U ripplecore -c "SELECT count(*) FROM pg_stat_activity;"
  • Slow Query Monitoring

    • Enable slow query logging for queries >500ms (log_min_duration_statement = 500)
    • Review initial slow queries
    • Add missing indexes if needed
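The connection count above is easier to act on as a percentage of `max_connections`. This minimal sketch reads unaligned `psql -At` output (`active|max`) and warns above a threshold. The query counts only client backends, since `pg_stat_activity` also lists background processes in PostgreSQL 10 and later. `check_conn_usage` and the 80% default are illustrative assumptions.

```bash
# Hypothetical helper: warn when active client connections reach THRESHOLD %.
# Usage (on the DB server):
#   docker exec ripplecore-postgres psql -U ripplecore -At -c \
#     "SELECT count(*), current_setting('max_connections') FROM pg_stat_activity WHERE backend_type = 'client backend';" \
#     | check_conn_usage 80
check_conn_usage() {
  local threshold="${1:-80}" active max pct
  IFS='|' read -r active max
  if [[ ! "$active" =~ ^[0-9]+$ || ! "$max" =~ ^[0-9]+$ ]] || (( max == 0 )); then
    echo "FAIL unexpected input: ${active}|${max}"; return 1
  fi
  pct=$(( active * 100 / max ))
  if (( pct >= threshold )); then
    echo "WARN ${active}/${max} connections (${pct}%)"; return 1
  fi
  echo "OK ${active}/${max} connections (${pct}%)"
}
```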

Phase 7: Final Pre-Launch

User Acceptance Testing

  • Critical User Flows

    • User registration and email verification
    • User login and logout
    • Create organization
    • Invite team member
    • Log evidence (kindness, volunteer, donation, wellbeing)
    • View analytics dashboard
    • Export compliance reports
    • Admin license management
  • Cross-Browser Testing

    • Chrome (latest)
    • Firefox (latest)
    • Safari (latest)
    • Edge (latest)
    • Mobile Safari (iOS)
    • Chrome Mobile (Android)
  • Accessibility Testing

    • Keyboard navigation works
    • Screen reader tested (basic)
    • Color contrast checked
    • Focus indicators visible

Documentation

  • Public Documentation

    • User guide published
    • API documentation published (if public API)
    • Help center/FAQ published
    • Privacy policy published
    • Terms of service published
  • Internal Documentation

    • Infrastructure documentation complete
    • Deployment procedures documented
    • Troubleshooting guides created
    • Runbooks finalized

Team Readiness

  • On-Call Rotation

    • On-call schedule created
    • On-call contacts documented
    • Escalation procedures defined
  • Team Training

    • Team trained on monitoring dashboards
    • Team trained on deployment process
    • Team trained on incident response
    • Team trained on DR procedures

Launch Checklist

  • Pre-Launch

    • All critical bugs fixed
    • All smoke tests passing
    • Monitoring dashboards all green
    • Backup verified in last 24 hours
    • DR plan tested in last 30 days
    • Security scan completed (no critical issues)
    • Performance baseline established
  • Launch Day

    • On-call team available
    • Incident response plan ready
    • Rollback plan prepared
    • Stakeholders notified of launch
    • Monitoring actively watched
  • Post-Launch (first 24 hours)

    • Monitor error rates (should be <0.1%)
    • Monitor response times (should be <200ms)
    • Monitor server resources (CPU <70%, RAM <80%)
    • Verify backups completed successfully
    • Review user feedback
    • Document any issues encountered

Verification Script

Quick verification script to check critical services:

#!/bin/bash
# verify-deployment.sh
# Run on the production app server. The Netdata check queries the local agent,
# and the backup check needs s5cmd (configured on the DB server), so run those
# parts on the relevant servers as well.

echo "=========================================="
echo "Production Deployment Verification"
echo "=========================================="

# DNS checks
echo "✓ Checking DNS records..."
dig +short app.your-domain.com
dig +short api.your-domain.com
dig +short www.your-domain.com

# Health checks
echo "✓ Checking health endpoints..."
curl -fsS -o /dev/null https://app.your-domain.com/api/health || echo "❌ App health failed"
curl -fsS -o /dev/null https://api.your-domain.com/api/health || echo "❌ API health failed"
curl -fsS -o /dev/null https://www.your-domain.com/api/health || echo "❌ Web health failed"

# SSL checks (-servername sends SNI so Traefik presents the right certificate)
echo "✓ Checking SSL certificates..."
echo | openssl s_client -connect app.your-domain.com:443 -servername app.your-domain.com 2>/dev/null | openssl x509 -noout -dates

# Database connectivity (from app server). PostgreSQL and Redis do not speak
# HTTP, so curl always fails against them; test the TCP port with bash's /dev/tcp.
echo "✓ Checking database connectivity..."
timeout 3 bash -c '</dev/tcp/10.0.1.3/5432' 2>/dev/null || echo "❌ Database unreachable"

# Redis connectivity (from app server)
echo "✓ Checking Redis connectivity..."
timeout 3 bash -c '</dev/tcp/10.0.1.3/6379' 2>/dev/null || echo "❌ Redis unreachable"

# Monitoring (local Netdata agent)
echo "✓ Checking Netdata..."
curl -fsS -o /dev/null http://localhost:19999/api/v1/info || echo "❌ Netdata not running"

# Backups (run where s5cmd is configured)
echo "✓ Checking latest backup..."
s5cmd ls s3://ripplecore-backups/postgres/daily/ | tail -1

echo "=========================================="
echo "Verification Complete"
echo "=========================================="

Sign-Off

Deployment Authorized By:

  • Technical Lead: ____________ Date: __________
  • DevOps Lead: ____________ Date: __________
  • Security Lead: ____________ Date: __________
  • CTO/VP Engineering: ____________ Date: __________

Document Version: 1.0
Last Updated: 2025-01-23
Next Review: Before next major production deployment