Monitoring & Alerting Setup Guide

Comprehensive monitoring stack for infrastructure and application observability

Stack: Netdata (infrastructure) + UptimeRobot (uptime) + Sentry (errors) + Docker Logs
Coverage: Server metrics + application health + error tracking + uptime monitoring
Cost: €0/month (free tiers) or €26/month (paid Sentry)


Table of Contents

  • Monitoring Architecture
  • Netdata Setup
  • UptimeRobot Configuration
  • Sentry Integration
  • Alert Configuration
  • Dashboards
  • Log Management
  • Maintenance & Best Practices
  • Troubleshooting Monitoring

Monitoring Architecture

Monitoring Stack Overview

┌──────────────────────────────────────────────────────────────┐
│                   MONITORING LAYERS                           │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ LAYER 1: INFRASTRUCTURE METRICS (Netdata)                    │
├──────────────────────────────────────────────────────────────┤
│ Production App Server (10.0.1.2):                            │
│  • CPU usage per core, load average                          │
│  • RAM usage, swap, cache                                    │
│  • Disk I/O, read/write rates                                │
│  • Network traffic, connections                              │
│  • Docker container metrics                                  │
│                                                              │
│ Database Server (10.0.1.3):                                  │
│  • PostgreSQL connections, queries/sec                       │
│  • Redis memory usage, hit rate                              │
│  • Disk space, backup storage                                │
│                                                              │
│ CI/CD Server + Staging Server: Same metrics                 │
│                                                              │
│ Real-time: 1-second intervals                                │
│ Retention: 14 days (on-disk)                                 │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ LAYER 2: APPLICATION HEALTH (UptimeRobot)                    │
├──────────────────────────────────────────────────────────────┤
│ HTTPS Monitors (5-minute checks):                            │
│  • https://app.your-domain.com/api/health                    │
│  • https://api.your-domain.com/api/health                    │
│  • https://www.your-domain.com/api/health                    │
│  • https://staging.your-domain.com/api/health                │
│                                                              │
│ Keyword Monitors (check for "ok" in response):              │
│  • Verify health endpoint returns valid JSON                 │
│                                                              │
│ Alerts: Email, Slack, SMS (paid)                             │
│ Status Page: Public uptime history                           │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ LAYER 3: ERROR TRACKING (Sentry)                             │
├──────────────────────────────────────────────────────────────┤
│ Application Errors:                                          │
│  • JavaScript errors (frontend)                              │
│  • API errors (backend)                                      │
│  • Database errors                                           │
│  • Authentication failures                                   │
│                                                              │
│ Performance Monitoring:                                      │
│  • Slow API endpoints (>1s)                                  │
│  • Database query performance                                │
│  • Page load times                                           │
│                                                              │
│ User Context: Session ID, user ID, browser info              │
│ Source Maps: Stack traces with original code                │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ LAYER 4: LOG AGGREGATION (Docker Logs)                       │
├──────────────────────────────────────────────────────────────┤
│ Container Logs (JSON format):                                │
│  • Application logs (stdout/stderr)                          │
│  • Access logs (Traefik)                                     │
│  • Database logs (PostgreSQL)                                │
│  • Cache logs (Redis)                                        │
│                                                              │
│ Log Rotation: 10MB max size, 5 files retained               │
│ Retention: 7 days local storage                              │
│                                                              │
│ Optional: Loki + Grafana for advanced querying              │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ ALERT ROUTING                                                │
├──────────────────────────────────────────────────────────────┤
│ Critical (Immediate): Slack + Email + SMS (optional)         │
│  • All services down                                         │
│  • Database unreachable                                      │
│  • Disk >95% full                                            │
│                                                              │
│ High (15 min): Slack + Email                                 │
│  • Single service down                                       │
│  • CPU >90% sustained                                        │
│  • High error rate (>5%)                                     │
│                                                              │
│ Medium (1 hour): Email                                       │
│  • High CPU (>80%)                                           │
│  • Slow API endpoints                                        │
│  • Elevated error rate (>1%)                                 │
│                                                              │
│ Low (24 hour digest): Email                                  │
│  • Deprecation warnings                                      │
│  • Info logs                                                 │
└──────────────────────────────────────────────────────────────┘
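
Every HTTPS monitor and alert above keys off the same /api/health endpoint. A quick sanity check before wiring anything up, assuming the endpoint returns JSON containing "status":"ok" (the shape used throughout this guide):

# Spot-check the health endpoint every monitor depends on
curl -s https://app.your-domain.com/api/health
# Example expected shape: {"status":"ok"}

# Exit non-zero if "ok" is missing (usable in scripts)
curl -s https://app.your-domain.com/api/health | grep -q '"status":"ok"'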

Netdata Setup

Installation (On Each Server)

Install Netdata Agent:

# SSH into server
ssh root@server-ip

# Install Netdata (auto-detects OS and dependencies)
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Answer prompts:
# - Install required packages? Yes
# - Enable telemetry? No (privacy)
# - Claim to Netdata Cloud? Optional (for team dashboards)

# Verify installation
systemctl status netdata

# Access local dashboard
# http://server-ip:19999

Repeat for all 4 servers:

  • Production App Server (10.0.1.2)
  • Production DB Server (10.0.1.3)
  • CI/CD Server (10.0.1.4)
  • Staging Server (10.0.2.2)
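
Once all four installs are done, a short loop run from any host on the private network confirms every agent answers:

# Verify the Netdata agent responds on all four servers
for ip in 10.0.1.2 10.0.1.3 10.0.1.4 10.0.2.2; do
  if curl -sf "http://${ip}:19999/api/v1/info" > /dev/null; then
    echo "OK   ${ip}"
  else
    echo "FAIL ${ip}"
  fi
done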

Custom Alerts Configuration

File: /etc/netdata/health.d/custom.conf

# High CPU Usage Alert
alarm: high_cpu_usage
    on: system.cpu
  lookup: average -3m percentage foreach user,system
   every: 1m
    warn: $this > 80
    crit: $this > 95
    info: CPU usage is critically high ($this%)
      to: sysadmin

# Low Disk Space Alert
alarm: low_disk_space
    on: disk.space
  lookup: average -1m percentage of used
   every: 1m
    warn: $this > 80
    crit: $this > 90
    info: Disk space is running low ($this% used)
      to: sysadmin

# High RAM Usage Alert
alarm: high_memory_usage
    on: system.ram
  lookup: average -3m percentage
   every: 1m
    warn: $this > 85
    crit: $this > 95
    info: RAM usage is critically high ($this%)
      to: sysadmin

# High Load Average Alert
alarm: high_load_average
    on: system.load
  lookup: average -5m
   every: 1m
    warn: $this > 3
    crit: $this > 5
    info: System load average is high ($this)
      to: sysadmin

# PostgreSQL Connection Alert (DB server only)
alarm: high_postgres_connections
    on: postgres.connections_utilization
  lookup: average -1m
   every: 1m
    warn: $this > 80
    crit: $this > 95
    info: PostgreSQL connections are at $this% capacity
      to: dba

# Redis Memory Alert (DB server only)
alarm: high_redis_memory
    on: redis.memory
  lookup: average -1m
   every: 1m
    warn: $this > (768 * 1024 * 1024)
    crit: $this > (900 * 1024 * 1024)
    info: Redis memory usage is $this bytes
      to: dba

# Docker Container Down Alert
alarm: docker_container_down
    on: docker.containers
  lookup: average -30s
   every: 10s
    crit: $this == 0
    info: Docker container stopped unexpectedly
      to: sysadmin

Apply Configuration:

# Restart Netdata to apply alerts
systemctl restart netdata

# Verify alerts loaded
curl http://localhost:19999/api/v1/alarms

Slack Integration

Configure Slack Webhook in Netdata:

File: /etc/netdata/health_alarm_notify.conf

# Enable Slack notifications
SEND_SLACK="YES"

# Slack webhook URL (create in Slack: Apps → Incoming Webhooks)
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

# Default Slack channel
DEFAULT_RECIPIENT_SLACK="#alerts"

# Role-specific channels
role_recipients_slack[sysadmin]="#ops-alerts"
role_recipients_slack[dba]="#database-alerts"

# Customize notification format
SLACK_MESSAGE_FORMAT='\
{ \
  "text": "${status} - ${alarm}", \
  "blocks": [ \
    { \
      "type": "section", \
      "text": { \
        "type": "mrkdwn", \
        "text": "*${status}* - ${alarm}" \
      } \
    }, \
    { \
      "type": "section", \
      "fields": [ \
        { \
          "type": "mrkdwn", \
          "text": "*Server:*\n${host}" \
        }, \
        { \
          "type": "mrkdwn", \
          "text": "*Value:*\n${value}" \
        }, \
        { \
          "type": "mrkdwn", \
          "text": "*Severity:*\n${severity}" \
        }, \
        { \
          "type": "mrkdwn", \
          "text": "*Time:*\n${date} ${time}" \
        } \
      ] \
    } \
  ] \
}'

Test Slack Integration:

# Send test notification
/usr/libexec/netdata/plugins.d/alarm-notify.sh test

# Check Slack channel for test message

Netdata Cloud (Optional - Team Dashboards)

Benefits:

  • Centralized dashboard for all servers
  • Team access with role-based permissions
  • Alert history and analytics
  • Free for <5 nodes

Setup:

# Claim each server to Netdata Cloud
netdata-claim.sh -token=YOUR_CLAIM_TOKEN -rooms=YOUR_ROOM_ID -url=https://app.netdata.cloud

# Access team dashboard
# https://app.netdata.cloud

UptimeRobot Configuration

Account Setup

  1. Create Account: https://uptimerobot.com (Free tier)
  2. Upgrade (optional): Pro plan ($7/month) for 1-minute checks + SMS alerts

Monitor Configuration

Monitor 1: Main App Health

Monitor Type: HTTPS
URL: https://app.your-domain.com/api/health
Friendly Name: Production App (Health)
Monitoring Interval: 5 minutes
Monitor Timeout: 30 seconds

HTTP Settings:
  Method: GET
  Expected Status: 200

Keyword Settings:
  Keyword Type: Keyword exists
  Keyword: "ok" # Verify JSON contains "ok" status

Alert Contacts:
  - Email: admin@your-domain.com
  - Slack: #alerts (via webhook)
  - SMS: +1234567890 (Pro plan only)

Monitor 2: API Server Health

Monitor Type: HTTPS
URL: https://api.your-domain.com/api/health
Friendly Name: Production API (Health)
Monitoring Interval: 5 minutes
Expected Status: 200
Keyword: "ok"

Monitor 3: Marketing Website Health

Monitor Type: HTTPS
URL: https://www.your-domain.com/api/health
Friendly Name: Production Web (Health)
Monitoring Interval: 5 minutes
Expected Status: 200
Keyword: "ok"

Monitor 4: Staging Environment Health

Monitor Type: HTTPS
URL: https://staging.your-domain.com/api/health
Friendly Name: Staging (Health)
Monitoring Interval: 5 minutes
Expected Status: 200
Keyword: "ok"

Monitor 5: Database Server TCP (Private - from CI/CD server)

Monitor Type: Port
IP/Host: 10.0.1.3
Port: 5432
Friendly Name: PostgreSQL (Port Check)
Monitoring Interval: 5 minutes

Monitor 6: Redis Server TCP (Private - from CI/CD server)

Monitor Type: Port
IP/Host: 10.0.1.3
Port: 6379
Friendly Name: Redis (Port Check)
Monitoring Interval: 5 minutes
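
Note that UptimeRobot's probes run from the public internet, so they cannot reach private IPs like 10.0.1.3; Monitors 5 and 6 therefore need a local stand-in. A minimal sketch run from the CI/CD server via cron, reusing the Slack webhook placeholder from earlier:

#!/bin/bash
# Local TCP checks for services UptimeRobot cannot reach
WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

check_port() {
  local name="$1" host="$2" port="$3"
  if ! nc -z -w 5 "$host" "$port"; then
    curl -s -X POST "$WEBHOOK" \
      -H 'Content-Type: application/json' \
      -d "{\"text\": \"🔴 ${name} (${host}:${port}) is unreachable\"}"
  fi
}

check_port "PostgreSQL" 10.0.1.3 5432
check_port "Redis"      10.0.1.3 6379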

Slack Integration (UptimeRobot)

Create Slack Webhook:

  1. Slack: Apps → Incoming Webhooks → Add to Slack
  2. Choose channel: #alerts
  3. Copy webhook URL

Configure in UptimeRobot:

Settings → Alert Contacts → Add Alert Contact

Contact Type: Web-Hook
Friendly Name: Slack #alerts
URL to Notify: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
POST Value (JSON):
{
  "text": "*monitorFriendlyName* is *alertTypeFriendlyName*",
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*alertTypeFriendlyName*: *monitorFriendlyName*"
      }
    },
    {
      "type": "section",
      "fields": [
        {
          "type": "mrkdwn",
          "text": "*URL:*\nmonitorURL"
        },
        {
          "type": "mrkdwn",
          "text": "*Time:*\nalertDateTime"
        },
        {
          "type": "mrkdwn",
          "text": "*Reason:*\nalertDetails"
        }
      ]
    }
  ]
}

Send when: Down, Up

Public Status Page (Optional)

Create Status Page:

UptimeRobot → Status Pages → Create Status Page

Status Page Name: RippleCore Status
Custom URL: ripplecore-status
Monitors to Display:
  - Production App
  - Production API
  - Marketing Website

Design:
  Logo: Upload your logo
  Colors: Purple (#0d1594), Teal (#26dbd9)
  Language: English

Public URL: https://stats.uptimerobot.com/ripplecore-status

Embed in Footer (optional):

<!-- apps/web/components/footer.tsx -->
<a href="https://stats.uptimerobot.com/ripplecore-status" target="_blank" rel="noopener noreferrer">
  System Status
</a>

Sentry Integration

Sentry Setup (Already Configured in Project)

Verify Configuration:

File: packages/observability/sentry.ts

import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,

  // Performance monitoring
  tracesSampleRate: process.env.NODE_ENV === "production" ? 0.1 : 1.0,

  // Session replay (user interactions)
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0,

  // Error filtering
  beforeSend(event, hint) {
    // Filter out low-priority errors
    if (event.exception?.values?.[0]?.type === "ChunkLoadError") {
      return null; // Ignore chunk load errors (user navigated away)
    }
    return event;
  },
});

Environment Variable:

# apps/app/.env.local (and api, web)
SENTRY_DSN=https://your-sentry-dsn@sentry.io/project-id
SENTRY_ORG=your-org
SENTRY_PROJECT=ripplecore-app
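
To confirm the DSN is wired up end to end, sentry-cli can push a throwaway event (assumes sentry-cli is installed and SENTRY_DSN is exported in the shell):

# Install sentry-cli, then send a test event (reads SENTRY_DSN from the environment)
curl -sL https://sentry.io/get-cli/ | sh
sentry-cli send-event -m "Monitoring setup test event"
# The event should appear in the Sentry project's Issues view within seconds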

Sentry Alerts Configuration

Navigate to: Sentry → Alerts → Create Alert

Alert 1: High Error Rate

Alert Name: High Error Rate (Production)
Environment: production
Conditions:
  - When: The issue is seen more than 100 times in 1 hour
  - And: Issue is unresolved

Actions:
  - Send notification to: #alerts (Slack)
  - Send email to: admin@your-domain.com

Alert 2: New Error Type

Alert Name: New Error Type (Production)
Environment: production
Conditions:
  - When: A new issue is created
  - And: Environment equals production

Actions:
  - Send notification to: #alerts (Slack)

Alert 3: Performance Degradation

Alert Name: Slow API Endpoints
Environment: production
Conditions:
  - When: The transaction duration is greater than 1000ms
  - And: Seen more than 50 times in 10 minutes

Actions:
  - Send notification to: #performance-alerts (Slack)

Alert Configuration

Alert Severity Matrix

| Severity | Condition | Response Time | Channels | Example |
|----------|-----------|---------------|----------|---------|
| 🔴 Critical | All services down, database unreachable | Immediate (< 5 min) | Slack + Email + SMS | postgresql.service stopped |
| 🟠 High | Single service down, disk >90% | 15 minutes | Slack + Email | ripplecore-app container exited |
| 🟡 Medium | High resource usage, slow endpoints | 1 hour | Email | CPU usage >80% for 5 minutes |
| 🟢 Low | Info logs, deprecation warnings | 24 hour digest | Email | Dependency update available |

Slack Alert Routing

Channels:

  • #alerts - All critical and high severity alerts
  • #ops-alerts - Infrastructure alerts (CPU, RAM, disk)
  • #database-alerts - Database-specific alerts (connections, queries)
  • #performance-alerts - Slow endpoints, high latency
  • #deployments - Deployment notifications (already configured)

Webhook Configuration:

# In each server's /etc/netdata/health_alarm_notify.conf
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

# Role-based routing
role_recipients_slack[sysadmin]="#ops-alerts"
role_recipients_slack[dba]="#database-alerts"

Email Alert Routing

Distribution Lists:

Critical Alerts:
  - admin@your-domain.com
  - oncall@your-domain.com

High Priority:
  - admin@your-domain.com
  - devops@your-domain.com

Medium/Low Priority:
  - devops@your-domain.com (daily digest)

UptimeRobot Email Format:

Subject: [UP/DOWN] Production App is down

Monitor: Production App (Health)
URL: https://app.your-domain.com/api/health
Status: Down
Reason: Connection timeout after 30 seconds
Time: 2025-01-23 14:30:00 UTC

View details: https://uptimerobot.com/monitor/12345

Dashboards

Netdata Dashboard

Access: http://server-ip:19999 or https://netdata.your-domain.com (via Traefik)

Key Metrics to Monitor:

System Overview:

  • CPU usage (per core and total)
  • RAM usage and swap
  • Disk I/O and space
  • Network traffic

Docker Containers:

  • Container CPU and memory usage
  • Container network I/O
  • Container lifecycle events

PostgreSQL (DB server):

  • Active connections
  • Queries per second
  • Cache hit ratio
  • Transaction rate

Redis (DB server):

  • Memory usage
  • Cache hit rate
  • Commands per second
  • Evicted keys

Custom Grafana Dashboard (Optional)

Prerequisites:

  • Install Grafana on CI/CD server
  • Install Prometheus for metrics collection
  • Configure Netdata to export to Prometheus

Installation:

# Install Grafana
docker run -d \
  --name grafana \
  -p 3000:3000 \
  -v grafana-data:/var/lib/grafana \
  grafana/grafana:latest

# Access: http://cicd-server-ip:3000
# Default login: admin/admin
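
Netdata exposes its metrics in Prometheus text format at /api/v1/allmetrics, so Prometheus only needs a scrape job pointed at each agent; a minimal sketch for prometheus.yml covering all four servers:

# prometheus.yml - scrape Netdata's built-in Prometheus endpoint
scrape_configs:
  - job_name: netdata
    metrics_path: /api/v1/allmetrics
    params:
      format: [prometheus]
    static_configs:
      - targets:
          - '10.0.1.2:19999'
          - '10.0.1.3:19999'
          - '10.0.1.4:19999'
          - '10.0.2.2:19999'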

Import Pre-Built Dashboards:

  • Node Exporter: Dashboard ID 1860
  • Docker Monitoring: Dashboard ID 893
  • PostgreSQL: Dashboard ID 9628
  • Redis: Dashboard ID 11835

UptimeRobot Dashboard

Public Status Page: https://stats.uptimerobot.com/your-status-page

Metrics Displayed:

  • 24-hour uptime percentage
  • 7-day uptime percentage
  • 30-day uptime percentage
  • Average response time
  • Incident history

Private Dashboard: https://uptimerobot.com/dashboard

  • Real-time monitor status
  • Alert logs
  • Response time graphs
  • Downtime analysis

Sentry Dashboard

Access: https://sentry.io/organizations/your-org/projects/ripplecore-app/

Key Views:

Issues:

  • Unresolved errors (prioritize by volume)
  • New issues (last 24 hours)
  • Regressed issues (previously resolved)

Performance:

  • Slowest transactions (>1s)
  • Most frequent transactions
  • Transaction trends

Releases:

  • Error rate by deployment version
  • Compare releases for regression detection

Log Management

Docker Log Configuration

File: docker-compose.yml or Dokploy logging config

services:
  app:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "5"
        compress: "true"
        labels: "service,environment"

Viewing Logs

Real-time Logs:

# View live logs for specific container
docker logs ripplecore-app --follow --tail 100

# Filter by severity (docker logs sends the container's stderr to the host's
# stderr, so merge streams before piping to grep)
docker logs ripplecore-app --follow 2>&1 | grep ERROR

# Search for a specific pattern
docker logs ripplecore-app --since 1h 2>&1 | grep "authentication failed"

Aggregate Logs (all containers):

# View all container logs
docker compose logs --follow

# Filter by service
docker compose logs app --follow

Advanced: Loki + Grafana (Optional)

Benefits:

  • Centralized log aggregation across all servers
  • Query logs with LogQL (like SQL for logs)
  • Correlate logs with metrics in Grafana
  • Long-term retention (30+ days)

Quick Setup:

# docker-compose.yml (on CI/CD server)
version: '3.8'

services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml

  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log:/var/log
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:
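
The compose file mounts a promtail-config.yml that is not shown above; a minimal sketch that tails Docker's JSON log files and pushes them to the Loki container:

# promtail-config.yml (minimal sketch)
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log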

Query Examples (in Grafana):

# All errors in last hour
{job="docker"} |= "ERROR" | json

# Slow API requests
{job="docker", container="ripplecore-app"}
  | json
  | duration > 1s

# Authentication failures
{job="docker"} |= "authentication failed"
  | json
  | line_format "{{.userId}} - {{.message}}"

Maintenance & Best Practices

Daily Checks (Automated)

Health Check Script (/root/scripts/daily-health-check.sh):

#!/bin/bash
# Run daily at 9 AM via cron

# Check Netdata is running on all servers
servers=("10.0.1.2" "10.0.1.3" "10.0.1.4" "10.0.2.2")
for server in "${servers[@]}"; do
  if curl -f http://$server:19999/api/v1/info > /dev/null 2>&1; then
    echo "✅ Netdata running on $server"
  else
    echo "❌ Netdata down on $server" | mail -s "Alert: Netdata Down" admin@your-domain.com
  fi
done

# Check UptimeRobot monitors
uptime_api_key="YOUR_API_KEY"
curl -X POST https://api.uptimerobot.com/v2/getMonitors \
  -d "api_key=$uptime_api_key&format=json" \
  | jq '.monitors[] | select(.status != 2) | {name: .friendly_name, status: .status}'

# Check Sentry error rate
# (Use Sentry API to fetch error counts)
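
The script comments assume a 9 AM run; the matching crontab entry, using the path named above:

# Install with: crontab -e
0 9 * * * /root/scripts/daily-health-check.sh >> /var/log/daily-health-check.log 2>&1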

Weekly Reviews

Metrics to Review:

  • Uptime percentage (target: 99.5%+)
  • Average response time (target: <200ms)
  • Error rate (target: <0.1%)
  • Disk space growth trend
  • Resource usage trends (CPU, RAM)

Action Items:

  • Review and close resolved Sentry issues
  • Archive old logs (>7 days; sketch below)
  • Update alert thresholds if needed
  • Review alert false positives
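
For the log-archiving item, a hedged one-liner pair, assuming application logs also land under /var/log/ripplecore (adjust the path; Docker's json-file rotation already caps container logs):

# Compress logs older than 7 days; drop archives older than 30
find /var/log/ripplecore -name '*.log' -mtime +7 -exec gzip {} \;
find /var/log/ripplecore -name '*.log.gz' -mtime +30 -delete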

Monthly Audits

Comprehensive Review:

  • Analyze downtime incidents (root cause, prevention)
  • Review alert response times
  • Update alert routing if team changes
  • Test disaster recovery procedures
  • Verify all monitoring agents are updated (version check below)
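
For the agent-update item: kickstart installs of Netdata ship an updater script (the path below is the default for kickstart installs):

# Check the installed Netdata version
netdata -v

# Run the updater manually (kickstart installs also schedule it via cron)
/usr/libexec/netdata/netdata-updater.sh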

Troubleshooting Monitoring

Issue: Netdata Not Starting

Symptoms:

systemctl status netdata
# Output: Failed to start netdata

Solution:

# Check logs
journalctl -u netdata -n 50

# Common issues:
# 1. Port 19999 already in use
sudo lsof -i :19999
sudo kill <PID>

# 2. Permissions issue
sudo chown -R netdata:netdata /var/lib/netdata
sudo chown -R netdata:netdata /var/cache/netdata

# Restart
sudo systemctl restart netdata

Issue: Alerts Not Firing

Symptoms:

  • High CPU but no alert received

Debugging:

# Check alert configuration loaded
curl http://localhost:19999/api/v1/alarms | jq '.alarms[] | select(.name == "high_cpu_usage")'

# Verify Slack webhook
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H 'Content-Type: application/json' \
  -d '{"text": "Test alert"}'

# Check notification config
grep SLACK /etc/netdata/health_alarm_notify.conf

Issue: UptimeRobot Showing Down (but site is up)

Possible Causes:

  1. Health endpoint responding slowly (>30s timeout)
  2. Keyword "ok" not found in response
  3. SSL certificate issue

Solution:

# Test health endpoint manually
time curl -v https://app.your-domain.com/api/health

# Verify response contains "ok"
curl -s https://app.your-domain.com/api/health | grep -o '"status":"ok"'

# Check SSL certificate
curl -vI https://app.your-domain.com 2>&1 | grep "SSL certificate"
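
The grep above depends on curl's verbose wording, which changes between versions; openssl reports the certificate validity window directly:

# Check certificate dates directly
echo | openssl s_client -connect app.your-domain.com:443 \
  -servername app.your-domain.com 2>/dev/null | openssl x509 -noout -dates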

Related Documentation

  • Infrastructure Overview: See ARCHITECTURE.md
  • CI/CD Pipeline: See CI_CD_PIPELINE.md
  • Backup & DR: See BACKUP_RECOVERY.md

Document Version: 1.0
Last Updated: 2025-01-23
Review Cycle: Quarterly