Monitoring & Alerting
Comprehensive monitoring stack for infrastructure and application observability
Stack: Netdata (infrastructure) + UptimeRobot (uptime) + Sentry (errors) + Docker Logs
Coverage: Server metrics + application health + error tracking + uptime monitoring
Cost: €0/month (free tiers) or €26/month (paid Sentry)
Table of Contents
- Monitoring Architecture
- Netdata Setup
- UptimeRobot Configuration
- Sentry Integration
- Alert Configuration
- Dashboards
Monitoring Architecture
Monitoring Stack Overview
┌──────────────────────────────────────────────────────────────┐
│ MONITORING LAYERS │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ LAYER 1: INFRASTRUCTURE METRICS (Netdata) │
├──────────────────────────────────────────────────────────────┤
│ Production App Server (10.0.1.2): │
│ • CPU usage per core, load average │
│ • RAM usage, swap, cache │
│ • Disk I/O, read/write rates │
│ • Network traffic, connections │
│ • Docker container metrics │
│ │
│ Database Server (10.0.1.3): │
│ • PostgreSQL connections, queries/sec │
│ • Redis memory usage, hit rate │
│ • Disk space, backup storage │
│ │
│ CI/CD Server + Staging Server: Same metrics │
│ │
│ Real-time: 1-second intervals │
│ Retention: 14 days (on-disk) │
└──────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────┐
│ LAYER 2: APPLICATION HEALTH (UptimeRobot) │
├──────────────────────────────────────────────────────────────┤
│ HTTPS Monitors (5-minute checks): │
│ • https://app.your-domain.com/api/health │
│ • https://api.your-domain.com/api/health │
│ • https://www.your-domain.com/api/health │
│ • https://staging.your-domain.com/api/health │
│ │
│ Keyword Monitors (check for "ok" in response): │
│ • Verify health endpoint returns valid JSON │
│ │
│ Alerts: Email, Slack, SMS (paid) │
│ Status Page: Public uptime history │
└──────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────┐
│ LAYER 3: ERROR TRACKING (Sentry) │
├──────────────────────────────────────────────────────────────┤
│ Application Errors: │
│ • JavaScript errors (frontend) │
│ • API errors (backend) │
│ • Database errors │
│ • Authentication failures │
│ │
│ Performance Monitoring: │
│ • Slow API endpoints (>1s) │
│ • Database query performance │
│ • Page load times │
│ │
│ User Context: Session ID, user ID, browser info │
│ Source Maps: Stack traces with original code │
└──────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────┐
│ LAYER 4: LOG AGGREGATION (Docker Logs) │
├──────────────────────────────────────────────────────────────┤
│ Container Logs (JSON format): │
│ • Application logs (stdout/stderr) │
│ • Access logs (Traefik) │
│ • Database logs (PostgreSQL) │
│ • Cache logs (Redis) │
│ │
│ Log Rotation: 10MB max size, 5 files retained │
│ Retention: 7 days local storage │
│ │
│ Optional: Loki + Grafana for advanced querying │
└──────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────┐
│ ALERT ROUTING │
├──────────────────────────────────────────────────────────────┤
│ Critical (Immediate): Slack + Email + SMS (optional) │
│ • All services down │
│ • Database unreachable │
│ • Disk >95% full │
│ │
│ High (15 min): Slack + Email │
│ • Single service down │
│ • CPU >90% sustained │
│ • High error rate (>5%) │
│ │
│ Medium (1 hour): Email │
│ • High CPU (>80%) │
│ • Slow API endpoints │
│ • Elevated error rate (>1%) │
│ │
│ Low (24 hour digest): Email │
│ • Deprecation warnings │
│ • Info logs │
└──────────────────────────────────────────────────────────────┘
Netdata Setup
Installation (On Each Server)
Install Netdata Agent:
# SSH into server
ssh root@server-ip
# Install Netdata (auto-detects OS and dependencies)
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
# Answer prompts:
# - Install required packages? Yes
# - Enable telemetry? No (privacy)
# - Claim to Netdata Cloud? Optional (for team dashboards)
# Verify installation
systemctl status netdata
# Access local dashboard
# http://server-ip:19999
Repeat for all 4 servers:
- Production App Server (10.0.1.2)
- Production DB Server (10.0.1.3)
- CI/CD Server (10.0.1.4)
- Staging Server (10.0.2.2)
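Once the agent is installed on all four servers, a quick loop over their private IPs confirms each node is up and answering on its API; a minimal sketch (run from any machine that can reach the private network, e.g. the CI/CD server):
# Verify the Netdata agent responds on every server
for host in 10.0.1.2 10.0.1.3 10.0.1.4 10.0.2.2; do
  if curl -sf "http://$host:19999/api/v1/info" > /dev/null; then
    echo "OK   Netdata responding on $host"
  else
    echo "FAIL Netdata not responding on $host"
  fi
done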
Custom Alerts Configuration
File: /etc/netdata/health.d/custom.conf
# High CPU Usage Alert
alarm: high_cpu_usage
on: system.cpu
lookup: average -3m percentage foreach user,system
every: 1m
warn: $this > 80
crit: $this > 95
info: CPU usage is critically high ($this%)
to: sysadmin
# Low Disk Space Alert
alarm: low_disk_space
on: disk.space
lookup: average -1m percentage of used
every: 1m
warn: $this > 80
crit: $this > 90
info: Disk space is running low ($this% used)
to: sysadmin
# High RAM Usage Alert
alarm: high_memory_usage
on: system.ram
lookup: average -3m percentage
every: 1m
warn: $this > 85
crit: $this > 95
info: RAM usage is critically high ($this%)
to: sysadmin
# High Load Average Alert
alarm: high_load_average
on: system.load
lookup: average -5m
every: 1m
warn: $this > 3
crit: $this > 5
info: System load average is high ($this)
to: sysadmin
# PostgreSQL Connection Alert (DB server only)
alarm: high_postgres_connections
on: postgres.connections_utilization
lookup: average -1m
every: 1m
warn: $this > 80
crit: $this > 95
info: PostgreSQL connections are at $this% capacity
to: dba
# Redis Memory Alert (DB server only)
alarm: high_redis_memory
on: redis.memory
lookup: average -1m
every: 1m
warn: $this > 768MB
crit: $this > 900MB
info: Redis memory usage is $this
to: dba
# Docker Container Down Alert
alarm: docker_container_down
on: docker.containers
lookup: average -30s
every: 10s
crit: $this == 0
info: Docker container stopped unexpectedly
to: sysadmin
Apply Configuration:
# Restart Netdata to apply alerts
systemctl restart netdata
# Verify alerts loaded
curl http://localhost:19999/api/v1/alarms
Slack Integration
Configure Slack Webhook in Netdata:
File: /etc/netdata/health_alarm_notify.conf
# Enable Slack notifications
SEND_SLACK="YES"
# Slack webhook URL (create in Slack: Apps → Incoming Webhooks)
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
# Default Slack channel
DEFAULT_RECIPIENT_SLACK="#alerts"
# Role-specific channels
role_recipients_slack[sysadmin]="#ops-alerts"
role_recipients_slack[dba]="#database-alerts"
# Customize notification format
SLACK_MESSAGE_FORMAT='\
{ \
"text": "${status} - ${alarm}", \
"blocks": [ \
{ \
"type": "section", \
"text": { \
"type": "mrkdwn", \
"text": "*${status}* - ${alarm}" \
} \
}, \
{ \
"type": "section", \
"fields": [ \
{ \
"type": "mrkdwn", \
"text": "*Server:*\n${host}" \
}, \
{ \
"type": "mrkdwn", \
"text": "*Value:*\n${value}" \
}, \
{ \
"type": "mrkdwn", \
"text": "*Severity:*\n${severity}" \
}, \
{ \
"type": "mrkdwn", \
"text": "*Time:*\n${date} ${time}" \
} \
] \
} \
] \
}'
Test Slack Integration:
# Send test notification
/usr/libexec/netdata/plugins.d/alarm-notify.sh test
# Check Slack channel for test message
Netdata Cloud (Optional - Team Dashboards)
Benefits:
- Centralized dashboard for all servers
- Team access with role-based permissions
- Alert history and analytics
- Free for <5 nodes
Setup:
# Claim each server to Netdata Cloud
netdata-claim.sh -token=YOUR_CLAIM_TOKEN -rooms=YOUR_ROOM_ID -url=https://app.netdata.cloud
# Access team dashboard
# https://app.netdata.cloud
UptimeRobot Configuration
Account Setup
- Create Account: https://uptimerobot.com (Free tier)
- Upgrade (optional): Pro plan ($7/month) for 1-minute checks + SMS alerts
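The monitors described in the next section can be created in the web dashboard, or scripted against UptimeRobot's API v2 if you prefer; a hedged sketch for Monitor 1 (API key from your account settings; type 1 = HTTP(s), interval in seconds):
# Optional: create the Production App health monitor via the API instead of the dashboard
curl -s -X POST https://api.uptimerobot.com/v2/newMonitor \
  -d "api_key=YOUR_API_KEY" \
  -d "format=json" \
  -d "type=1" \
  -d "friendly_name=Production App (Health)" \
  -d "url=https://app.your-domain.com/api/health" \
  -d "interval=300"
Keyword checks use type=2 with additional keyword parameters; see UptimeRobot's API reference for the exact fields.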
Monitor Configuration
Monitor 1: Main App Health
Monitor Type: HTTPS
URL: https://app.your-domain.com/api/health
Friendly Name: Production App (Health)
Monitoring Interval: 5 minutes
Monitor Timeout: 30 seconds
HTTP Settings:
Method: GET
Expected Status: 200
Keyword Settings:
Keyword Type: Keyword exists
Keyword: "ok" # Verify JSON contains "ok" status
Alert Contacts:
- Email: admin@your-domain.com
- Slack: #alerts (via webhook)
- SMS: +1234567890 (Pro plan only)
Monitor 2: API Server Health
Monitor Type: HTTPS
URL: https://api.your-domain.com/api/health
Friendly Name: Production API (Health)
Monitoring Interval: 5 minutes
Expected Status: 200
Keyword: "ok"Monitor 3: Marketing Website Health
Monitor Type: HTTPS
URL: https://www.your-domain.com/api/health
Friendly Name: Production Web (Health)
Monitoring Interval: 5 minutes
Expected Status: 200
Keyword: "ok"Monitor 4: Staging Environment Health
Monitor Type: HTTPS
URL: https://staging.your-domain.com/api/health
Friendly Name: Staging (Health)
Monitoring Interval: 5 minutes
Expected Status: 200
Keyword: "ok"Monitor 5: Database Server TCP (Private - from CI/CD server)
Monitor Type: Port
IP/Host: 10.0.1.3
Port: 5432
Friendly Name: PostgreSQL (Port Check)
Monitoring Interval: 5 minutes
Monitor 6: Redis Server TCP (Private - from CI/CD server)
Monitor Type: Port
IP/Host: 10.0.1.3
Port: 6379
Friendly Name: Redis (Port Check)
Monitoring Interval: 5 minutes
Slack Integration (UptimeRobot)
Create Slack Webhook:
- Slack: Apps → Incoming Webhooks → Add to Slack
- Choose channel: #alerts
- Copy the webhook URL
Configure in UptimeRobot:
Settings → Alert Contacts → Add Alert Contact
Contact Type: Web-Hook
Friendly Name: Slack #alerts
URL to Notify: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
POST Value (JSON):
{
"text": "*monitorFriendlyName* is *alertTypeFriendlyName*",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*alertTypeFriendlyName*: *monitorFriendlyName*"
}
},
{
"type": "section",
"fields": [
{
"type": "mrkdwn",
"text": "*URL:*\nmonitorURL"
},
{
"type": "mrkdwn",
"text": "*Time:*\nalertDateTime"
},
{
"type": "mrkdwn",
"text": "*Reason:*\nalertDetails"
}
]
}
]
}
Send when: Down, Up
Public Status Page (Optional)
Create Status Page:
UptimeRobot → Status Pages → Create Status Page
Status Page Name: RippleCore Status
Custom URL: ripplecore-status
Monitors to Display:
- Production App
- Production API
- Marketing Website
Design:
Logo: Upload your logo
Colors: Purple (#0d1594), Teal (#26dbd9)
Language: English
Public URL: https://stats.uptimerobot.com/ripplecore-status
Embed in Footer (optional):
<!-- apps/web/components/footer.tsx -->
<a href="https://stats.uptimerobot.com/ripplecore-status" target="_blank">
System Status
</a>
Sentry Integration
Sentry Setup (Already Configured in Project)
Verify Configuration:
File: packages/observability/sentry.ts
import * as Sentry from "@sentry/nextjs";
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV,
// Performance monitoring
tracesSampleRate: process.env.NODE_ENV === "production" ? 0.1 : 1.0,
// Session replay (user interactions)
replaysSessionSampleRate: 0.1,
replaysOnErrorSampleRate: 1.0,
// Error filtering
beforeSend(event, hint) {
// Filter out low-priority errors
if (event.exception?.values?.[0]?.type === "ChunkLoadError") {
return null; // Ignore chunk load errors (user navigated away)
}
return event;
},
});
Environment Variables:
# apps/app/.env.local (and api, web)
SENTRY_DSN=https://your-sentry-dsn@sentry.io/project-id
SENTRY_ORG=your-org
SENTRY_PROJECT=ripplecore-app
Sentry Alerts Configuration
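Before creating alert rules, it helps to confirm that events actually reach the project. One quick way - assuming sentry-cli is installed and SENTRY_DSN is exported as above - is to send a throwaway test event:
# Send a one-off test event to verify the DSN and project wiring
export SENTRY_DSN=https://your-sentry-dsn@sentry.io/project-id
sentry-cli send-event -m "Monitoring setup test event"
# The event should appear in Sentry → Issues within a minute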
Navigate to: Sentry → Alerts → Create Alert
Alert 1: High Error Rate
Alert Name: High Error Rate (Production)
Environment: production
Conditions:
- When: The issue is seen more than 100 times in 1 hour
- And: Issue is unresolved
Actions:
- Send notification to: #alerts (Slack)
- Send email to: admin@your-domain.com
Alert 2: New Error Type
Alert Name: New Error Type (Production)
Environment: production
Conditions:
- When: A new issue is created
- And: Environment equals production
Actions:
- Send notification to: #alerts (Slack)
Alert 3: Performance Degradation
Alert Name: Slow API Endpoints
Environment: production
Conditions:
- When: The transaction duration is greater than 1000ms
- And: Seen more than 50 times in 10 minutes
Actions:
- Send notification to: #performance-alerts (Slack)
Alert Configuration
Alert Severity Matrix
| Severity | Condition | Response Time | Channels | Example |
|---|---|---|---|---|
| 🔴 Critical | All services down, database unreachable | Immediate (< 5 min) | Slack + Email + SMS | postgresql.service stopped |
| 🟠 High | Single service down, disk >90% | 15 minutes | Slack + Email | ripplecore-app container exited |
| 🟡 Medium | High resource usage, slow endpoints | 1 hour | Email | CPU usage >80% for 5 minutes |
| 🟢 Low | Info logs, deprecation warnings | 24 hour digest | Email | Dependency update available |
Slack Alert Routing
Channels:
- #alerts - All critical and high severity alerts
- #ops-alerts - Infrastructure alerts (CPU, RAM, disk)
- #database-alerts - Database-specific alerts (connections, queries)
- #performance-alerts - Slow endpoints, high latency
- #deployments - Deployment notifications (already configured)
Webhook Configuration:
# In each server's /etc/netdata/health_alarm_notify.conf
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
# Role-based routing
role_recipients_slack[sysadmin]="#ops-alerts"
role_recipients_slack[dba]="#database-alerts"
Email Alert Routing
Distribution Lists:
Critical Alerts:
- admin@your-domain.com
- oncall@your-domain.com
High Priority:
- admin@your-domain.com
- devops@your-domain.com
Medium/Low Priority:
- devops@your-domain.com (daily digest)
UptimeRobot Email Format:
Subject: [UP/DOWN] Production App is down
Monitor: Production App (Health)
URL: https://app.your-domain.com/api/health
Status: Down
Reason: Connection timeout after 30 seconds
Time: 2025-01-23 14:30:00 UTC
View details: https://uptimerobot.com/monitor/12345
Dashboards
Netdata Dashboard
Access: http://server-ip:19999 or https://netdata.your-domain.com (via Traefik)
Key Metrics to Monitor:
System Overview:
- CPU usage (per core and total)
- RAM usage and swap
- Disk I/O and space
- Network traffic
Docker Containers:
- Container CPU and memory usage
- Container network I/O
- Container lifecycle events
PostgreSQL (DB server):
- Active connections
- Queries per second
- Cache hit ratio
- Transaction rate
Redis (DB server):
- Memory usage
- Cache hit rate
- Commands per second
- Evicted keys
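Everything shown on the dashboard is also available over Netdata's REST API, which is handy for ad-hoc checks or for feeding other tools; a small sketch (chart names vary per collector - list them via /api/v1/charts):
# Last 60 seconds of CPU usage, averaged into 6 points
curl -s 'http://10.0.1.2:19999/api/v1/data?chart=system.cpu&after=-60&points=6&format=json'
# List available charts on the DB server (useful for finding the PostgreSQL/Redis chart IDs)
curl -s 'http://10.0.1.3:19999/api/v1/charts' | jq -r '.charts | keys[]' | head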
Custom Grafana Dashboard (Optional)
Prerequisites:
- Install Grafana on CI/CD server
- Install Prometheus for metrics collection
- Configure Netdata to export to Prometheus
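Netdata already exposes its metrics in Prometheus exposition format, so the "export" step is mostly a matter of pointing a Prometheus scrape job at the endpoint below; a quick sanity check that it responds (IP from this guide):
# Netdata serves all metrics in Prometheus text format at this endpoint
curl -s 'http://10.0.1.2:19999/api/v1/allmetrics?format=prometheus' | head -n 20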
Installation:
# Install Grafana
docker run -d \
--name grafana \
-p 3000:3000 \
-v grafana-data:/var/lib/grafana \
grafana/grafana:latest
# Access: http://cicd-server-ip:3000
# Default login: admin/admin
Import Pre-Built Dashboards:
- Node Exporter: Dashboard ID 1860
- Docker Monitoring: Dashboard ID 893
- PostgreSQL: Dashboard ID 9628
- Redis: Dashboard ID 11835
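Those dashboards expect a Prometheus data source. It can be added in the Grafana UI, or scripted against Grafana's HTTP API; a sketch assuming Prometheus runs on the same CI/CD server at :9090 and the default admin password has not yet been changed:
# Register Prometheus as a Grafana data source via the HTTP API
curl -s -u admin:admin -X POST http://localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy"}'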
UptimeRobot Dashboard
Public Status Page: https://stats.uptimerobot.com/your-status-page
Metrics Displayed:
- 24-hour uptime percentage
- 7-day uptime percentage
- 30-day uptime percentage
- Average response time
- Incident history
Private Dashboard: https://uptimerobot.com/dashboard
- Real-time monitor status
- Alert logs
- Response time graphs
- Downtime analysis
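The same uptime numbers can be pulled programmatically for weekly reports; a sketch using the getMonitors endpoint (the custom_uptime_ratios parameter returns 1/7/30-day ratios):
# Fetch 1/7/30-day uptime ratios for all monitors
curl -s -X POST https://api.uptimerobot.com/v2/getMonitors \
  -d "api_key=YOUR_API_KEY&format=json&custom_uptime_ratios=1-7-30" \
  | jq '.monitors[] | {name: .friendly_name, uptime: .custom_uptime_ratio}'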
Sentry Dashboard
Access: https://sentry.io/organizations/your-org/projects/ripplecore-app/
Key Views:
Issues:
- Unresolved errors (prioritize by volume)
- New issues (last 24 hours)
- Regressed issues (previously resolved)
Performance:
- Slowest transactions (>1s)
- Most frequent transactions
- Transaction trends
Releases:
- Error rate by deployment version
- Compare releases for regression detection
Log Management
Docker Log Configuration
File: docker-compose.yml or Dokploy logging config
services:
app:
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "5"
compress: "true"
labels: "service,environment"Viewing Logs
Real-time Logs:
# View live logs for specific container
docker logs ripplecore-app --follow --tail 100
# Filter by severity
docker logs ripplecore-app --follow | grep ERROR
# Search for specific pattern
docker logs ripplecore-app --since 1h | grep "authentication failed"
Aggregate Logs (all containers):
# View all container logs
docker compose logs --follow
# Filter by service
docker compose logs app --follow
Advanced: Loki + Grafana (Optional)
Benefits:
- Centralized log aggregation across all servers
- Query logs with LogQL (like SQL for logs)
- Correlate logs with metrics in Grafana
- Long-term retention (30+ days)
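After bringing up the stack in the Quick Setup below, two quick checks confirm Loki is ready and has started receiving labels from Promtail (a sketch; ports as in that compose file):
# Loki readiness probe - prints "ready" once the service is up
curl -s http://localhost:3100/ready
# Labels Loki has indexed so far (should list the labels from your Promtail config once logs arrive)
curl -s http://localhost:3100/loki/api/v1/labels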
Quick Setup:
# docker-compose.yml (on CI/CD server)
version: '3.8'
services:
loki:
image: grafana/loki:2.9.0
ports:
- "3100:3100"
volumes:
- loki-data:/loki
command: -config.file=/etc/loki/local-config.yaml
promtail:
image: grafana/promtail:2.9.0
volumes:
- /var/log:/var/log
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- ./promtail-config.yml:/etc/promtail/config.yml
command: -config.file=/etc/promtail/config.yml
volumes:
loki-data:
Query Examples (in Grafana):
# All errors in last hour
{job="docker"} |= "ERROR" | json
# Slow API requests
{job="docker", container="ripplecore-app"}
| json
| duration > 1s
# Authentication failures
{job="docker"} |= "authentication failed"
| json
| line_format "{{.userId}} - {{.message}}"
Maintenance & Best Practices
Daily Checks (Automated)
Health Check Script (/root/scripts/daily-health-check.sh):
#!/bin/bash
# Run daily at 9 AM via cron
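# Example crontab entry (crontab -e), assuming the script lives at /root/scripts/daily-health-check.sh:
#   0 9 * * * /root/scripts/daily-health-check.sh >> /var/log/daily-health-check.log 2>&1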
# Check Netdata is running on all servers
servers=("10.0.1.2" "10.0.1.3" "10.0.1.4" "10.0.2.2")
for server in "${servers[@]}"; do
if curl -f http://$server:19999/api/v1/info > /dev/null 2>&1; then
echo "✅ Netdata running on $server"
else
echo "❌ Netdata down on $server" | mail -s "Alert: Netdata Down" admin@your-domain.com
fi
done
# Check UptimeRobot monitors
uptime_api_key="YOUR_API_KEY"
curl -X POST https://api.uptimerobot.com/v2/getMonitors \
-d "api_key=$uptime_api_key&format=json" \
| jq '.monitors[] | select(.status != 2) | {name: .friendly_name, status: .status}'
# Check Sentry error rate
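# A minimal sketch against Sentry's REST API - org/project slugs and threshold are assumptions, adjust to yours.
# Counts unresolved issues seen in the last 24h (first results page only - enough for a rough signal).
sentry_issues=$(curl -s -H "Authorization: Bearer $SENTRY_AUTH_TOKEN" \
  "https://sentry.io/api/0/projects/your-org/ripplecore-app/issues/?query=is:unresolved&statsPeriod=24h" \
  | jq 'length')
if [ "$sentry_issues" -gt 20 ]; then
  echo "$sentry_issues unresolved Sentry issues in the last 24h" \
    | mail -s "Alert: Elevated Sentry error volume" admin@your-domain.com
fi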
# (requires a Sentry auth token exported as SENTRY_AUTH_TOKEN)
Weekly Reviews
Metrics to Review:
- Uptime percentage (target: 99.5%+)
- Average response time (target: <200ms)
- Error rate (target: <0.1%)
- Disk space growth trend
- Resource usage trends (CPU, RAM)
Action Items:
- Review and close resolved Sentry issues
- Archive old logs (>7 days)
- Update alert thresholds if needed
- Review alert false positives
Monthly Audits
Comprehensive Review:
- Analyze downtime incidents (root cause, prevention)
- Review alert response times
- Update alert routing if team changes
- Test disaster recovery procedures
- Verify all monitoring agents are updated
Troubleshooting Monitoring
Issue: Netdata Not Starting
Symptoms:
systemctl status netdata
# Output: Failed to start netdata
Solution:
# Check logs
journalctl -u netdata -n 50
# Common issues:
# 1. Port 19999 already in use
sudo lsof -i :19999
sudo kill <PID>
# 2. Permissions issue
sudo chown -R netdata:netdata /var/lib/netdata
sudo chown -R netdata:netdata /var/cache/netdata
# Restart
sudo systemctl restart netdata
Issue: Alerts Not Firing
Symptoms:
- High CPU but no alert received
Debugging:
# Check alert configuration loaded
curl http://localhost:19999/api/v1/alarms | jq '.alarms[] | select(.name == "high_cpu_usage")'
# Verify Slack webhook
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H 'Content-Type: application/json' \
-d '{"text": "Test alert"}'
# Check notification config
cat /etc/netdata/health_alarm_notify.conf | grep SLACK
Issue: UptimeRobot Showing Down (but site is up)
Possible Causes:
- Health endpoint responding slowly (>30s timeout)
- Keyword "ok" not found in response
- SSL certificate issue
Solution:
# Test health endpoint manually
time curl -v https://app.your-domain.com/api/health
# Verify response contains "ok"
curl -s https://app.your-domain.com/api/health | grep -o '"status":"ok"'
# Check SSL certificate
curl -vI https://app.your-domain.com 2>&1 | grep "SSL certificate"
Related Documentation
- Infrastructure Overview: See ARCHITECTURE.md
- CI/CD Pipeline: See CI_CD_PIPELINE.md
- Backup & DR: See BACKUP_RECOVERY.md
Document Version: 1.0
Last Updated: 2025-01-23
Review Cycle: Quarterly