"Monitoring" means different things to different people. For most self-hosted Telegram bot builders, it's a cron job that pings a healthcheck URL every minute. That catches crashes — eventually. But it doesn't fix them, doesn't alert your users, and doesn't tell you why the crash happened. Here's what real automated monitoring for self-hosted Telegram bots actually looks like.
A self-hosted Telegram bot (like one running on OpenClaw) has several distinct layers that can fail independently: the gateway process itself, the bot's connection to Telegram's API, and the host resources (disk, memory) the process depends on.
Most simple monitoring tools only check if the process is running. That misses a process that is alive but no longer talking to Telegram, resource exhaustion building toward a crash, and crash loops where the process dies and restarts faster than the check interval.
Layer 1 is the most basic. It watches whether the gateway process exists and whether systemd considers the unit active. This catches outright crashes.
You can set this up yourself with cron + pgrep, or let systemd restart the unit automatically with Restart=always. The problem: if the process crashes and restarts quickly, you never know it happened.
# crontab -e
*/2 * * * * pgrep -f openclaw-gateway || systemctl restart openclaw-gateway
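One way to catch those silent restart loops is systemd's own restart counter. A small sketch, assuming a reasonably recent systemd (the NRestarts property landed in v235); the state file path is arbitrary:

#!/bin/bash
# Alert when systemd's restart counter has grown since the last check,
# which exposes crash loops that pgrep alone never sees.
COUNT=$(systemctl show -p NRestarts --value openclaw-gateway)
LAST=$(cat /var/run/oc-restarts 2>/dev/null || echo 0)
if [ "$COUNT" -gt "$LAST" ]; then
  echo "⚠️ gateway restarted $((COUNT - LAST)) time(s) since last check"
fi
echo "$COUNT" > /var/run/oc-restarts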
Process running ≠ bot working. A real health check sends a test message through the bot's actual Telegram connection and verifies a response. This catches the "running but silent" failure mode.
This is significantly harder to implement yourself — you need a separate process that can talk to Telegram and your bot simultaneously, keep state between checks, and avoid false positives from Telegram API latency.
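A partial sketch of the idea in bash follows, with one big caveat: Telegram bots cannot receive messages from other bots, so a pure Bot API checker can only verify the Telegram side. The heartbeat file is a loud assumption: it presumes the gateway touches a file on every poll cycle, which you would have to wire up yourself. A true end-to-end probe needs a user-account client (MTProto).

#!/bin/bash
# Partial Layer 2 check: API reachability plus a liveness heartbeat.
# BOT_TOKEN and the heartbeat path are hypothetical placeholders.
BOT_TOKEN="123456:replace-me"
# 1. Is Telegram's Bot API reachable and the token still valid?
OK=$(curl -s --max-time 10 "https://api.telegram.org/bot${BOT_TOKEN}/getMe" | grep -c '"ok":true')
# 2. Has the gateway processed a poll cycle in the last 2 minutes?
FRESH=$(find /run/openclaw/heartbeat -mmin -2 2>/dev/null | wc -l)
if [ "$OK" -eq 0 ] || [ "$FRESH" -eq 0 ]; then
  echo "running but silent: restarting gateway"
  systemctl restart openclaw-gateway
fi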
Layer 3 is where monitoring gets genuinely useful. Instead of reacting to failures, you watch for warning signs: memory creeping up toward an out-of-memory kill, the disk filling with logs and caches, Telegram API latency climbing, and the gateway restarting more often than usual.
Catching these before the crash means your users never notice.
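Memory creep, for instance, is easier to catch as a trend than as a single reading. A rough sketch, assuming a 5-minute cron cadence; the log path and the 50% growth threshold are arbitrary choices, not OpenClaw defaults:

#!/bin/bash
# Sample the gateway's memory each run and compare against roughly an
# hour ago, so a slow leak is flagged before the OOM killer fires.
LOG=/var/log/oc-rss.log
PID=$(pgrep -of openclaw-gateway)
[ -z "$PID" ] && exit 0
RSS=$(ps -o rss= -p "$PID" | tr -d ' ')
echo "$(date +%s) $RSS" >> "$LOG"
# With a */5 cron, 12 samples back is about one hour ago.
OLD=$(tail -n 12 "$LOG" | head -n 1 | awk '{print $2}')
if [ -n "$OLD" ] && [ "$RSS" -gt $((OLD * 3 / 2)) ]; then
  echo "⚠️ gateway RSS grew >50% in an hour (${OLD}KB -> ${RSS}KB)"
fi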
The distinction between repairing and merely restarting matters because most crashes have a specific cause. Blindly restarting without fixing that cause gives you 30 minutes of uptime before the same crash happens again.
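A homegrown version of cause-aware repair (not Mechanic's actual logic) might match recent journal output against known failure signatures; the log patterns here are illustrative, not real OpenClaw messages:

#!/bin/bash
# Pick the repair that matches the crash cause instead of blindly restarting.
RECENT=$(journalctl -u openclaw-gateway -n 50 --no-pager)
case "$RECENT" in
  *ENOSPC*)
    # Disk full: a bare restart would crash again, so free space first
    journalctl --vacuum-size=100M
    systemctl restart openclaw-gateway ;;
  *"heap out of memory"*)
    # Node heap exhausted: a restart reclaims the leaked memory
    systemctl restart openclaw-gateway ;;
  *"401 Unauthorized"*)
    # Revoked bot token: no restart will fix this, escalate instead
    echo "token rejected: manual fix needed" ;;
  *)
    systemctl restart openclaw-gateway ;;
esac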
Here's the monitoring + repair flow Mechanic uses for OpenClaw gateways:
Good monitoring includes a notification strategy. A common mistake is alerting your users too aggressively — if every 30-second blip sends a Telegram message, users start ignoring your bot entirely.
What actually works: alert only after a failure persists (say, three consecutive failed checks), collapse repeated alerts for the same incident into one, and send a single recovery message when the bot comes back. The sketch below shows the debounce half of that.
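A minimal version of that debounce, with a placeholder notify function standing in for your real channel (email, an admin Telegram chat); the state file path is arbitrary:

#!/bin/bash
# Debounced alerting: notify once after 3 consecutive failed checks,
# and once more on recovery, instead of on every blip.
STATE=/var/run/oc-alert.count
notify() { echo "$1"; }  # replace with your real notifier
FAILS=$(cat "$STATE" 2>/dev/null || echo 0)
if ! systemctl is-active --quiet openclaw-gateway; then
  FAILS=$((FAILS + 1))
  echo "$FAILS" > "$STATE"
  # Exactly the third failure: alert once, not again on later checks
  [ "$FAILS" -eq 3 ] && notify "openclaw gateway down"
else
  # Coming back after an alert fired: send one recovery message
  [ "$FAILS" -ge 3 ] && notify "openclaw gateway recovered"
  echo 0 > "$STATE"
fi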
If you want to build this yourself, here's a production-grade starting point:
[Unit]
Description=OpenClaw Gateway
After=network.target
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
Type=simple
ExecStart=/usr/bin/node /path/to/openclaw gateway start
Restart=always
RestartSec=5s
# Sends SIGABRT if the gateway doesn't ping within 60s. This only
# works if the gateway sends sd_notify watchdog pings; if it doesn't,
# drop these two lines and rely on Restart=always.
WatchdogSec=60s
NotifyAccess=main

[Install]
WantedBy=multi-user.target
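Save it as /etc/systemd/system/openclaw-gateway.service (the unit name the systemctl commands above already assume), then:

systemctl daemon-reload
systemctl enable --now openclaw-gateway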
#!/bin/bash
# /usr/local/bin/oc-health-check.sh
# Percentage of the root filesystem in use, e.g. "85"
DISK_USED=$(df / | tail -1 | awk '{print $5}' | tr -d '%')
# Available memory in MB (the "available" column of free's Mem: row)
MEM_FREE=$(free -m | awk '/^Mem:/{print $7}')
if [ "$DISK_USED" -gt 85 ]; then
echo "⚠️ Disk at ${DISK_USED}% — clearing npm cache and old logs"
npm cache clean --force 2>/dev/null
journalctl --vacuum-size=100M 2>/dev/null
fi
if [ "$MEM_FREE" -lt 200 ]; then
echo "⚠️ Memory low (${MEM_FREE}MB free) — restarting gateway"
systemctl restart openclaw-gateway
fi
# crontab entry:
# */10 * * * * /usr/local/bin/oc-health-check.sh >> /var/log/oc-health.log 2>&1

This gets you to Layer 1 monitoring with basic resource protection. Layers 2 and 3 require significantly more infrastructure: an agent on the machine, a hub to receive events, and logic to correlate signals and decide which repair to attempt.