Skip to main content

Heartbeat & Monitoring

How Heartbeats Work

Each enrolled agent sends a heartbeat to the backend every 60 seconds (±5s jitter to prevent thundering herd). The heartbeat includes:

  • System metrics — CPU usage, memory usage, disk usage, uptime
  • Agent version — Current binary version
  • Hostname and OS information

Agent Status

StatusMeaning
OnlineHeartbeat received within the last 2 minutes
OfflineNo heartbeat for more than 2 minutes

Agent Management UI

The Agents page displays:

  • Agent hostname and IP
  • Operating system and architecture
  • Current status (online/offline)
  • Last heartbeat timestamp
  • CPU, memory, and disk usage meters
  • Agent version
  • Custom tags

Agent heartbeat — CPU usage, memory usage, and disk free charts over 7 days

Stale Task Detection

If an agent goes offline while executing a task, the task is automatically marked as failed with a stale detection message. This prevents tasks from hanging indefinitely.

Disconnect Reason Reporting

When an agent reconnects after being offline, it reports why it was disconnected. The reason appears in the Event Log tab on the agent detail page, next to the "Came Online" event.

ReasonIconMeaning
Service RestartAgent process crashed and the OS service manager restarted it
Machine RebootThe host machine was rebooted
Network Adapter Disabled📵Network interface was disabled or disconnected
Server Unreachable🖥Agent could reach the network but the server refused connection
DNS Failure🔍DNS resolution failed — server hostname could not be resolved
Network Unreachable📵No network route to the server
Connection TimeoutServer did not respond within the timeout period
TLS/Certificate Error🔒TLS handshake failed (expired certificate, untrusted CA)
Disk Pressure (Crash)💾Agent crashed with critically low disk space (< 100 MB free)
Memory Pressure (Crash)🧠Agent crashed with critically high memory usage (> 90% of total)
Network Recovery📶Generic network issue resolved (none of the above specific causes matched)
Update RestartAgent was restarted after applying a self-update
Unknown?Reason could not be determined (typically agents running an older version)

Agent event log — lifecycle events with disconnect reasons, task completions, and version updates

Reasons are detected via a dual-layer system: agent-side error classification (HTTP error type, network adapter state checks per platform) and backend-side last-known metrics correlation. Process restarts with high memory or low disk at the time of the last heartbeat are correlated with the last-known metrics to infer crash causes (e.g., a service_restart following a heartbeat with > 90% memory usage is tagged as memory_pressure_crash).

Adaptive Heartbeat Backoff

During extended server outages, the agent automatically reduces its heartbeat and polling frequency to minimize wasted network requests:

Consecutive FailuresHeartbeat IntervalTask Poll Interval
0–5Normal (60s)Normal (30s)
6–105 minutes5 minutes
11–2015 minutes15 minutes
21+30 minutes (cap)30 minutes (cap)

On the first successful heartbeat after an outage, both intervals snap back to normal immediately. The agent logs the recovery: connectivity recovered after N consecutive failures, resetting intervals.

Agent Health Score

Each agent is assigned a health score (0–100) visible in the Agents list and on the agent detail page. The score is computed from three metrics over the past 7 days:

ComponentWeightDescription
Heartbeat Consistency40%Ratio of received heartbeats to expected heartbeats
Task Success Rate30%Completed tasks ÷ (completed + failed tasks)
Stability30%Fewer disconnections = higher score (10+ disconnects in 7 days = 0% stability)

The score is color-coded in the UI:

  • Green (80–100) — Healthy agent
  • Amber (50–79) — Some reliability issues
  • Red (0–49) — Unreliable, investigate

The fleet average health score is also shown on the Endpoints Dashboard.