Troubleshooting · intermediate · admin, member

Bot Crashed — What Happens Now?

Learn how Botonom handles unexpected agent crashes, auto-recovery, and what you should check first.

4 min read · 12,840 views · Updated 2026-02-10

This article covers what happens when an agent crashes, how the auto-recovery process works, what you should check first, and how to prevent future crashes.

What happens when an agent crashes?

When an agent encounters an unrecoverable error, Botonom's watchdog service detects the failure within 5 seconds. The system immediately marks the agent as RECOVERING and begins the auto-restart sequence.

During this window (typically 10–30 seconds), any in-flight tasks are queued and retried once the agent is back online. No data is lost unless recovery falls through to a cold restart, in which case in-flight task data may be lost.

The watchdog pings every agent once per second. If three consecutive pings fail, the recovery pipeline kicks in automatically.
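
The three-consecutive-failures rule can be illustrated with a small sketch. This is not Botonom's watchdog code; the function name `first_trigger_index` and the boolean ping representation are assumptions made for illustration. It only shows how a streak counter resets on any successful ping and trips on the third failure in a row.

```python
from typing import Iterable, Optional

FAILURE_THRESHOLD = 3  # consecutive failed pings, as stated in the article


def first_trigger_index(ping_results: Iterable[bool],
                        threshold: int = FAILURE_THRESHOLD) -> Optional[int]:
    """Return the 0-based index of the ping that trips recovery, or None.

    Hypothetical helper: True means the ping succeeded, False means it failed.
    """
    consecutive_failures = 0
    for index, ok in enumerate(ping_results):
        if ok:
            consecutive_failures = 0  # any success resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return index
    return None


if __name__ == "__main__":
    # One ping per second: an isolated failure, then three failures in a row.
    samples = [True, False, True, False, False, False]
    print(first_trigger_index(samples))
```

With one ping per second, an agent that stops responding trips the rule on the third missed ping, which fits inside the 5-second detection window described above.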

Auto-recovery process

Botonom uses a three-stage recovery pipeline:

1. Health check

The platform pings the agent's internal endpoints to determine whether the crash was transient.

2. Warm restart

If the health check fails, a fresh instance is spun up from the last checkpoint.

3. Cold restart

As a last resort, the agent is rebuilt from its base image. This takes up to 60 seconds but guarantees a clean state.

Stage	Duration	Data Loss
Health check	1–5 s	None
Warm restart	10–30 s	None
Cold restart	30–60 s	Possible (in-flight only)

You can monitor each stage in real time from Dashboard → Agents → Recovery Log.
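
The escalation order of the three stages can be sketched as a simple fall-through loop. This is a minimal illustration, not the platform's implementation: the `Stage` class, the `run_recovery` function, and the `RUNNING` and `FAILED` state names are hypothetical (the article itself only names the `RECOVERING` state).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    attempt: Callable[[], bool]  # True if the agent is healthy after this stage


def run_recovery(stages: List[Stage]) -> Tuple[str, List[str]]:
    """Try each stage in order and stop at the first one that succeeds.

    Returns (final_state, names_of_stages_tried). State names are assumptions.
    """
    tried: List[str] = []
    for stage in stages:
        tried.append(stage.name)
        if stage.attempt():
            return "RUNNING", tried
    return "FAILED", tried


if __name__ == "__main__":
    # Simulated outcome: the health check fails, the warm restart succeeds,
    # so the cold restart is never attempted.
    pipeline = [
        Stage("health_check", lambda: False),
        Stage("warm_restart", lambda: True),
        Stage("cold_restart", lambda: True),
    ]
    print(run_recovery(pipeline))
```

Each later stage is slower and more disruptive, so the loop only moves on when the cheaper stage has not brought the agent back.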

What should you check first?

Open the Agent Dashboard → Logs tab. Look for any red entries marked FATAL or PANIC. The most common causes are:

Token exhaustion — your shared pool ran out mid-task.
Skill conflict — two skills tried to access the same resource.
External API timeout — a third-party service didn't respond.

If you see OOM (Out of Memory), consider upgrading to a higher tier for more resources.

# Quick way to check the last 50 log entries
botonom logs --agent <agent-id> --tail 50 --level error

If you see repeated PANIC entries, do not restart the agent manually — contact support so we can investigate the root cause.
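
If you export log lines to a file, the triage guidance above can be sketched as a small script. The log line format, the `triage` function, the returned labels, and the choice of "two or more" as the meaning of "repeated" are all assumptions for illustration; the article does not specify Botonom's log format.

```python
import re
from collections import Counter
from typing import Iterable

MARKERS = ("FATAL", "PANIC", "OOM")  # markers named in the article


def triage(log_lines: Iterable[str]) -> str:
    """Map log lines to a suggested next step (hypothetical labels)."""
    counts: Counter = Counter()
    for line in log_lines:
        for marker in MARKERS:
            # Match the marker as a whole word anywhere in the line.
            if re.search(rf"\b{marker}\b", line):
                counts[marker] += 1
    if counts["PANIC"] >= 2:  # assumption: "repeated" means two or more
        return "contact-support"
    if counts["OOM"] > 0:
        return "consider-higher-tier"
    if counts["FATAL"] or counts["PANIC"]:
        return "inspect-logs"
    return "no-action"


if __name__ == "__main__":
    sample = [
        "2026-02-10T09:00:01 PANIC worker stopped",
        "2026-02-10T09:00:07 PANIC worker stopped",
    ]
    print(triage(sample))
```

Repeated PANIC entries are checked first because the guidance for them (do not restart manually) overrides the other suggestions.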

Preventing future crashes

Enable Proactive Monitoring in Settings → Alerts. This sends you a notification when an agent's memory or CPU usage exceeds 80%.

You can also set up automatic scaling rules:

{
  "rule": "auto-scale",
  "trigger": {
    "metric": "memory_usage",
    "threshold": 80,
    "unit": "percent"
  },
  "action": {
    "type": "increase_buffer",
    "amount": "25%"
  }
}
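
To make the rule's fields concrete, here is a hedged sketch of how such a rule might be evaluated against a metrics sample. The functions `should_trigger` and `scaled_buffer`, the strict "greater than" comparison, and the reading of `increase_buffer` as multiplying a buffer size by the given percentage are assumptions; the article does not define these semantics.

```python
import json

# The rule from the article, embedded so the example is self-contained.
RULE_JSON = """
{
  "rule": "auto-scale",
  "trigger": {"metric": "memory_usage", "threshold": 80, "unit": "percent"},
  "action": {"type": "increase_buffer", "amount": "25%"}
}
"""


def should_trigger(rule: dict, metrics: dict) -> bool:
    """True if the sampled metric exceeds the rule's threshold (assumed strict)."""
    trigger = rule["trigger"]
    value = metrics.get(trigger["metric"])
    return value is not None and value > trigger["threshold"]


def scaled_buffer(rule: dict, current_buffer: float) -> float:
    """Apply the action's percentage increase to a buffer size (assumed meaning)."""
    percent = float(rule["action"]["amount"].rstrip("%"))
    return current_buffer * (1 + percent / 100)


if __name__ == "__main__":
    rule = json.loads(RULE_JSON)
    sample = {"memory_usage": 85}  # hypothetical metrics sample, in percent
    if should_trigger(rule, sample):
        print(scaled_buffer(rule, 400))
```

A sample at exactly 80 does not fire under this reading, since the alert text says the threshold must be exceeded.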

Tags: crash, recovery, troubleshooting

