Failures
Inspect errors, retries, and failed execution branches.
Overview
The Failures view groups all traces that need attention into three categories: failed traces, traces with retries, and slow requests. Data is fetched from the GET /failures endpoint, which applies server-side filtering and returns pre-categorized results.
Failure Categories
| Category | Definition | Filter Logic |
|---|---|---|
| Failed | trace.status == failed | Any trace where status is explicitly failed |
| Retries | trace.retry_count > 0 | Traces with one or more retried steps |
| Slow requests | trace.slow_request == true | Traces with total latency >= 1500ms |
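The filter logic in the table can be sketched as a small classifier. This is an illustration only: the helper name `categorize_trace`, the category labels it returns, and the flat dict shape of a trace are assumptions made for the example. The actual filtering happens server-side behind GET /failures.

```python
from typing import Any


def categorize_trace(trace: dict[str, Any]) -> list[str]:
    """Return every Failures category a trace matches (hypothetical helper)."""
    categories: list[str] = []
    # Failed: status is explicitly "failed".
    if trace.get("status") == "failed":
        categories.append("failed")
    # Retries: one or more retried steps.
    if trace.get("retry_count", 0) > 0:
        categories.append("retries")
    # Slow requests: the slow_request flag is set.
    if trace.get("slow_request") is True:
        categories.append("slow_requests")
    return categories
```

Under these definitions a single trace can match more than one category. Whether the endpoint lists such a trace under each category is not specified here.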
How Failures Are Detected
Trace status is inferred through a multi-step process in normalize_trace_document():
- If the payload explicitly sets status to a valid value (success/warning/failed), use it
- If any step has success=False, status becomes failed
- If failure_reason is set or retry_count > 0, status becomes warning
- Otherwise, status is success
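The precedence of these checks can be sketched as follows. This is not the body of normalize_trace_document(): the function name `infer_status` and the assumption that steps live under a `steps` key are hypothetical, chosen only to make the order of the rules concrete.

```python
from typing import Any

# Statuses the text above names as valid.
VALID_STATUSES = ("success", "warning", "failed")


def infer_status(payload: dict[str, Any]) -> str:
    """Apply the four rules in order (hypothetical sketch)."""
    # 1. An explicit, valid status wins over everything else.
    explicit = payload.get("status")
    if explicit in VALID_STATUSES:
        return explicit
    # 2. Any step with success=False marks the trace as failed.
    steps = payload.get("steps", [])  # assumed key name
    if any(step.get("success") is False for step in steps):
        return "failed"
    # 3. A failure reason or at least one retry downgrades to warning.
    if payload.get("failure_reason") or payload.get("retry_count", 0) > 0:
        return "warning"
    # 4. Nothing went wrong.
    return "success"
```

Because the explicit status is checked first, a payload that sets status to warning keeps that status even if one of its steps has success=False.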
Retry Detection
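Retries are detected by _infer_retry_count(), which counts duplicate tool names in the step list. Every occurrence of a tool name after its first counts as one retry, so a tool that appears three times contributes two retries. Steps without a tool_name are grouped under the default name "agent":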
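```python
from collections import defaultdict
from typing import Any


def _infer_retry_count(steps: list[dict[str, Any]]) -> int:
    retries = 0
    tool_attempts: defaultdict[str, int] = defaultdict(int)
    for step in steps:
        # Steps without a tool_name are grouped under "agent".
        tool_name = step.get("tool_name", "agent")
        tool_attempts[tool_name] += 1
        # Every occurrence after the first counts as one retry.
        if tool_attempts[tool_name] > 1:
            retries += 1
    return retries
```
Slow Request Threshold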
The slow request threshold is 1500ms, defined as SLOW_TRACE_THRESHOLD_MS. If a trace's total latency equals or exceeds this value, slow_request is set to true. The same threshold is used by the CLI to color-code latency values (green below 900ms, yellow from 900ms to 1499ms, red at 1500ms+).
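The threshold and color bands described above can be expressed as a short sketch. Only SLOW_TRACE_THRESHOLD_MS is named in the text. The constant `WARN_LATENCY_MS` and the functions `is_slow_request` and `latency_color` are hypothetical names introduced for this example and may not match the CLI's code.

```python
SLOW_TRACE_THRESHOLD_MS = 1500  # value stated in the text
WARN_LATENCY_MS = 900  # hypothetical name for the yellow boundary


def is_slow_request(total_latency_ms: float) -> bool:
    """True when total latency equals or exceeds the slow threshold."""
    return total_latency_ms >= SLOW_TRACE_THRESHOLD_MS


def latency_color(total_latency_ms: float) -> str:
    """Map a latency to the color band described above."""
    if total_latency_ms >= SLOW_TRACE_THRESHOLD_MS:
        return "red"
    if total_latency_ms >= WARN_LATENCY_MS:
        return "yellow"
    return "green"
```

Both boundaries are inclusive on the upper band: exactly 900ms is yellow and exactly 1500ms is red and slow.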
Troubleshooting Failures
| Observation | Common Cause | Action |
|---|---|---|
| Status is failed | Exception in traced function or step with success=False | Inspect the failure_reason field and the step detail |
| Status is warning | Retries occurred during execution, or failure_reason is set without a failed step | Review which steps were retried and check failure_reason and the error output |
| slow_request is true | Total latency >= 1500ms | Optimize slow steps or increase the threshold |
| Retry count is unexpected | Duplicate tool names in step list | Check for unintended repeated tool invocations |
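When the retry count is unexpected, it can help to see which tool names are responsible. The helper below is a hypothetical diagnostic, not part of the product. It assumes the same step shape that _infer_retry_count() reads (a `tool_name` key defaulting to "agent") and reports how many extra occurrences each repeated name contributes.

```python
from collections import Counter
from typing import Any


def repeated_tools(steps: list[dict[str, Any]]) -> dict[str, int]:
    """Map each repeated tool name to its number of extra occurrences."""
    counts = Counter(step.get("tool_name", "agent") for step in steps)
    # Occurrences beyond the first are what the retry counter adds up.
    return {name: n - 1 for name, n in counts.items() if n > 1}


steps = [
    {"tool_name": "search"},
    {"tool_name": "search"},
    {"tool_name": "fetch"},
    {},  # no tool_name, counted as "agent"
    {},  # second "agent" step, counted as a retry
]
print(repeated_tools(steps))  # {'search': 1, 'agent': 1}
```

The values sum to the retry count for the same step list. In this example two steps without a tool_name account for one of the two retries, which is a common source of a surprising count.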