Architecture

Understand TraceLLM internals and system boundaries.

System Architecture

TraceLLM is built as a four-layer observability stack that runs entirely on your own infrastructure. Every component — CLI, SDK, API server, database, WebSocket broker, and dashboard — can operate on a single machine with zero cloud dependencies.

System Layout

Architecture diagram
┌────────────────────────────────────────────────────────────────┐
│                      User / Application                        │
│  tracellm trace "..."   │   @trace decorator   │   SDK code    │
└───────────────────────────┬────────────────────────────────────┘
                            │  trace payload
                            ▼
┌────────────────────────────────────────────────────────────┐
│                      FastAPI Backend                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │  REST API    │  │  WebSocket   │  │  Connection      │  │
│  │  /traces     │  │  /ws         │  │  Manager         │  │
│  │  /analytics  │  │  broadcast   │  │  (asyncio.Lock)  │  │
│  │  /failures   │  │ trace.created│  │  auto-prune      │  │
│  │  /projects   │  │  system.conn.│  │  stale conns     │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────────────┘  │
│         │                 │                                │
└─────────┼─────────────────┼────────────────────────────────┘
          │                 │
          ▼                 ▼
┌───────────────────┐  ┌─────────────────────────────────┐
│    MongoDB        │  │        WebSocket Clients        │
│  ┌──────────────┐ │  │  ┌──────────┐  ┌──────────────┐ │
│  │ traces       │ │  │  │Dashboard │  │ CLI Monitor  │ │
│  │ projects     │ │  │  │(Next.js) │  │ (tracellm    │ │
│  │ api_keys     │ │  │  │port 3000 │  │  monitor)    │ │
│  └──────────────┘ │  │  └──────────┘  └──────────────┘ │
│  Indexed: trace_id│  └─────────────────────────────────┘
│  created_at,      │
│  status, model    │
└───────────────────┘

Trace Execution Flow

Every trace passes through six stages, from instrumentation to replay:

End-to-end trace flow
User
  │
  ├── 1. Instrumentation ──► @trace decorator or CLI wraps function call
  │                           captures start time, project context
  ▼
SDK/Tracer
  │
  ├── 2. Trace Capture ──► build_trace_payload() assembles metadata,
  │                           steps, timing, status, retries
  ▼
Trace Payload
  │
  ├── 3. Persist ──► normalize_trace_document() validates against schemas,
  │                     inserts into MongoDB "traces" collection
  ▼
MongoDB
  │
  ├── 4. Broadcast ──► ConnectionManager.broadcast() sends trace.created
  │                       event to all connected WebSocket clients
  ▼
WebSocket Layer
  │
  ├── 5. Consume ──► Dashboard receives event, updates trace list in
  │                     real time without polling
  ▼
Dashboard
  │
  ├── 6. Replay ──► CLI replay fetches trace from MongoDB, renders
  │                    step-by-step execution tree in terminal
  ▼
Replay

Local-First Architecture

TraceLLM is designed around a local-first philosophy. Every component — SDK, API, database, WebSocket broker, and dashboard — runs on your own infrastructure. There is no SaaS backend, no telemetry service, and no data egress.

No Cloud Dependencies

MongoDB runs locally (or on your own Atlas cluster). The API, WebSocket server, and dashboard are local processes. With a local MongoDB instance, no data leaves your network.

Single-Command Stack

tracellm start boots the entire stack: FastAPI on port 8000, the WebSocket endpoint at /ws, and automatic MongoDB detection. No Docker Compose required.

Offline-Capable

The @trace decorator and CLI work without MongoDB. Traces are still finalized in memory, and persistence degrades gracefully when the database is unavailable.

MIT Licensed

Fully open-source. No paid tiers, no usage limits, no vendor lock-in. You own all your data and infrastructure.

System Components

The system is composed of six components that work together to capture, store, stream, and visualize trace data:

1

CLI (tracellm)

The Typer-based command-line interface is the primary user-facing entry point. It provides commands for starting the stack (start), running traces (trace), replaying executions (replay), monitoring live events (monitor), and exporting data (export). When invoked without arguments, the CLI renders a Rich TUI command palette. The monitor command renders a full-screen live dashboard (like htop for AI) fed by the WebSocket.
2

SDK (@trace decorator)

The Python SDK provides the @trace decorator for automatic instrumentation of any function. It captures prompts, responses, latency, token usage, tool calls, retries, and errors. The decorator supports both sync and async functions via inspect.iscoroutinefunction() auto-detection, uses contextvars.ContextVar for nested step collection, and includes integrations for OpenAI, Groq, and LangChain.
3

API Server (FastAPI)

The REST API server is built with FastAPI and runs on port 8000. It exposes endpoints for traces (GET /traces, GET /traces/{id}), analytics (GET /analytics), failures (GET /failures), projects (GET/POST /projects, GET /api-keys), and health (GET /). All responses are JSON with Pydantic v2 validation. CORS is configured to allow all origins for local development.
4

MongoDB (Motor)

MongoDB serves as the persistent store for all trace documents, project records, and API keys. The async connection is managed via the Motor driver, which integrates natively with FastAPI's event loop. The CLI bridges sync code to Motor through a persistent event loop in db.py with automatic event-loop detection. Three collections are used: traces, projects, and api_keys, each with appropriate indexes.
5

WebSocket Layer

A lightweight WebSocket server is embedded in the FastAPI app at /ws. The ConnectionManager class manages all active connections and guards them with an asyncio.Lock, so that concurrent coroutines cannot modify the connection list at the same time (asyncio.Lock serializes coroutines on one event loop; it is not a thread lock). When a trace is persisted, a trace.created event is broadcast to all connected clients. Stale connections are automatically pruned during broadcast. The dashboard and CLI monitor both subscribe to this channel for real-time updates.
6

Dashboard (Next.js)

The web dashboard is a Next.js 16 application running on port 3000. It connects to the backend REST API for initial data loads and subscribes to the WebSocket for real-time trace events. The dashboard provides five views: Overview (summary metrics), Traces (browse and inspect), Analytics (charts and breakdowns), Live Logs (real-time event stream), and Failures (categorized issues). The ObservabilityProvider React context manages the WebSocket connection and propagates events to all child components.
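
The broadcast-and-prune behavior described for the WebSocket Layer (component 5) can be sketched as follows. This is a simplified illustration, not the contents of socket.py: the method bodies, the event shape, and the FakeSocket test double are assumptions. The only call expected of a real client here is send_json(), which Starlette's WebSocket class provides.

```python
import asyncio


class ConnectionManager:
    """Sketch: tracks sockets and prunes those that fail during broadcast."""

    def __init__(self):
        self._connections = set()
        self._lock = asyncio.Lock()

    async def connect(self, ws):
        async with self._lock:
            self._connections.add(ws)

    async def broadcast(self, event):
        # Snapshot under the lock so that a connect() arriving while we
        # await a send cannot mutate the set we are iterating over.
        async with self._lock:
            targets = list(self._connections)

        stale = []
        for ws in targets:
            try:
                await ws.send_json(event)
            except Exception:
                # Any send failure marks the connection as stale.
                stale.append(ws)

        if stale:
            async with self._lock:
                self._connections.difference_update(stale)
        return len(targets) - len(stale)


class FakeSocket:
    """Hypothetical test double standing in for a real WebSocket."""

    def __init__(self, alive=True):
        self.alive = alive
        self.received = []

    async def send_json(self, event):
        if not self.alive:
            raise ConnectionError("client went away")
        self.received.append(event)


async def demo():
    manager = ConnectionManager()
    live, dead = FakeSocket(), FakeSocket(alive=False)
    await manager.connect(live)
    await manager.connect(dead)
    # The event shape below is an assumption for illustration.
    event = {"type": "trace.created", "trace_id": "t1"}
    delivered = await manager.broadcast(event)
    return delivered, live.received, len(manager._connections)
```

Running asyncio.run(demo()) delivers the event to the live client and drops the dead one from the connection set. Sending outside the lock keeps a slow client from blocking new connections, at the cost of a second lock acquisition when pruning.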

Data Flow

Complete data flow
@trace decorator / CLI
       │
       │  1. Record start time (datetime.utcnow + perf_counter)
       │  2. Resolve project context (API key lookup in MongoDB)
       │  3. Set ContextVar for step collection
       ▼
Function execution (sync or async)
       │
       │  @trace_tool and integration calls append steps
       │  to the parent's ContextVar.collected_steps list
       ▼
finally block
       │
       │  1. Compute latency = perf_counter delta
       │  2. build_trace_payload() — assemble full trace dict
       │  3. finalize_trace() — persist + broadcast + render
       ▼
save_trace() in trace_service.py
       │
       ├──► normalize_trace_document()
       │      ├── Coerce timestamps to UTC
       │      ├── Normalize step fields
       │      ├── Infer retry count (duplicate tool names)
       │      ├── Infer status (from steps, failure_reason, retries)
       │      └── Validate against TraceSchema + StepSchema
       │
       ├──► collection.insert_one(document)  ──►  MongoDB "traces"
       │
       └──► manager.broadcast(trace.created) ──►  WebSocket clients
              │
              ├── Dashboard: updates trace list + live logs
              ├── CLI monitor: refreshes live dashboard
              └── Other clients: any WS subscriber
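
The normalization steps in the flow above can be approximated with a short sketch. It is not the real normalize_trace_document(): the field names (tool, retries, failure_reason, created_at), the status values, and the rule that naive timestamps are treated as UTC are assumptions, and schema validation against TraceSchema and StepSchema is omitted.

```python
from collections import Counter
from datetime import datetime, timezone


def to_utc(value):
    """Accept an ISO 8601 string or a datetime; assume naive values are UTC."""
    dt = datetime.fromisoformat(value) if isinstance(value, str) else value
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def infer_retries(steps):
    """Each tool name seen n times contributes n - 1 retries."""
    counts = Counter(step["tool"] for step in steps if step.get("tool"))
    return sum(n - 1 for n in counts.values())


def infer_status(steps, failure_reason, retries):
    """Derive a status from step results, failure reason, and retries."""
    if failure_reason or any(s.get("status") == "failed" for s in steps):
        return "failed"
    return "retried" if retries else "success"


def normalize_sketch(doc):
    """Return a normalized copy of the trace document."""
    steps = [dict(step) for step in doc.get("steps", [])]
    retries = infer_retries(steps)
    return {
        **doc,
        "created_at": to_utc(doc["created_at"]),
        "steps": steps,
        "retries": retries,
        "status": infer_status(steps, doc.get("failure_reason"), retries),
    }
```

For example, a document whose steps call a tool named search twice and calc once yields a retry count of 1 under this rule, and a created_at of 16:22:10+02:00 is stored as 14:22:10 UTC.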

Sync/Async Bridge

The CLI runs synchronously (Typer), but all MongoDB operations are async (Motor). TraceLLM bridges this gap with a persistent event loop in db.py:

db.py bridge logic
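import asyncio

# PERSISTENT_LOOP, init_persistent_loop(), and _handle_task_exception()
# are assumed to be defined elsewhere in db.py.

def _run_async(coro):
    # get_running_loop() raises RuntimeError when no event loop is
    # running in the current thread, as in a plain CLI invocation.
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        loop = None

    if loop is not None and loop.is_running():
        # Inside FastAPI: schedule on the existing loop
        task = loop.create_task(coro)
        task.add_done_callback(_handle_task_exception)
    else:
        # In the CLI: use a persistent loop to avoid
        # "event loop is closed" errors
        if PERSISTENT_LOOP is None:
            init_persistent_loop()
        asyncio.run_coroutine_threadsafe(coro, PERSISTENT_LOOP)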

Info

The persistent event loop pattern prevents the common "event loop is closed" error that occurs with repeated asyncio.run() calls from synchronous code. It is initialized once and reused across all CLI commands.

Project & API Key Model

Projects and API keys provide multi-tenant trace isolation. API keys use a tlm_sk_ prefix followed by 32 cryptographically random characters. When a key is provided to @trace or the CLI, the project ID, name, and environment are resolved from the key record in MongoDB. Keys can be scoped to specific environments (development, staging, production) for fine-grained access control.

API key creation
POST /projects?name=my-app&environment=production&description=...
Response:
{
  "project": {
    "project_id": "my-app",
    "name": "my-app",
    "description": "...",
    "created_at": "2026-05-31T14:22:10"
  },
  "api_key": {
    "key": "tlm_sk_a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6",
    "project_id": "my-app",
    "environment": "production",
    "created_at": "2026-05-31T14:22:10"
  }
}
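
A key in the format shown above could be generated with Python's secrets module, as in the sketch below. The function names and the alphabet (lowercase letters and digits, inferred from the sample key) are assumptions; the actual generator in project_service.py may differ.

```python
import secrets
import string

KEY_PREFIX = "tlm_sk_"
KEY_LENGTH = 32
# Assumed alphabet, inferred from the sample key in the response above.
ALPHABET = string.ascii_lowercase + string.digits


def generate_api_key():
    """Return the prefix followed by 32 characters from a CSPRNG."""
    body = "".join(secrets.choice(ALPHABET) for _ in range(KEY_LENGTH))
    return KEY_PREFIX + body


def looks_like_api_key(value):
    """Check the prefix, length, and alphabet; does not look the key up."""
    if not value.startswith(KEY_PREFIX):
        return False
    body = value[len(KEY_PREFIX):]
    return len(body) == KEY_LENGTH and all(c in ALPHABET for c in body)
```

secrets.choice() draws from the operating system's cryptographically secure random source, which is the usual choice for tokens; the random module is not suitable for this purpose.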

Directory Structure

Full project structure
tracellm/
├── backend/                      # Python backend (FastAPI + SDK + CLI)
│   ├── app/                      # FastAPI REST API
│   │   ├── main.py               # App creation, CORS, startup/shutdown
│   │   ├── database/
│   │   │   ├── mongodb.py        # Motor connection manager
│   │   │   ├── trace_service.py  # Trace CRUD, normalization, analytics
│   │   │   └── project_service.py # Project CRUD, API key generation
│   │   ├── models/
│   │   │   ├── trace.py          # TraceSchema, StepSchema (Pydantic)
│   │   │   ├── trace_model.py    # List, Analytics, Failure response models
│   │   │   ├── project.py        # Project, ApiKey schemas
│   │   │   └── health.py         # Health check model
│   │   ├── routes/
│   │   │   ├── health.py         # GET /
│   │   │   ├── observability.py  # GET /traces, /analytics, /failures
│   │   │   └── projects.py       # GET/POST /projects, GET /api-keys
│   │   └── websocket/
│   │       └── socket.py         # /ws endpoint, ConnectionManager
│   ├── tracellm/                 # SDK + CLI package
│   │   ├── __init__.py           # Public API: @trace, wrap_openai, etc.
│   │   ├── cli.py                # Typer CLI (start, trace, replay, ...)
│   │   ├── tracer.py             # @trace decorator, payload builder
│   │   ├── replay.py             # Replay engine
│   │   ├── monitor.py            # Live terminal monitor (htop-for-AI)
│   │   ├── exporter.py           # JSON/CSV export
│   │   ├── db.py                 # Sync/async MongoDB bridge
│   │   ├── startup.py            # Stack boot (uvicorn subprocess)
│   │   ├── trace_stream.py       # Live console event stream
│   │   ├── utils.py              # Styling, token estimation, tables
│   │   ├── integrations/
│   │   │   ├── openai.py         # OpenAI wrapper
│   │   │   ├── langchain.py      # LangChain callback handler
│   │   │   └── tool_tracer.py    # @trace_tool decorator
│   │   └── examples/             # Usage examples
│   └── .env                      # Local env config
├── frontend/                     # Next.js dashboard (port 3000)
│   ├── app/                      # Pages: /, /traces, /analytics, /failures, /live-logs, /settings
│   ├── components/               # React components
│   │   ├── providers/observability-provider.tsx  # WebSocket context
│   │   ├── console/              # Console UI components
│   │   └── ui/                   # shadcn/ui primitives
│   ├── hooks/                    # use-observability-data, use-websocket-logs
│   └── lib/                      # api.ts, types.ts, format.ts
└── website/                      # Marketing site + docs (port 3001)
    ├── app/                      # Landing page + docs
    └── components/docs/          # Documentation components