Architecture
Understand TraceLLM internals and system boundaries.
System Architecture
TraceLLM is an observability stack that runs entirely on your own infrastructure. Every component (CLI, SDK, API server, database, WebSocket broker, and dashboard) can operate on a single machine with zero cloud dependencies.
System Layout
┌─────────────────────────────────────────────────────────────┐
│                     User / Application                      │
│   tracellm trace "..."  │  @trace decorator  │  SDK code    │
└──────────────────────────────┬──────────────────────────────┘
                               │ trace payload
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       FastAPI Backend                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │ REST API     │  │ WebSocket    │  │ Connection       │   │
│  │ /traces      │  │ /ws          │  │ Manager          │   │
│  │ /analytics   │  │ broadcast    │  │ (asyncio.Lock)   │   │
│  │ /failures    │  │ trace.created│  │ auto-prune       │   │
│  │ /projects    │  │ system.conn. │  │ stale conns      │   │
│  └──────┬───────┘  └──────┬───────┘  └──────────────────┘   │
│         │                 │                                 │
└─────────┼─────────────────┼─────────────────────────────────┘
          │                 │
          ▼                 ▼
┌──────────────────┐  ┌──────────────────────────────────┐
│ MongoDB          │  │ WebSocket Clients                │
│ ┌──────────────┐ │  │ ┌──────────┐  ┌──────────────┐   │
│ │ traces       │ │  │ │Dashboard │  │ CLI Monitor  │   │
│ │ projects     │ │  │ │(Next.js) │  │ (tracellm    │   │
│ │ api_keys     │ │  │ │port 3000 │  │  monitor)    │   │
│ └──────────────┘ │  │ └──────────┘  └──────────────┘   │
│ Indexed: trace_id│  └──────────────────────────────────┘
│ created_at,      │
│ status, model    │
└──────────────────┘
Trace Execution Flow
Every trace passes through six stages from instrumentation to consumption:
User
 │
 ├── 1. Instrumentation ──► @trace decorator or CLI wraps function call,
 │                          captures start time, project context
 ▼
SDK/Tracer
 │
 ├── 2. Trace Capture ──► build_trace_payload() assembles metadata,
 │                        steps, timing, status, retries
 ▼
Trace Payload
 │
 ├── 3. Persist ──► normalize_trace_document() validates against schemas,
 │                  inserts into MongoDB "traces" collection
 ▼
MongoDB
 │
 ├── 4. Broadcast ──► ConnectionManager.broadcast() sends trace.created
 │                    event to all connected WebSocket clients
 ▼
WebSocket Layer
 │
 ├── 5. Consume ──► Dashboard receives event, updates trace list in
 │                  real time without polling
 ▼
Dashboard
 │
 ├── 6. Replay ──► CLI replay fetches trace from MongoDB, renders
 │                 step-by-step execution tree in terminal
 ▼
Replay
Local-First Architecture
TraceLLM is designed around a local-first philosophy. Every component — SDK, API, database, WebSocket broker, and dashboard — runs on your own infrastructure. There is no SaaS backend, no telemetry service, and no data egress.
No Cloud Dependencies
MongoDB runs locally (or on your own Atlas cluster). The API, WebSocket endpoint, and dashboard are local processes. No data leaves infrastructure you control.
Single-Command Stack
tracellm start boots the entire stack: FastAPI on port 8000, the WebSocket endpoint at /ws, and automatic MongoDB detection. No Docker Compose required.
Offline-Capable
The @trace decorator and CLI work without MongoDB. Traces are finalized in memory; persistence gracefully degrades.
MIT Licensed
Fully open-source. No paid tiers, no usage limits, no vendor lock-in. You own all your data and infrastructure.
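The Offline-Capable behavior above can be pictured as best-effort persistence: the trace is always finalized in memory, and a storage failure is recorded rather than raised. The sketch below makes that assumption explicit. The finalize_trace signature, the persist callback, and the IN_MEMORY_TRACES buffer are all hypothetical and may not match the real implementation.

```python
from typing import Callable, Optional

# Hypothetical in-memory buffer; not part of TraceLLM's documented API.
IN_MEMORY_TRACES: list = []


def finalize_trace(payload: dict,
                   persist: Optional[Callable[[dict], None]] = None) -> dict:
    """Always keep the trace in memory; treat persistence as best effort."""
    trace = dict(payload, persisted=False)
    IN_MEMORY_TRACES.append(trace)
    if persist is None:
        return trace  # no database configured: nothing more to do
    try:
        persist(trace)
        trace["persisted"] = True
    except Exception as exc:  # e.g. the database is unreachable
        trace["persist_error"] = str(exc)
    return trace


def unreachable_store(_trace: dict) -> None:
    """Stand-in for a database write that fails."""
    raise ConnectionError("MongoDB is not reachable")


offline = finalize_trace({"trace_id": "t-1", "status": "success"},
                         persist=unreachable_store)
```

In this sketch the caller still gets a complete trace dict back when the write fails, so rendering or replay from memory can proceed while the error stays visible in persist_error.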
System Components
The system is composed of six components that work together to capture, store, stream, and visualize trace data:
CLI (tracellm)
Provides commands for starting the stack (start), running traces (trace), replaying executions (replay), monitoring live events (monitor), and exporting data (export). When invoked without arguments, the CLI renders a Rich TUI command palette. It also offers a full-screen live monitor dashboard, like htop for AI, fed by the WebSocket.
SDK (@trace decorator)
Provides the @trace decorator for automatic instrumentation of any function. It captures prompts, responses, latency, token usage, tool calls, retries, and errors. The decorator supports both sync and async functions via inspect.iscoroutinefunction() auto-detection, uses contextvars.ContextVar for nested step collection, and includes integrations for OpenAI, Groq, and LangChain.
API Server (FastAPI)
Exposes REST endpoints for traces (GET /traces, GET /traces/{id}), analytics (GET /analytics), failures (GET /failures), projects (GET/POST /projects, GET /api-keys), and health (GET /). All responses are JSON with Pydantic v2 validation. CORS is configured to allow all origins for local development.
MongoDB (Motor)
Async MongoDB access through Motor, with a sync/async bridge in db.py that uses automatic event-loop detection. Three collections are used: traces, projects, and api_keys, each with appropriate indexes.
WebSocket Layer
Real-time event channel served at /ws. The ConnectionManager class tracks all active connections and guards them with an asyncio.Lock, so concurrent coroutines cannot modify the connection set at the same time. When a trace is persisted, a trace.created event is broadcast to all connected clients. Stale connections are automatically pruned during broadcast. The dashboard and CLI monitor both subscribe to this channel for real-time updates.
Dashboard (Next.js)
The ObservabilityProvider React context manages the WebSocket connection and propagates events to all child components.
Data Flow
@trace decorator / CLI
    │
    │  1. Record start time (datetime.utcnow + perf_counter)
    │  2. Resolve project context (API key lookup in MongoDB)
    │  3. Set ContextVar for step collection
    ▼
Function execution (sync or async)
    │
    │  @trace_tool and integration calls append steps
    │  to the parent's ContextVar.collected_steps list
    ▼
finally block
    │
    │  1. Compute latency = perf_counter delta
    │  2. build_trace_payload(): assemble full trace dict
    │  3. finalize_trace(): persist + broadcast + render
    ▼
save_trace() in trace_service.py
    │
    ├──► normalize_trace_document()
    │       ├── Coerce timestamps to UTC
    │       ├── Normalize step fields
    │       ├── Infer retry count (duplicate tool names)
    │       ├── Infer status (from steps, failure_reason, retries)
    │       └── Validate against TraceSchema + StepSchema
    │
    ├──► collection.insert_one(document) ──► MongoDB "traces"
    │
    └──► manager.broadcast(trace.created) ──► WebSocket clients
              │
              ├── Dashboard: updates trace list + live logs
              ├── CLI monitor: refreshes live dashboard
              └── Other clients: any WS subscriber
Sync/Async Bridge
The CLI runs synchronously (Typer), but all MongoDB operations are async (Motor). TraceLLM bridges this gap with a persistent event loop in db.py:
import asyncio

# PERSISTENT_LOOP, init_persistent_loop(), and _handle_task_exception()
# are assumed to be module-level names defined elsewhere in db.py.
def _run_async(coro):
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        loop = None
    if loop is not None and loop.is_running():
        # Inside FastAPI: schedule on the existing loop
        task = loop.create_task(coro)
        task.add_done_callback(_handle_task_exception)
    else:
        # In CLI: use a persistent loop to avoid
        # "event loop is closed" errors
        if PERSISTENT_LOOP is None:
            init_persistent_loop()
        asyncio.run_coroutine_threadsafe(coro, PERSISTENT_LOOP)
Info
The persistent loop avoids repeated asyncio.run() calls from synchronous code. It is initialized once and reused across all CLI commands.
Project & API Key Model
Projects and API keys provide multi-tenant trace isolation. API keys use a tlm_sk_ prefix followed by 32 cryptographically random characters. When a key is provided to @trace or the CLI, the project ID, name, and environment are resolved from the key record in MongoDB. Keys can be scoped to specific environments (development, staging, production) for fine-grained access control.
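A key of this shape can be produced with Python's secrets module. The sketch below is illustrative only: the function names, the record layout, and the alphabet (lowercase letters and digits) are assumptions, and an in-memory list stands in for the MongoDB api_keys collection.

```python
import secrets
import string
from typing import Optional

KEY_PREFIX = "tlm_sk_"
KEY_LENGTH = 32
# Assumed alphabet; the source only says "cryptographically random characters".
KEY_ALPHABET = string.ascii_lowercase + string.digits
ENVIRONMENTS = ("development", "staging", "production")


def generate_api_key() -> str:
    """Return the prefix followed by 32 characters drawn from a CSPRNG."""
    body = "".join(secrets.choice(KEY_ALPHABET) for _ in range(KEY_LENGTH))
    return KEY_PREFIX + body


def new_key_record(project_id: str, name: str, environment: str) -> dict:
    """Build a key record scoped to one environment (hypothetical layout)."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return {
        "key": generate_api_key(),
        "project_id": project_id,
        "name": name,
        "environment": environment,
    }


def resolve_project(api_key: str, records: list) -> Optional[dict]:
    """Look up project context for a key; a list stands in for the database."""
    for record in records:
        # Constant-time comparison avoids leaking key contents through timing.
        if secrets.compare_digest(record["key"], api_key):
            return {k: record[k] for k in ("project_id", "name", "environment")}
    return None


records = [new_key_record("my-app", "my-app", "production")]
resolved = resolve_project(records[0]["key"], records)
```

Here resolved carries the project ID, name, and environment for the matching key, and an unknown key resolves to None, which a caller could treat as "no project context".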
POST /projects?name=my-app&environment=production&description=...
Response:
{
  "project": {
    "project_id": "my-app",
    "name": "my-app",
    "description": "...",
    "created_at": "2026-05-31T14:22:10"
  },
  "api_key": {
    "key": "tlm_sk_a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6",
    "project_id": "my-app",
    "environment": "production",
    "created_at": "2026-05-31T14:22:10"
  }
}
Directory Structure
tracellm/
├── backend/  # Python backend (FastAPI + SDK + CLI)
│   ├── app/  # FastAPI REST API
│   │   ├── main.py  # App creation, CORS, startup/shutdown
│   │   ├── database/
│   │   │   ├── mongodb.py  # Motor connection manager
│   │   │   ├── trace_service.py  # Trace CRUD, normalization, analytics
│   │   │   └── project_service.py  # Project CRUD, API key generation
│   │   ├── models/
│   │   │   ├── trace.py  # TraceSchema, StepSchema (Pydantic)
│   │   │   ├── trace_model.py  # List, Analytics, Failure response models
│   │   │   ├── project.py  # Project, ApiKey schemas
│   │   │   └── health.py  # Health check model
│   │   ├── routes/
│   │   │   ├── health.py  # GET /
│   │   │   ├── observability.py  # GET /traces, /analytics, /failures
│   │   │   └── projects.py  # GET/POST /projects, GET /api-keys
│   │   └── websocket/
│   │       └── socket.py  # /ws endpoint, ConnectionManager
│   ├── tracellm/  # SDK + CLI package
│   │   ├── __init__.py  # Public API: @trace, wrap_openai, etc.
│   │   ├── cli.py  # Typer CLI (start, trace, replay, ...)
│   │   ├── tracer.py  # @trace decorator, payload builder
│   │   ├── replay.py  # Replay engine
│   │   ├── monitor.py  # Live terminal monitor (htop-for-AI)
│   │   ├── exporter.py  # JSON/CSV export
│   │   ├── db.py  # Sync/async MongoDB bridge
│   │   ├── startup.py  # Stack boot (uvicorn subprocess)
│   │   ├── trace_stream.py  # Live console event stream
│   │   ├── utils.py  # Styling, token estimation, tables
│   │   ├── integrations/
│   │   │   ├── openai.py  # OpenAI wrapper
│   │   │   ├── langchain.py  # LangChain callback handler
│   │   │   └── tool_tracer.py  # @trace_tool decorator
│   │   └── examples/  # Usage examples
│   └── .env  # Local env config
├── frontend/  # Next.js dashboard (port 3000)
│   ├── app/  # Pages: /, /traces, /analytics, /failures, /live-logs, /settings
│   ├── components/  # React components
│   │   ├── providers/observability-provider.tsx  # WebSocket context
│   │   ├── console/  # Console UI components
│   │   └── ui/  # shadcn/ui primitives
│   ├── hooks/  # use-observability-data, use-websocket-logs
│   └── lib/  # api.ts, types.ts, format.ts
└── website/  # Marketing site + docs (port 3001)
    ├── app/  # Landing page + docs
    └── components/docs/  # Documentation components