Groq Example
Trace Groq API calls through an OpenAI-compatible client.
Overview
Groq exposes an OpenAI-compatible API, so the TraceLLM OpenAI integration works directly. The only change is setting base_url and using a Groq API key. This example runs llama-3.3-70b-versatile on Groq hardware.
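To make the "only base_url and API key change" point concrete, here is a minimal sketch of a provider lookup that builds the two constructor arguments. The `PROVIDERS` table and the `client_kwargs` helper are hypothetical names introduced for illustration and are not part of TraceLLM or the OpenAI SDK. The Groq endpoint and `GROQ_API_KEY` come from this example; the OpenAI entry uses the SDK's standard endpoint and environment variable.

```python
import os
from typing import Mapping, Optional

# Hypothetical lookup table for this sketch (not part of TraceLLM).
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
    "groq": {"base_url": "https://api.groq.com/openai/v1", "key_env": "GROQ_API_KEY"},
}


def client_kwargs(provider: str, env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the base_url/api_key keyword arguments for an OpenAI-style client."""
    env = os.environ if env is None else env
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    cfg = PROVIDERS[provider]
    key = env.get(cfg["key_env"])
    if not key:
        # Fail early with the variable name instead of a bare KeyError.
        raise RuntimeError(f"{cfg['key_env']} is not set")
    return {"base_url": cfg["base_url"], "api_key": key}
```

Under these assumptions the client would be built with `OpenAI(**client_kwargs("groq"))` and then wrapped exactly as in the example; switching providers changes only the string passed in.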
Code
groq_example.py
python
import os
import sys

# Make the parent directory importable when this file is run from its own folder.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from openai import OpenAI
from tracellm import trace
from tracellm.integrations.openai import wrap_openai

# Point the standard OpenAI client at Groq's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)
# Wrap the client so chat completions show up as steps in the trace.
client = wrap_openai(client)


@trace(
    prompt="groq_inference",
    model_name="llama-3.3-70b-versatile",
    project="multi-provider",
    environment="development",
)
def run_groq(prompt: str) -> str:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=1024,
    )
    # message.content is typed as optional in the OpenAI SDK; fall back to "".
    return response.choices[0].message.content or ""


if __name__ == "__main__":
    result = run_groq(
        "Explain how Groq's LPU inference architecture achieves "
        "low latency compared to traditional GPU-based inference."
    )
    print(f"\nResponse ({len(result)} chars):\n{result}")
Warning
Set the key with export GROQ_API_KEY="gsk_..." before running. Keys are available at console.groq.com.
Expected Output
Console output
text
╭── TraceLLM Trace ───────────────────────────── SUCCESS ──╮
│                                                          │
│  Trace ID      tr_d7c4b2e9                               │
│  Prompt        groq_inference                            │
│  Model         llama-3.3-70b-versatile                   │
│  Project       multi-provider                            │
│  Environment   development                               │
│  Latency       542.18 ms                                 │
│  Token Count   267                                       │
│  Retries       0                                         │
│  Steps         1                                         │
│  Status        SUCCESS                                   │
│                                                          │
╰──────────────────────────────────────────────────────────╯

 #   Tool          Duration   Status   Detail
 1   openai_chat   542ms      OK

Response (891 chars):
Groq's LPU (Language Processing Unit) achieves low latency by using a deterministic, sequential processor architecture specifically designed for LLM inference workloads. Unlike GPUs, which rely on massive parallel SIMT execution and face memory bandwidth bottlenecks from HBM, the LPU eliminates the need for external memory lookups during autoregressive decoding. Its near-calculator compute model enables tokens to be processed in a single pass through the silicon, reducing per-token latency by 10-50x compared to GPU-based inference for models like Llama. This makes Groq ideal for real-time applications where response time is critical.
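The latency figure in the trace can be cross-checked by timing the same call by hand. The sketch below is a plain wall-clock wrapper; `timed_call` is a hypothetical helper written for this page, and nothing here describes how TraceLLM measures latency internally.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without network access or an API key.
    value, ms = timed_call(sum, range(1_000_000))
    print(f"result={value} latency={ms:.2f} ms")
```

With the client from the example, the same helper could wrap `client.chat.completions.create`, and the returned response's `usage.total_tokens` (when the API populates `usage`) gives a token count to compare against the trace. Small differences from the traced latency are to be expected, since the two measurements need not start and stop at the same points.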
Dashboard Result
Open http://localhost:3000/traces to see the Groq trace:
Dashboard UI
text
TraceLLM Dashboard > Traces
Status     Trace ID     Prompt          Model                    Latency  Tokens  Time
─────────  ───────────  ──────────────  ───────────────────────  ───────  ──────  ───────────────────
● Success  tr_d7c4b2e9  groq_inference  llama-3.3-70b-versatile  542 ms   267     2026-05-31 14:23:45
> Detail view summary bar:
Model: llama-3.3-70b-versatile | Latency: 542 ms | Tokens: 267
Retries: 0 | Steps: 1 | At: 2026-05-31 14:23:45
> The Analytics page (/analytics) groups this trace under the
"multi-provider" project, showing it alongside OpenAI traces for
cross-provider latency and cost comparisons.
Replay Result
Replay the trace to see the step execution timeline:
terminal
bash
tracellm replay tr_d7c4b2e9 --speed 2.0
Replay output
text
╭────────────────── Replaying execution timeline... ──────────────────╮
│                                                                     │
│ ╭─ Replay ────────────────────────────────────────────────────────╮ │
│ │                                                                 │ │
│ │  trace_id    tr_d7c4b2e9                                        │ │
│ │  status      SUCCESS                                            │ │
│ │  latency     542.18 ms                                          │ │
│ │  retries     0                                                  │ │
│ │  steps       1                                                  │ │
│ │                                                                 │ │
│ ╰─────────────────────────────────────────────────────────────────╯ │
│                                                                     │
│ ╭─ Step 1/1 ──────────────────────────────────────────────────────╮ │
│ │                                                                 │ │
│ │  step        1/1                                                │ │
│ │  tool        openai_chat                                        │ │
│ │  duration    542 ms                                             │ │
│ │  status      OK                                                 │ │
│ │  input       {'model': 'llama-3.3-70b-versatile', ...}          │ │
│ │  output      {'content': "Groq's LPU (Language...",             │ │
│ │               'usage': {'total_tokens': 267}}                   │ │
│ │                                                                 │ │
│ ╰─────────────────────────────────────────────────────────────────╯ │
╰─────────────────────────────────────────────────────────────────────╯
Replay complete