Langfuse is excellent. Honeycomb is excellent. If you are running a production AI system at scale, use one. They handle distributed tracing, sampling, retention policies, and anomaly detection in ways that a DynamoDB table cannot.

But most teams running their first AI agents are not at that scale. They have four Lambda functions, a handful of daily invocations, and a free tier budget. They need to know: did the model actually run? How much did it cost? What did it decide? Is latency getting worse over time?

All of that fits in a DynamoDB table and 40 lines of Python.

[Trace viewer: one SRE Agent heartbeat run (run_id sre-20260327-0800, 250ms) expanded into its child nodes: CloudWatch GetMetricData tool call (120ms), Lambda GetFunctionConfiguration tool call (85ms), DynamoDB PutItem write (45ms). Run summary: 3 tool calls, 0 LLM calls, 0 tokens, $0.00.]
One trace, every operation. No Langfuse required.

The trace record schema

The schema has to be stable. If different agents write different shapes of record, the aggregation code becomes case analysis. One function, one schema:

from typing import Optional

def write_llm_trace(
    agent_slug: str,     # "cto" | "sre" | "security" | "cost"
    model: str,          # "claude-haiku-4-5"
    input_tokens: int,
    output_tokens: int,
    latency_ms: int,
    prompt_preview: str = "",   # hashed, not stored
    decision_summary: str = "", # plain text, truncated to 300 chars
    run_id: Optional[str] = None,
    cost_usd: Optional[float] = None,
) -> Optional[str]:
    ...

The record is written to the existing team-activity DynamoDB table under PK=TYPE#llm_trace. No new table. No schema migration. The TTL is 30 days, same as all other records in the table.
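A minimal sketch of what that record might look like, assuming the attribute names from the function signature above (the key layout, helper name, and SK format are illustrative, not confirmed by the source):

```python
import hashlib
import time
import uuid

def build_trace_item(agent_slug, model, input_tokens, output_tokens,
                     latency_ms, prompt_preview="", decision_summary="",
                     run_id=None, cost_usd=None):
    """Illustrative item shape for PK=TYPE#llm_trace (attribute names assumed)."""
    now = int(time.time())
    if cost_usd is None:
        # Same per-token rates as the cost estimation section below.
        cost_usd = input_tokens * 0.00000025 + output_tokens * 0.00000125
    return {
        "PK": "TYPE#llm_trace",
        "SK": f"TRACE#{now}#{uuid.uuid4().hex[:8]}",   # hypothetical sort key
        "GSI1PK": f"AGENT#{agent_slug}",               # enables per-agent queries
        "agent_slug": agent_slug,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "prompt_hash": hashlib.sha256(prompt_preview.encode()).hexdigest()[:16]
                       if prompt_preview else "",
        "decision_summary": decision_summary[:300],
        "run_id": run_id or "",
        "cost_usd": cost_usd,
        "ttl": now + 30 * 24 * 3600,   # 30-day TTL, same as other records
    }
```

Everything lands in the existing table, so the only write-path change is one `put_item` call.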

What to store and what not to store

The raw prompt is the obvious first thing to store for debugging. It is also the first thing you will regret storing. If the prompt contains user data, it becomes a GDPR retention problem. If it contains API keys or internal context, it becomes a security problem. If it is large (most prompts are), it inflates your DynamoDB storage costs.

The right instrument is a hash: SHA-256 of the first 200 characters of the prompt. This tells you whether two runs used the same prompt, and when a prompt template changed between runs, without retaining the content:

import hashlib

prompt_hash = (
    hashlib.sha256(prompt_preview.encode()).hexdigest()[:16]
    if prompt_preview
    else ""
)

The decision summary is different. This is the first paragraph of the model's output, truncated to 300 characters. It is not PII in this context (the CTO agent summarises public GitHub data). Storing it lets you see what the agent concluded without replaying the full interaction.

Cost estimation from token counts

Anthropic publishes per-token prices; the rates used here are $0.25 per million input tokens and $1.25 per million output tokens. The estimation is straightforward:

if cost_usd is None:
    cost_usd = (input_tokens * 0.00000025) + (output_tokens * 0.00000125)

This is an estimate. Actual billing may differ by a few percent depending on caching, batch discounts, and API version. For budget monitoring purposes it is accurate enough. For exact billing, use the Anthropic usage dashboard.

Aggregating across the last 50 traces gives a running spend estimate without calling any billing API. The observability frontend computes this client-side from the trace records returned by the API.
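The client-side aggregation is a sum over whatever the API returns (a sketch, using the cost_usd field from the schema; the sample values are illustrative):

```python
def running_spend(traces):
    """Sum estimated cost across trace records (client-side, no billing API)."""
    return sum(float(t.get("cost_usd", 0) or 0) for t in traces)

traces = [
    {"cost_usd": 0.0004},
    {"cost_usd": 0.0007},
    {},  # record written before cost tracking was added
]
print(f"${running_spend(traces):.4f}")  # → $0.0011
```

Missing or null cost fields count as zero, so older records do not break the total.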

Latency percentiles

Latency data only becomes useful once you have enough records to compute percentiles. The API returns p50 and p95 across whatever traces are in the response. With 50 records and a daily agent schedule, that is roughly 50 days of history.

The implementation is a simple sort:

latencies = sorted(int(t.get("latency_ms", 0)) for t in traces if t.get("latency_ms"))
p50_ms = latencies[len(latencies) // 2] if latencies else 0
p95_ms = latencies[int(len(latencies) * 0.95)] if latencies else 0

This is not a sliding window. It is not a time-series. For a production system with thousands of daily invocations you would use CloudWatch Metrics or a dedicated APM tool. For a system with one invocation per day, this is enough.

The run_id correlation field

The CTO agent writes a run record first, then an LLM trace. The trace includes the SK of the run record as run_id. This lets the trace viewer show which run a given LLM call belongs to, and vice versa.

run_sk = write_run(agent_slug="cto", status="succeeded", ...)
write_llm_trace(
    agent_slug="cto",
    model="claude-haiku-4-5",
    ...
    run_id=run_sk,
)

The correlation is one-to-one here (one run, one LLM call). For agents that make multiple LLM calls per run (the deliberation engine, the code review pipeline), multiple trace records share the same run_id. The trace viewer can group them.
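One way the viewer could do that grouping (a sketch; the run_id values are illustrative):

```python
from collections import defaultdict

def group_by_run(traces):
    """Group trace records under their run_id so multi-call runs display together."""
    runs = defaultdict(list)
    for t in traces:
        runs[t.get("run_id") or "(no run)"].append(t)
    return dict(runs)

traces = [
    {"run_id": "RUN#cto#0800", "model": "claude-haiku-4-5"},
    {"run_id": "RUN#cto#0800", "model": "claude-haiku-4-5"},
    {"run_id": "RUN#sre#0800", "model": "claude-haiku-4-5"},
]
grouped = group_by_run(traces)
print({k: len(v) for k, v in grouped.items()})
# → {'RUN#cto#0800': 2, 'RUN#sre#0800': 1}
```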

What the GSI enables

The team-activity table has a GSI on GSI1PK=AGENT#{slug}. This enables per-agent queries without a scan. Filtering traces for the CTO agent:

from boto3.dynamodb.conditions import Key

resp = _table().query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("AGENT#cto"),
    ScanIndexForward=False,
    Limit=20,
    FilterExpression="PK = :pk",
    ExpressionAttributeValues={":pk": "TYPE#llm_trace"},
)

The filter expression runs after the key condition and after Limit is applied: DynamoDB reads the agent's 20 most recent records, then filters them down to traces, so a page can return fewer than 20 trace records. This is acceptable at small scale. At larger scale, a composite GSI key of AGENT#{slug}#llm_trace would avoid the filter. The single-table design trades query flexibility for infrastructure simplicity.
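A sketch of the composite-key variant, assuming trace writes populate GSI1PK as AGENT#{slug}#llm_trace (a different key shape from the current table, so it would require a backfill):

```python
def trace_query_kwargs(slug: str, limit: int = 20) -> dict:
    """Query kwargs for a composite-key GSI: the key condition alone
    selects traces, so no FilterExpression is needed and Limit counts
    actual trace records rather than records examined."""
    return {
        "IndexName": "GSI1",
        "KeyConditionExpression": "GSI1PK = :pk",
        "ExpressionAttributeValues": {":pk": f"AGENT#{slug}#llm_trace"},
        "ScanIndexForward": False,
        "Limit": limit,
    }

# resp = _table().query(**trace_query_kwargs("cto"))
```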

What this leaves out

Three things that a proper observability platform handles and this implementation does not:

Sampling. Every call is recorded. At 1 invocation/day this is fine. At 100/day it remains fine. At 10,000/day DynamoDB costs become meaningful and you should add a sampling rate to write_llm_trace().
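The sampling gate could be a one-liner at the top of write_llm_trace() (a sketch; the sample_rate parameter is an assumption, not part of the current signature):

```python
import random

def should_record(sample_rate: float = 1.0) -> bool:
    """Probabilistic sampling guard; 1.0 keeps every trace (today's behaviour)."""
    return random.random() < sample_rate

# Inside write_llm_trace(), before building the item:
# if not should_record(sample_rate=0.1):   # keep ~10% of traces at high volume
#     return None
```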

Alerting. The frontend shows you data. It does not alert when latency spikes or cost exceeds a threshold. A CloudWatch alarm on a custom metric derived from the trace records would close this gap.
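One way to feed such an alarm, assuming a custom namespace and metric names (all of them illustrative), would be to emit CloudWatch metrics alongside each trace write:

```python
def trace_metrics(agent_slug: str, latency_ms: int, cost_usd: float) -> list:
    """MetricData payload for CloudWatch PutMetricData.
    Namespace and metric names are assumptions, not part of the source."""
    dims = [{"Name": "Agent", "Value": agent_slug}]
    return [
        {"MetricName": "LLMLatency", "Dimensions": dims,
         "Value": latency_ms, "Unit": "Milliseconds"},
        {"MetricName": "LLMCost", "Dimensions": dims,
         "Value": cost_usd, "Unit": "None"},
    ]

# boto3.client("cloudwatch").put_metric_data(
#     Namespace="TeamAgents", MetricData=trace_metrics("cto", 250, 0.0004))
```

A standard CloudWatch alarm on LLMLatency p95 or a daily-sum alarm on LLMCost then closes the gap without touching the frontend.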

Distributed traces. When the CTO agent delegates to the Security agent via A2A, the two LLM calls are in separate records. The run_id correlation only handles the CTO agent's own call. A proper distributed trace would use a trace ID that propagates through the delegation chain. This is a non-trivial addition to the A2A protocol.
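The core of that addition can at least be sketched: the root agent mints one ID per run and every delegated A2A message carries it, so each agent's trace record shares the ID (the field name and ID format are assumptions):

```python
import uuid

def new_trace_id() -> str:
    """Root agent mints one trace ID per user-visible run."""
    return f"trace-{uuid.uuid4().hex[:12]}"

def with_trace_id(a2a_payload: dict, trace_id: str) -> dict:
    """Attach the trace ID to an outgoing A2A message (field name assumed)."""
    return {**a2a_payload, "trace_id": trace_id}

tid = new_trace_id()
msg = with_trace_id({"task": "review dependency CVEs"}, tid)
# The delegatee writes its own llm_trace record with the same trace_id,
# letting the viewer stitch the delegation chain together.
```

The hard part the sketch omits is making every hop honour the field, which is why it remains a non-trivial protocol change.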

[Dashboard: tokens/day and cost/day (~$1) over the last 7 days; latency p50 210ms, p95 410ms against a <500ms Lambda p95 target; error rate 0.3%, down from 0.8% over the last 30 days.]
Four numbers that tell you if your agents are healthy.

Patterns demonstrated

This work demonstrates three of the 21 agentic design patterns.

Pattern references from Agentic Design Patterns by Antonio Gulli (O'Reilly, 2025). See the pattern map for the full taxonomy.
