Langfuse is excellent. Honeycomb is excellent. If you are running a production AI system at scale, use one. They handle distributed tracing, sampling, retention policies, and anomaly detection in ways that a DynamoDB table cannot.
But most teams running their first AI agents are not at that scale. They have four Lambda functions, a handful of daily invocations, and a free tier budget. They need to know: did the model actually run? How much did it cost? What did it decide? Is latency getting worse over time?
This is enough for a DynamoDB table and 40 lines of Python.
The trace record schema
The schema has to be stable. If different agents write different shapes of record, the aggregation code becomes case analysis. One function, one schema:
from typing import Optional

def write_llm_trace(
    agent_slug: str,                      # "cto" | "sre" | "security" | "cost"
    model: str,                           # "claude-haiku-4-5"
    input_tokens: int,
    output_tokens: int,
    latency_ms: int,
    prompt_preview: str = "",             # hashed, not stored
    decision_summary: str = "",           # plain text, truncated to 300 chars
    run_id: Optional[str] = None,
    cost_usd: Optional[float] = None,
) -> Optional[str]:
The record is written to the existing team-activity DynamoDB table
under PK=TYPE#llm_trace. No new table. No schema migration. The TTL
is 30 days, same as all other records in the table.
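As a sketch, the item written for each trace looks roughly like this. The PK value and GSI key come from the article; the SK layout and the TTL attribute name are assumptions:

```python
import time
import uuid


def build_trace_item(agent_slug, model, input_tokens, output_tokens,
                     latency_ms, ttl_days=30):
    """Shape of a single llm_trace record. SK layout and 'ttl'
    attribute name are assumptions, not confirmed by the article."""
    now = int(time.time())
    return {
        "PK": "TYPE#llm_trace",
        "SK": f"{now}#{uuid.uuid4().hex[:8]}",   # assumed sort-key layout
        "GSI1PK": f"AGENT#{agent_slug}",          # per-agent GSI key
        "agent_slug": agent_slug,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "ttl": now + ttl_days * 86400,            # 30-day expiry
    }
```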
What to store and what not to store
The raw prompt is the obvious first thing to store for debugging. It is also the first thing you will regret storing. If the prompt contains user data, it becomes a GDPR retention problem. If it contains API keys or internal context, it becomes a security problem. If it is large (most prompts are), it inflates your DynamoDB storage costs.
The right instrument is a hash. SHA-256 of the first 200 characters of the prompt. This gives you three useful properties without storing the content:
- Deduplication. Two traces with the same prompt hash had the same input. If the CTO agent runs daily and the GitHub data hasn't changed, you will see identical hashes and identical outputs. That is useful signal.
- Stability tracking. If the prompt hash changes unexpectedly, something changed upstream. The hash doesn't tell you what changed, but it tells you when to investigate.
- Zero PII risk. A truncated SHA-256 hash is not reversible. You cannot reconstruct the prompt from the hash.
prompt_hash = (
    hashlib.sha256(prompt_preview.encode()).hexdigest()[:16]
    if prompt_preview
    else ""
)
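Wrapped as a standalone helper, the deduplication property is easy to see: identical inputs yield identical hashes, different inputs (almost certainly) do not.

```python
import hashlib


def prompt_hash(prompt: str) -> str:
    """Hash the first 200 characters of the prompt and keep the
    first 16 hex characters, as described above."""
    if not prompt:
        return ""
    return hashlib.sha256(prompt[:200].encode()).hexdigest()[:16]
```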
The decision summary is different. This is the first paragraph of the model's output, truncated to 300 characters. It is not PII in this context (the CTO agent summarises public GitHub data). Storing it lets you see what the agent concluded without replaying the full interaction.
Cost estimation from token counts
Anthropic publishes token prices. Claude Haiku 4 is $0.25 per million input tokens and $1.25 per million output tokens. The estimation is straightforward:
if cost_usd is None:
    cost_usd = (input_tokens * 0.00000025) + (output_tokens * 0.00000125)
This is an estimate. Actual billing may differ by a few percent depending on caching, batch discounts, and API version. For budget monitoring purposes it is accurate enough. For exact billing, use the Anthropic usage dashboard.
Aggregating across the last 50 traces gives a running spend estimate without calling any billing API. The observability frontend computes this client-side from the trace records returned by the API.
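A sketch of that client-side aggregation, using the per-million prices quoted above (the fallback-to-estimation behaviour is an assumption about how missing cost fields are handled):

```python
HAIKU_INPUT_USD = 0.25 / 1_000_000    # prices from the article
HAIKU_OUTPUT_USD = 1.25 / 1_000_000


def estimate_spend(traces):
    """Sum per-trace cost across a window of trace records,
    estimating from token counts when cost_usd is missing."""
    total = 0.0
    for t in traces:
        cost = t.get("cost_usd")
        if cost is None:
            cost = (t.get("input_tokens", 0) * HAIKU_INPUT_USD
                    + t.get("output_tokens", 0) * HAIKU_OUTPUT_USD)
        total += cost
    return total
```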
Latency percentiles
Latency data only becomes useful when you have enough records to compute percentiles. The API returns p50 and p95 across whatever traces are in the response. With a 50-record window and a daily agent schedule, that covers roughly the last 50 days of runs.
The implementation is a simple sort:
latencies = sorted(int(t["latency_ms"]) for t in traces if t.get("latency_ms"))
p50_ms = latencies[len(latencies) // 2] if latencies else 0
p95_ms = latencies[int(len(latencies) * 0.95)] if latencies else 0
This is not a sliding window. It is not a time-series. For a production system with thousands of daily invocations you would use CloudWatch Metrics or a dedicated APM tool. For a system with one invocation per day, this is enough.
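The same nearest-rank arithmetic can be wrapped in a small helper; the function name and sample values here are illustrative:

```python
def percentile(values, pct):
    """Nearest-rank percentile on a sorted copy; mirrors the
    index arithmetic used above. Returns 0 for an empty input."""
    if not values:
        return 0
    ordered = sorted(values)
    return ordered[int(len(ordered) * pct)]


latencies = [820, 910, 1050, 1200, 4800]
p50 = percentile(latencies, 0.50)   # index 2 -> 1050
p95 = percentile(latencies, 0.95)   # index 4 -> 4800
```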
The run_id correlation field
The CTO agent writes a run record first, then an LLM trace. The trace includes
the SK of the run record as run_id. This lets the trace viewer show
which run a given LLM call belongs to, and vice versa.
run_sk = write_run(agent_slug="cto", status="succeeded", ...)
write_llm_trace(
    agent_slug="cto",
    model="claude-haiku-4-5",
    ...,
    run_id=run_sk,
)
The correlation is one-to-one here (one run, one LLM call). For agents that make
multiple LLM calls per run (the deliberation engine, the code review pipeline),
multiple trace records share the same run_id. The trace viewer can
group them.
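The grouping itself is a one-pass bucket by run_id; a minimal sketch (traces without a run_id land under None):

```python
from collections import defaultdict


def group_traces_by_run(traces):
    """Bucket LLM trace records by the run they belong to, so the
    viewer can show all calls made during a single run together."""
    groups = defaultdict(list)
    for t in traces:
        groups[t.get("run_id")].append(t)
    return dict(groups)
```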
What the GSI enables
The team-activity table has a GSI on GSI1PK=AGENT#{slug}.
This enables per-agent queries without a scan. Filtering traces for the CTO agent:
resp = _table().query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("AGENT#cto"),
    ScanIndexForward=False,
    Limit=20,
    FilterExpression="PK = :pk",
    ExpressionAttributeValues={":pk": "TYPE#llm_trace"},
)
The filter expression runs after the GSI query, which means it scans all agent records
and filters to traces. This is acceptable at small scale. At larger scale, a composite
GSI key of AGENT#{slug}#llm_trace would avoid the filter. The single-table
design trades query flexibility for infrastructure simplicity.
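If you did adopt the composite-key variant, the only change at write time is how the GSI partition key is built; the key then selects traces directly and the FilterExpression disappears. A sketch (the helper name is hypothetical, the key layout is the alternative suggested above):

```python
def composite_gsi_key(agent_slug: str, record_type: str) -> str:
    """Bake the record type into the GSI partition key so a query
    returns only that record type, with no post-query filter."""
    return f"AGENT#{agent_slug}#{record_type}"
```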
What this leaves out
Three things that a proper observability platform handles and this implementation does not:
Sampling. Every call is recorded. At 1 invocation/day this is fine.
At 100/day it remains fine. At 10,000/day DynamoDB costs become meaningful and you
should add a sampling rate to write_llm_trace().
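A sampling gate is a one-line check at the top of write_llm_trace(); a sketch, with the parameter name as an assumption:

```python
import random


def should_record(sample_rate: float) -> bool:
    """Return True for roughly sample_rate of calls. A rate of 1.0
    records everything; 0.1 records about one call in ten."""
    return random.random() < sample_rate
```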
Alerting. The frontend shows you data. It does not alert when latency spikes or cost exceeds a threshold. A CloudWatch alarm on a custom metric derived from the trace records would close this gap.
Distributed traces. When the CTO agent delegates to the Security agent
via A2A, the two LLM calls are in separate records. The run_id correlation
only handles the CTO agent's own call. A proper distributed trace would use a trace ID
that propagates through the delegation chain. This is a non-trivial addition to the A2A
protocol.
Patterns demonstrated
This work demonstrates three of the 21 agentic design patterns:
- Pattern 11: Goal Setting. The decision summary captures what goal the agent concluded from its inputs. Storing it persistently means you can audit whether agent goals are drifting over time without replaying the full interaction.
- Pattern 19: Evaluation and Monitoring. The trace records are the monitoring substrate. Token counts, latency, and cost are the primary metrics. The prompt hash enables anomaly detection.
- Pattern 8: Memory and Inter-Agent Communication. The run_id field is a lightweight form of episodic memory: it links what an agent did (run record) to how it decided (LLM trace record).