AI Wisdom
๐Ÿ“Š

Observability & Evals

Tracing, evaluation, prompt management, and cost analytics for LLM systems.

Graduated ยท 1Incubating ยท 5Sandbox ยท 28 total
โ† All categories

Langfuse

Graduated
5/5

Open-source LLM engineering platform โ€” traces, evals, prompts

Best open-source observability suite. Traces every LLM call with token costs, latency, and full prompt/response. Self-hostable on Docker. Prompt management UI is excellent.

Phoenix (Arize)

Incubating
4/5

Open-source AI observability with embedding and RAG tracing

Strongest tool for RAG evaluation โ€” UMAP visualisation of embeddings, retrieval quality scoring, and hallucination detection. Run locally or in Arize Cloud.

Weave (W&B)

Incubating
4/5

Weights & Biases' LLM tracing and eval framework

Native W&B integration makes it ideal for teams already using wandb for ML experiments. Excellent trace/eval correlation. Strong Python SDK with minimal boilerplate.

Helicone

Incubating
4/5

LLM gateway with logging, caching, and cost analytics

Proxy-based approach means zero SDK changes โ€” add one header and get logging, caching, and rate-limiting. One-line integration is its superpower. Managed cloud only.

OpenLLMetry

Sandbox
3/5

OpenTelemetry-based observability for LLMs

Best choice if your org is OpenTelemetry-native. Routes LLM traces into your existing Grafana/Datadog stack. Vendor-neutral but still maturing on evaluation tooling.

Open Source

PromptLayer

Incubating
3/5

Prompt version control, A/B testing, and analytics platform

Excellent for non-engineers to trigger prompt experiments. Prompt registry with versioning is clean. Pairs well with LangChain. Limited eval depth compared to Langfuse.

Braintrust

Sandbox
3/5

End-to-end LLM evaluation and experimentation platform

Fastest SDK for building evaluation pipelines. Dataset versioning, scorer plugins, and team review workflows are polished. Growing fast in enterprise evaluations space.

Arize AI

Incubating
4/5

ML and LLM observability platform with drift detection and evaluation

Enterprise-grade observability covering both traditional ML and LLM workloads. Embedding drift detection, evaluation dashboards, and automated monitors. Pairs with Phoenix for open-source tracing.