📊

Observability & Evals

Tracing, evaluation, prompt management, and cost analytics for LLM systems.

Production · 1Stable · 5Experimental · 28 total

Arize AI

Stable

4/5

ML and LLM observability platform with drift detection and evaluation

Enterprise-grade observability covering both traditional ML and LLM workloads. Embedding drift detection, evaluation dashboards, and automated monitors. Pairs with Phoenix for open-source tracing.

Managed

Docs ↗

Braintrust

Experimental

3/5

End-to-end LLM evaluation and experimentation platform

Fastest SDK for building evaluation pipelines. Dataset versioning, scorer plugins, and team review workflows are polished. Growing fast in enterprise evaluations space.

Managed

Docs ↗

Helicone

Stable

4/5

LLM gateway with logging, caching, and cost analytics

Proxy-based approach means zero SDK changes — add one header and get logging, caching, and rate-limiting. One-line integration is its superpower. Managed cloud only.

Managed

Docs ↗

Langfuse

Production

5/5

Open-source LLM engineering platform — traces, evals, prompts

Best open-source observability suite. Traces every LLM call with token costs, latency, and full prompt/response. Self-hostable on Docker. Prompt management UI is excellent.

Hybrid

Article →Docs ↗

OpenLLMetry

Experimental

3/5

OpenTelemetry-based observability for LLMs

Best choice if your org is OpenTelemetry-native. Routes LLM traces into your existing Grafana/Datadog stack. Vendor-neutral but still maturing on evaluation tooling.

Open Source

Docs ↗

Phoenix (Arize)

Stable

4/5

Open-source AI observability with embedding and RAG tracing

Strongest tool for RAG evaluation — UMAP visualisation of embeddings, retrieval quality scoring, and hallucination detection. Run locally or in Arize Cloud.

Hybrid

Docs ↗

PromptLayer

Stable

3/5

Prompt version control, A/B testing, and analytics platform

Excellent for non-engineers to trigger prompt experiments. Prompt registry with versioning is clean. Pairs well with LangChain. Limited eval depth compared to Langfuse.

Managed

Docs ↗

Weave (W&B)

Stable

4/5

Weights & Biases' LLM tracing and eval framework

Native W&B integration makes it ideal for teams already using wandb for ML experiments. Excellent trace/eval correlation. Strong Python SDK with minimal boilerplate.

Managed

Docs ↗

Other Categories

🧠Text Generation & Reasoning20 tools 💻Code Generation12 tools 🎨Image Generation12 tools 🎬Video Generation10 tools 🎙️Speech & Audio10 tools 🔗Embedding Models10 tools 👁️Multimodal & Vision10 tools 🤖AI Agents & Platforms10 tools 🔧Frameworks & SDKs12 tools ☁️Cloud AI Platforms10 tools 🗄️Vector Databases10 tools 🔬Data & Fine-tuning10 tools 🚀Model Serving8 tools 🛡️Guardrails & Safety8 tools