Key takeaways
- LLM observability is not just logging -- it's about catching when your AI application produces fast, confident, wrong answers
- Langfuse is the go-to open-source option for most teams; LangSmith is the natural fit if you're already deep in LangChain
- Arize AI (via Phoenix) leads on evaluation and OpenTelemetry-native tracing; Helicone wins on setup speed and cost visibility
- The right tool depends on your stack, team size, and whether you need evaluation depth or just traffic/cost monitoring
- Several tools in this space are monitoring-only -- they show you data but don't help you act on it
Shipping an LLM-powered product and not knowing what's happening inside it is a specific kind of uncomfortable. You can see latency in your APM tool. You can see error rates. But you can't see whether GPT-4o just hallucinated a product feature to your customer, or whether your RAG pipeline is retrieving the wrong chunks 30% of the time.
That's the gap LLM observability tools are trying to close. The market has grown fast -- Research and Markets estimated it at $2.69 billion in 2026, projected to hit $9.26 billion by 2030. There are now 15+ tools competing for this space, and the differences between them matter more than the marketing suggests.
This guide focuses on the four tools teams most commonly evaluate: Langfuse, Arize AI (and its open-source sibling Phoenix), Helicone, and LangSmith. I'll also mention a few others worth knowing about.
What LLM observability actually means
Traditional observability -- logs, metrics, traces -- tells you whether a request completed and how long it took. That's necessary but not sufficient for LLM applications.
The harder problem: an LLM response can be fast, grammatically correct, on-brand, and completely wrong. It can hallucinate a citation. It can answer a question about your return policy with confident nonsense. None of that shows up in your p99 latency chart.
LLM observability adds a behavioral layer:
- Tracing: What happened at each step? Which prompt was sent, which model responded, what did the retrieval return?
- Evaluation: Was the response actually good? Did it stay grounded in the context? Was it safe?
- Cost tracking: Which users, features, or prompt templates are burning tokens?
- Monitoring and alerting: When quality drifts, when costs spike, when a new prompt version performs worse
Some tools do all four. Some focus on one or two. That distinction matters when you're picking one.
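To make the evaluation layer concrete, here's a minimal, tool-agnostic sketch of an LLM-as-judge groundedness check in Python -- the kind of logic these platforms either ship as a built-in metric or let you attach to traces yourself. The judge prompt, model choice, and 0/1 scoring scheme are illustrative, not taken from any specific vendor.

```python
# Hedged sketch of an LLM-as-judge groundedness check, tool-agnostic.
# The judge prompt and the 0/1 scoring scheme are illustrative choices,
# not any vendor's built-in metric.
from openai import OpenAI

client = OpenAI()

def groundedness_score(context: str, answer: str) -> int:
    """Return 1 if the answer appears supported by the context, else 0."""
    judge_prompt = (
        "You are grading an answer for groundedness.\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "Reply with exactly one word: GROUNDED or UNGROUNDED."
    )
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
    ).choices[0].message.content.strip().upper()
    return 1 if verdict.startswith("GROUNDED") else 0
```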
The four tools compared
Langfuse
Langfuse is open-source (MIT licensed), self-hostable, and has become the default choice for teams that want full control over their data. It handles tracing well across complex multi-step pipelines and agents, has a clean UI for inspecting individual traces, and supports prompt versioning so you can track which version of a prompt is live in production.
The evaluation story is decent but not deep out of the box -- you can attach scores to traces manually or via LLM-as-judge, but you're largely building your own evaluation logic. For teams that want to define their own quality metrics and run them on production traffic, that flexibility is a feature. For teams that want 50 pre-built metrics ready to go, it's a gap.
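As a rough illustration, here's what tracing plus a score might look like with Langfuse's Python decorator API -- a minimal sketch assuming the `@observe` decorator and `langfuse_context` helper; import paths and method names differ between SDK versions, so treat this as a shape rather than copy-paste code.

```python
# Minimal sketch: Langfuse decorator-based tracing with a manual score.
# Assumes the `langfuse` and `openai` packages and LANGFUSE_* env vars;
# helper names and import paths vary across Langfuse SDK versions.
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

client = OpenAI()

@observe()  # each decorated function becomes a span in the trace
def retrieve(query: str) -> list[str]:
    # Placeholder for a real vector search step
    return ["Returns are accepted within 30 days of purchase."]

@observe()  # the outermost call becomes the trace root
def answer(query: str) -> str:
    context = retrieve(query)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context: {context}"},
            {"role": "user", "content": query},
        ],
    )
    text = completion.choices[0].message.content
    # Attach a quality score to the current trace, e.g. from user feedback
    # or an LLM-as-judge check like the one sketched earlier.
    langfuse_context.score_current_trace(name="groundedness", value=1)
    return text
```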
Self-hosting is genuinely viable with Langfuse. The Docker setup is well-documented, and the cloud version is available if you don't want to manage infrastructure. Pricing on the cloud tier is reasonable for early-stage teams.
Best for: Teams that want open-source, self-hosting, or privacy-sensitive workloads. Also good for teams building on multiple frameworks (not just LangChain).
Arize AI and Phoenix
Arize AI is the commercial platform; Phoenix is its open-source evaluation and tracing library. They're related but distinct -- Phoenix can be used standalone, while Arize adds enterprise features, a managed platform, and deeper monitoring capabilities.
Phoenix's standout feature is OpenTelemetry-native tracing. If you care about avoiding vendor lock-in and want your LLM traces to fit into the same observability stack as the rest of your infrastructure, that matters. It also has strong support for RAG evaluation -- retrieval quality metrics, context relevance, groundedness -- which makes it a natural fit for teams building document Q&A or knowledge base applications.
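A rough sketch of what that looks like in practice, assuming the `arize-phoenix` and `openinference-instrumentation-openai` packages (module names and the `register` helper may differ across releases):

```python
# Minimal sketch: OpenTelemetry-based tracing into Phoenix.
# Assumes `arize-phoenix`, `openinference-instrumentation-openai`, and
# `openai` are installed; APIs may differ across Phoenix releases.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openai import OpenAI

px.launch_app()                # local Phoenix UI for browsing traces
tracer_provider = register()   # a standard OTel TracerProvider pointed at Phoenix
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, OpenAI calls are captured as OTel spans automatically,
# so they could also be exported to any other OTel-compatible backend.
client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What does our return policy say?"}],
)
```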
The evaluation depth is genuinely impressive. Arize has invested heavily in research-backed metrics, and the platform surfaces quality issues in production rather than just showing you traffic data.
The tradeoff: the commercial Arize platform is priced for enterprise teams. Phoenix is free and powerful, but you're doing more self-configuration. The learning curve is steeper than Helicone's.
Best for: Teams building RAG applications, teams that need OpenTelemetry compatibility, enterprise teams that want managed evaluation at scale.
Helicone
Helicone sits in a different category from the other three. It's primarily an AI gateway -- requests to OpenAI, Anthropic, and other providers route through Helicone's proxy, which logs everything automatically. Setup is genuinely fast: change your base URL, add an API key header, and you have cost tracking and request logging in minutes.
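The integration is roughly this -- a minimal sketch assuming Helicone's documented OpenAI-compatible proxy URL and auth header (check their docs for the current values):

```python
# Minimal sketch: routing OpenAI traffic through the Helicone proxy.
# Assumes Helicone's documented base URL and auth header; values are
# taken from their docs and may change.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # proxy instead of api.openai.com
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        # Optional: tag requests so cost is attributed per user or feature
        "Helicone-User-Id": "user-123",
    },
)

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize my last order."}],
)
```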
That simplicity is real. For teams that want visibility into costs and latency without a complex integration, Helicone is hard to beat. It also has caching built in, which can meaningfully reduce costs for applications with repeated queries.
The limitation is depth. Helicone doesn't have strong agent tracing for multi-step workflows, and the evaluation capabilities are basic compared to Langfuse or Arize. If you need to understand why a response was bad, not just that it was slow or expensive, Helicone alone won't get you there.
Best for: Teams that want fast setup, cost visibility, and caching. Good as a first layer of observability before you need deeper evaluation.
LangSmith
LangSmith is LangChain's observability platform. If you're building with LangChain or LangGraph, the integration is seamless -- tracing just works, and the debugging experience for multi-step chains and agents is excellent.
The platform covers the full loop: tracing, evaluation, dataset management, and prompt testing. The UI for inspecting agent runs -- seeing each tool call, each LLM invocation, the full execution tree -- is one of the better implementations in this space.
The catch is the LangChain dependency. LangSmith works best when you're using LangChain's abstractions. If you're using raw OpenAI or Anthropic SDKs, or a different framework like LlamaIndex, the integration is more work and the experience is less polished. It's also a closed-source, commercial product, which matters if you have data residency requirements.
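For that non-LangChain path, the usual route is LangSmith's decorator and SDK wrappers -- a minimal sketch assuming the `langsmith` package with its API key and tracing env vars set (the exact variable names have changed across releases):

```python
# Minimal sketch: LangSmith tracing without LangChain, via the
# `traceable` decorator and the OpenAI client wrapper.
# Assumes the `langsmith` and `openai` packages and that LangSmith's
# API key / tracing env vars are set (names vary by release).
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())   # logs each OpenAI call as a child run

@traceable(name="answer_question")  # the decorated call becomes the parent run
def answer(query: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    return completion.choices[0].message.content
```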
Best for: Teams already using LangChain or LangGraph. Strong for agent debugging specifically.
Feature comparison
| Feature | Langfuse | Arize / Phoenix | Helicone | LangSmith |
|---|---|---|---|---|
| Open source | Yes (MIT) | Phoenix yes, Arize no | Yes (core) | No |
| Self-hostable | Yes | Phoenix yes | Yes | No |
| Multi-step agent tracing | Strong | Strong | Basic | Strong (LangChain) |
| Built-in evaluation metrics | Basic | Strong | Minimal | Moderate |
| RAG-specific metrics | Manual | Strong | No | Moderate |
| Cost tracking | Yes | Yes | Strong | Yes |
| Prompt versioning | Yes | No | No | Yes |
| OpenTelemetry native | No | Yes | No | No |
| AI gateway / proxy | No | No | Yes | No |
| Caching | No | No | Yes | No |
| Dataset management | Yes | Yes | No | Yes |
| Free tier | Yes | Yes (Phoenix) | Yes | Yes |
| Best fit | Most teams | RAG / enterprise | Fast setup | LangChain users |
Other tools worth knowing
The four above aren't the only options. A few others come up regularly in team evaluations:
Braintrust is strong on prompt experimentation and evaluation workflows. If your team runs a lot of A/B tests on prompts and needs structured experiment tracking, it's worth a look.
Comet Opik (Apache 2.0) is a newer open-source option that's gained traction with teams already using Comet for ML experiment tracking.
LangWatch focuses on agent testing with simulated users and regression prevention -- useful if you're trying to catch quality regressions before they hit production.
Weights & Biases Weave extends W&B's experiment tracking into LLM territory. If your team already uses W&B for model training, Weave is a natural extension.
Humanloop covers prompt versioning and monitoring with a clean interface, particularly good for non-engineering stakeholders who need to participate in prompt management.
Promptfoo is worth mentioning for teams that want open-source LLM testing and red-teaming before production, rather than production monitoring.
How to choose
The honest answer is that the right tool depends on three things: your framework, your team's needs, and how much depth you need on evaluation.
Start with Helicone if you need something running in an afternoon and your primary concern is cost visibility and basic request logging. It's the fastest path from zero to some observability.
Choose Langfuse if you want open-source, self-hosting, or you're building on multiple frameworks. It's the most flexible all-in-one option and the community is active.
Choose LangSmith if you're building with LangChain or LangGraph and want the tightest possible integration. The agent debugging experience is genuinely good.
Choose Arize / Phoenix if you're building a RAG application and need serious evaluation depth, or if you're an enterprise team that needs managed observability with OpenTelemetry compatibility.
One thing to watch for: several tools in this space are monitoring-only. They show you traces and costs, but they don't help you evaluate quality or act on what you find. That gap matters more as your application matures. Early on, knowing what ran is enough. Later, you need to know whether it was good -- and what to do when it isn't.
A note on the broader observability picture
LLM observability tools handle the internal behavior of your AI application -- what happens inside your prompts, chains, and agents. That's different from external AI visibility, which is about how your brand and content appear when users ask AI search engines like ChatGPT or Perplexity about your product category.
If you're also thinking about the latter -- whether your company shows up when someone asks an AI model for recommendations in your space -- that's a separate problem handled by GEO platforms like Promptwatch.
They're complementary concerns: one is about the quality of the AI you're building, the other is about how AI talks about you.
The evaluation gap is the real differentiator
Looking across these tools in 2026, the clearest dividing line isn't open-source vs. commercial, or self-hosted vs. cloud. It's whether a tool stops at showing you what happened or goes further to tell you whether it was good.
Tracing is table stakes now. Every tool on this list does it reasonably well. The teams that get the most value from observability are the ones using evaluation -- running quality checks on production traffic, catching regressions before users do, and building feedback loops that actually improve their application over time.
That's where Arize and Langfuse (with custom evaluation logic) pull ahead of simpler tools. And it's the capability worth prioritizing as your application moves from prototype to production.