
Most teams ship LLM features without a single automated quality gate — then wonder why production behavior drifts from evals. This comparison runs Braintrust, Langfuse, and Arize through a concrete RAG chatbot harness, measuring where each platform earns its keep and where it breaks down.
Most LLM teams treat evaluation as something that happens after deployment — a manual spot-check, a few thumbs-up from stakeholders, and a prayer. That is the wrong order. A faithfulness regression in your RAG pipeline is a bug, and it deserves the same automated detection as a null pointer exception. The three platforms compared here — Braintrust, Langfuse, and Arize Phoenix — all attempt to bring that discipline to LLM evaluation, but they make very different architectural bets about what matters most.
The comparison runs against a concrete test harness: a RAG chatbot over a 50K-document internal knowledge base, GPT-4o as the generator, FAISS as the retriever, and a 200-triple eval dataset covering factual recall, refusal behavior, and citation accuracy. Every claim about platform behavior below comes from running that harness, not from reading documentation.
A software engineer who merges a PR without running pytest gets flagged in code review. An ML engineer who ships a new prompt template without running an eval suite gets a production incident three days later. The analogy is exact: an eval suite is a test suite, a drop in faithfulness score is a failing test, and a CI gate that blocks deployment on score regression is the equivalent of a red build.
The three dimensions that separate useful LLM evaluation tools from expensive dashboards are: (1) automated scoring quality, specifically how well LLM-as-judge rubrics correlate with human judgment; (2) CI/CD integration depth, meaning how little glue code you write to block a bad deploy; and (3) human annotation workflow, which determines whether domain experts can review model outputs at a useful throughput. A fourth axis, self-host economics, becomes decisive for teams in regulated industries.
No two of these platforms are equivalent across all four axes. Picking the wrong one does not just waste money: it shapes how your team thinks about quality, and that shapes which regressions you catch.
The 200-triple eval dataset is the asset. The platform is infrastructure around it. Each triple contains a question, one to five retrieved context chunks, a ground-truth answer, and metadata fields for question type and source document category. Question types break down roughly into factoid lookups, multi-hop reasoning questions that require synthesizing two or more documents, and out-of-scope questions where the correct behavior is a refusal. Ground-truth answers were produced by two domain experts working independently, with disagreements resolved by a third reviewer.
The data schema every platform ingests looks like this:
{
  "question": "What is the maximum retention period for audit logs under policy v3.2?",
  "context_chunks": [
    "Section 4.1: Audit logs must be retained for a minimum of 90 days...",
    "Appendix B: Extended retention applies to logs flagged for legal hold..."
  ],
  "ground_truth_answer": "90 days standard; indefinite for legal hold cases.",
  "metadata": {
    "question_type": "factoid",
    "source_category": "compliance",
    "difficulty": "medium"
  }
}
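Since the dataset is the asset, it is worth validating records before they reach any platform. The following is a minimal sketch, not part of the original harness: the `EvalTriple` and `parse_triple` names are hypothetical, and the question-type labels other than `factoid` are assumed spellings for the multi-hop and out-of-scope categories described above.

```python
from dataclasses import dataclass, field

# Assumed label set: only "factoid" appears in the sample record.
ALLOWED_QUESTION_TYPES = {"factoid", "multi_hop", "out_of_scope"}


@dataclass
class EvalTriple:
    question: str
    context_chunks: list[str]
    ground_truth_answer: str
    metadata: dict = field(default_factory=dict)


def parse_triple(record: dict) -> EvalTriple:
    """Validate one raw record and return a typed triple, or raise ValueError."""
    chunks = record.get("context_chunks", [])
    if not 1 <= len(chunks) <= 5:
        raise ValueError(f"expected 1-5 context chunks, got {len(chunks)}")
    for key in ("question", "ground_truth_answer"):
        if not str(record.get(key, "")).strip():
            raise ValueError(f"missing or empty field: {key}")
    metadata = record.get("metadata", {})
    question_type = metadata.get("question_type")
    if question_type not in ALLOWED_QUESTION_TYPES:
        raise ValueError(f"unknown question_type: {question_type!r}")
    return EvalTriple(
        question=record["question"],
        context_chunks=list(chunks),
        ground_truth_answer=record["ground_truth_answer"],
        metadata=dict(metadata),
    )
```

Failing loudly on a malformed record is a deliberate choice here: a silently dropped example changes the denominator of every score computed afterwards.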
Adapter differences between platforms are minor. Braintrust expects a flat dict with a named output field. Langfuse attaches scores to trace IDs. Arize ingests via an OpenTelemetry-compatible span format. The schema above maps to all three with fewer than twenty lines of transformation code.
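To give a sense of how small those adapters are, here is a rough sketch of the three mappings as plain dict transforms. The function names are hypothetical, and the target key names (`input`, `expected`, `expected_output`, and the dot-separated span attribute keys) are assumptions about each platform's record shape that should be checked against the SDK version you install.

```python
def _question_input(triple: dict) -> dict:
    """Shared input payload: the question plus its retrieved chunks."""
    return {
        "question": triple["question"],
        "context_chunks": triple["context_chunks"],
    }


def to_braintrust_record(triple: dict) -> dict:
    # Assumed shape: one row with input / expected / metadata keys.
    return {
        "input": _question_input(triple),
        "expected": triple["ground_truth_answer"],
        "metadata": triple["metadata"],
    }


def to_langfuse_item(triple: dict) -> dict:
    # Assumed shape: keyword arguments for a dataset-item creation call.
    return {
        "input": _question_input(triple),
        "expected_output": triple["ground_truth_answer"],
        "metadata": triple["metadata"],
    }


def to_span_attributes(triple: dict) -> dict:
    # Assumed shape: flat, dot-separated keys in the style of OTEL span attributes.
    attrs = {"input.value": triple["question"]}
    for i, chunk in enumerate(triple["context_chunks"]):
        attrs[f"retrieval.documents.{i}.document.content"] = chunk
    for key, value in triple["metadata"].items():
        attrs[f"metadata.{key}"] = value
    return attrs
```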
Before touching any platform, baseline scores were computed by running the RAGAS library directly against the 200-triple dataset. The baseline gives a cross-check against whatever each platform reports.
| Metric | Baseline (RAGAS direct) | Question types covered |
|---|---|---|
| Faithfulness | 0.74 | All 200 |
| Answer Relevancy | 0.81 | All 200 |
| Context Precision | 0.68 | Factoid + multi-hop (160) |
One practical note on iteration speed: running LLM-judge calls across 200 examples repeatedly gets expensive and slow if you use GPT-4o as the judge model. Groq's inference API, which runs on custom Language Processing Units, serves open-weight models at much lower latency. For tight eval loops during development, routing judge calls through Groq rather than OpenAI cuts wall-clock time for a full 200-example run from several minutes to under a minute. Because Groq does not serve GPT-4o, this also means switching judge models, so spot-check the new judge against the RAGAS baseline before trusting its scores. The speed difference matters when you are iterating on rubric definitions.
Honest caveat: 200 examples is enough to detect large regressions reliably, but the confidence intervals on absolute score values are wide. A faithfulness score of 0.74 versus 0.71 is not a meaningful difference at this sample size. Use these evals to catch directional regressions, not to make fine-grained comparisons between prompt variants.
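A back-of-the-envelope calculation shows why. The sketch below treats each example as a 0/1 outcome, which is an assumption: RAGAS scores are continuous on [0, 1], and for any such score the binary case is the worst-case variance, so the interval it produces is a conservative upper bound rather than the exact one. The function names are illustrative.

```python
import math


def score_standard_error(mean_score: float, n: int) -> float:
    """Standard error of a mean score, treating each example as a 0/1 outcome."""
    return math.sqrt(mean_score * (1.0 - mean_score) / n)


def ci95(mean_score: float, n: int) -> tuple[float, float]:
    """Approximate 95% confidence interval using the normal approximation."""
    half_width = 1.96 * score_standard_error(mean_score, n)
    return (mean_score - half_width, mean_score + half_width)


low, high = ci95(0.74, 200)
print(f"faithfulness 0.74 at n=200: {low:.3f} to {high:.3f}")
```

Under that assumption the interval around 0.74 runs from roughly 0.68 to 0.80, so a reading of 0.71 sits comfortably inside it.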
Braintrust's core abstraction is the Experiment: a versioned run of your eval dataset against a specific prompt and model combination, with scores stored and automatically diffed against a named baseline. The diff view is genuinely useful — you see not just aggregate score changes but which specific examples regressed, which is where debugging actually happens.
Defining a custom rubric for citation accuracy looks like this:
from braintrust import Eval
from autoevals import LLMClassifier

# LLM-as-judge rubric: a YES verdict scores 1.0, a NO verdict scores 0.0
citation_accuracy = LLMClassifier(
    name="CitationAccuracy",
    prompt_template="""
Given the answer and the context chunks below, does the answer
only make claims that are directly supported by the context?
Answer YES or NO and explain briefly.
Context: {{context_chunks}}
Answer: {{output}}
""",
    choice_scores={"YES": 1.0, "NO": 0.0},
    use_cot=True,
)

# load_eval_dataset and run_rag_pipeline are the harness's own helpers, defined elsewhere
Eval(
    "rag-citation-eval",
    data=lambda: load_eval_dataset(),
    task=lambda input: run_rag_pipeline(input["question"]),
    scores=[citation_accuracy],
)
The built-in autoevals library covers faithfulness, answer correctness, and several other standard rubrics out of the box. For RAG-specific dimensions like citation accuracy or retrieval attribution, you write a Python function returning a float between 0 and 1. The ergonomics are clean — this is the strongest part of Braintrust's product.
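As an illustration of the "function returning a float" style, here is a small deterministic scorer. It is a sketch under a stated assumption that the article does not confirm: answers cite chunks with 1-based bracket markers such as `[2]`. The function name and parameters are hypothetical, and the exact argument names a platform passes to custom scorers should be taken from its documentation.

```python
import re


def citation_index_validity(output: str, context_chunks: list[str]) -> float:
    """Fraction of [n] citation markers that point at a retrieved chunk.

    Returns 0.0 when the answer contains no citation markers at all.
    """
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", output)]
    if not cited:
        return 0.0
    valid = sum(1 for n in cited if 1 <= n <= len(context_chunks))
    return valid / len(cited)
```

A check like this only catches citations that point nowhere. Whether a cited chunk actually supports the claim still needs the LLM-judge rubric, so the two are complementary rather than interchangeable.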
Braintrust's CLI exits non-zero when a named score drops below a threshold, making it drop-in compatible with GitHub Actions. A minimal workflow step looks like:
- name: Run RAG eval suite
  run: |
    npx braintrust eval src/evals/rag_eval.ts \
      --threshold CitationAccuracy=0.70 \
      --threshold Faithfulness=0.72
  env:
    BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
Human annotation in Braintrust is functional but clearly secondary. The UI supports label overrides and reviewer comments, but the workflow is optimized for developer iteration, not for routing tasks to a team of domain expert annotators. If your annotation process involves assignment queues, inter-annotator agreement tracking, or reconciliation workflows, Braintrust will frustrate you. The pricing model is usage-based on logged events, and self-hosting is not officially supported as of mid-2024 — a hard constraint for teams with data residency requirements.
Langfuse's primary abstraction is the Trace, not the Experiment. Every LLM call, retrieval step, and reranking operation is logged as a nested span. Evals are scores that attach to a trace or to a specific span, which means you can score the retrieval step independently from the generation step. For RAG debugging, this is architecturally important: a faithfulness failure might originate in retrieval (wrong chunks surfaced) or in generation (model hallucinated despite correct chunks). Span-level scoring lets you localize the failure.
Instrumenting the RAG pipeline with the Langfuse Python SDK:
from langfuse.decorators import observe, langfuse_context

# faiss_retriever, gpt4o_client, and compute_faithfulness are harness helpers defined elsewhere

@observe(name="retriever")
def retrieve_chunks(question: str) -> list[str]:
    return faiss_retriever.query(question, top_k=5)

@observe(name="generator")
def generate_answer(question: str, chunks: list[str]) -> str:
    answer = gpt4o_client.complete(question, context=chunks)
    # Score here, inside the generator function, so the score attaches to the
    # generator span. Calling this from rag_pipeline would score the parent span.
    langfuse_context.score_current_observation(
        name="faithfulness",
        value=compute_faithfulness(answer, chunks),
    )
    return answer

@observe(name="rag-pipeline")
def rag_pipeline(question: str) -> str:
    chunks = retrieve_chunks(question)
    return generate_answer(question, chunks)
The waterfall view in the Langfuse UI renders retriever latency, generator latency, and token counts as a timeline — familiar to anyone who has used distributed tracing tools. This makes it straightforward to correlate a faithfulness drop with a retriever latency spike, which often signals index issues rather than prompt issues.
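Once retrieval and generation carry separate scores, localizing a failure becomes a simple classification. The sketch below assumes the two scores have already been exported per example. The function names, the idea of using a retrieval-quality score alongside faithfulness, and the 0.5 cutoff are all illustrative assumptions rather than anything Langfuse provides out of the box.

```python
from collections import Counter


def localize_failure(retrieval_score: float, generation_score: float,
                     threshold: float = 0.5) -> str:
    """Classify one example by which span's score fell below the threshold."""
    retrieval_ok = retrieval_score >= threshold
    generation_ok = generation_score >= threshold
    if retrieval_ok and generation_ok:
        return "pass"
    if not retrieval_ok and generation_ok:
        return "retrieval"
    if retrieval_ok and not generation_ok:
        return "generation"
    return "both"


def summarize_failures(score_pairs: list[tuple[float, float]]) -> Counter:
    """Count examples per failure category across an eval run."""
    return Counter(localize_failure(r, g) for r, g in score_pairs)
```

A run dominated by the `retrieval` bucket points at the index or the chunking strategy, while a run dominated by `generation` points at the prompt or the model.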
Langfuse is MIT licensed and ships a Docker Compose stack that runs Postgres, ClickHouse for analytics, and the Next.js application. For a team logging tens of millions of spans per month, the ClickHouse instance becomes the dominant infrastructure cost — both in compute and in operational attention. ClickHouse is powerful but not trivial to operate, and teams without dedicated infrastructure engineers should budget time for it.
Teams that already use Kestra for workflow orchestration can treat Langfuse eval pipeline jobs as scheduled Kestra flows — nightly full eval runs against production traces, for example, fit naturally into Kestra's DAG model without custom cron infrastructure. Teams managing cloud deployments with Humanitec can register the Langfuse stack as an internal developer platform service, reducing the operational burden on individual ML teams who just need a working eval backend.
The honest criticism: Langfuse's built-in LLM-as-judge rubrics are less opinionated than Braintrust's. You get flexibility, but you write significantly more boilerplate to assemble an equivalent scoring pipeline. The dataset management UI is also noticeably rougher than Braintrust's — version diffing and experiment comparison require more manual work.
Arize approaches LLM evaluation with a background in traditional ML monitoring, and it shows. Phoenix (the open-source product) focuses on trace observability and embedding drift detection. The cloud Arize platform adds production monitoring, A/B experiment tracking, and annotation queues with assignment routing.
The embedding drift feature is a genuinely different signal from rubric scores. Phoenix can cluster your RAG query embeddings and flag when production queries drift away from the distribution your eval dataset covers. This catches unknown unknowns — query types your eval dataset does not represent, which means your rubric scores look fine while real user queries are failing silently. No amount of LLM-as-judge scoring catches this; it requires looking at the embedding space.
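The underlying idea can be approximated without any platform. The sketch below is a deliberately simplified stand-in, not Phoenix's method: it flags production queries whose nearest eval-set query falls below a cosine-similarity cutoff. The function names and the 0.8 cutoff are assumptions, and the embeddings are assumed to come from whatever model your retriever already uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def uncovered_queries(prod_embeddings: list[list[float]],
                      eval_embeddings: list[list[float]],
                      min_similarity: float = 0.8) -> list[int]:
    """Indices of production queries with no sufficiently similar eval query."""
    flagged = []
    for i, prod_vec in enumerate(prod_embeddings):
        # default=-1.0 means an empty eval set flags every production query
        best = max(
            (cosine_similarity(prod_vec, eval_vec) for eval_vec in eval_embeddings),
            default=-1.0,
        )
        if best < min_similarity:
            flagged.append(i)
    return flagged
```

The flagged indices are candidates for labeling: queries the eval dataset says nothing about, regardless of how good the rubric scores look.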
Phoenix ingests traces via an OpenTelemetry-compatible SDK, which is a real advantage for teams already instrumenting services with standard OTEL tooling:
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

px.launch_app()  # start a local Phoenix instance
tracer_provider = register()  # OTEL tracer provider that exports spans to Phoenix
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# OpenAI client calls are now captured as OTEL spans
# No Arize-specific decorators required
The OTEL compatibility means the instrumentation is portable. If you switch from Phoenix to a different backend, the trace emission code does not change.
Arize's annotation workflow is the most mature of the three for teams with dedicated QA or domain expert reviewers. It supports assignment queues, inter-annotator agreement tracking, and label reconciliation — features that Braintrust and Langfuse treat as afterthoughts. If your quality process involves routing ambiguous outputs to specific subject matter experts and reconciling disagreements, this is where Arize Cloud justifies its cost.
Teams building voice or multimodal RAG pipelines using tools like Voiceflow for conversation design will find Arize's span-level scoring more adaptable to non-text modalities than Braintrust's experiment-centric model. A voice turn is a span; a document retrieval is a span; scoring can attach to either without restructuring the pipeline.
The honest criticism: the boundary between open-source Phoenix and paid Arize Cloud is not always clear in the documentation. Several features you would expect to find in the OSS tool — particularly around annotation workflow and experiment comparison at scale — require the cloud product. Budget time to map that boundary before committing to Phoenix as a self-hosted solution.
| Dimension | Braintrust | Langfuse | Arize Phoenix (OSS) | Arize Cloud |
|---|---|---|---|---|
| LLM-as-judge rubric quality | Strong — opinionated defaults, autoevals library included | Adequate — flexible but requires boilerplate | Adequate — Phoenix evals library covers basics | Strong — adds rubric templates and managed scoring |
| Custom scorer ergonomics | Strong — Python function returning float, minimal ceremony | Adequate — SDK supports custom scores, less structured | Adequate — OTEL-compatible, more setup required | Strong — UI-assisted rubric builder |
| CI/CD gate support | Strong — CLI exits non-zero on threshold breach, one command | Adequate — requires custom Python script to implement pass/fail | Weak — no native CI gate; API polling required | Adequate — API available, still requires glue code |
| Trace/span observability depth | Adequate — experiment-level, not span-level | Strong — full span waterfall, retriever vs. generator attribution | Strong — OTEL-native, embedding cluster views | Strong — adds production drift alerting |
| Human annotation workflow | Weak — basic label override, no queue management | Adequate — human review queue exists, limited routing | Weak — minimal in OSS version | Strong — assignment queues, IAA tracking, reconciliation |
| Dataset versioning | Strong — first-class experiment diffing | Adequate — dataset management UI is rough | Adequate — dataset management via API | Strong — managed datasets with version history |
| Self-host availability | Weak — not officially supported | Strong — Docker Compose, MIT licensed | Strong — lightweight, SQLite or Postgres | Weak — cloud only |
| Open-source license | No | Yes (MIT) | Yes (Apache 2.0) | No |
| Pricing model | Usage-based on logged events | Free OSS; cloud tier available | Free OSS | Enterprise pricing |
| RAG-specific: retrieval span scoring | Weak — no native retriever span concept | Strong — score any span independently | Strong — OTEL spans map directly to retriever/generator | Strong — adds retriever performance dashboards |
| RAG-specific: context window visibility | Adequate — logged in experiment output | Strong — rendered in span detail view | Strong — chunk-level visibility in trace UI | Strong |
| RAG-specific: retriever vs. generator attribution | Weak — requires manual score separation | Strong — architectural default | Strong — architectural default | Strong |
Table reflects platform behavior as tested against the 50K-document RAG harness described above. Ratings apply to RAG pipeline evaluation specifically, not general-purpose LLM applications.
A GitHub Actions workflow that runs the 200-example eval suite on every PR to the prompt configuration looks different across all three platforms, and the differences matter at 2am when a deploy is blocked and you need to understand why.
Braintrust has the cleanest integration. One CLI command, a threshold flag per metric, and the job exits non-zero on failure. The eval artifact in the CI log includes a URL to the full experiment diff in the Braintrust UI. Total glue code: zero lines beyond the eval definition itself.
Langfuse requires a short Python script that queries the Langfuse API after the eval run, computes pass/fail against your thresholds, and exits with the appropriate code. This is not burdensome, but it is glue code you own and maintain. The upside is that you have full control over the pass/fail logic — composite scoring rules, weighted thresholds, and exception handling for specific question types are all straightforward to implement.
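The pass/fail half of that script is small. Here is a minimal sketch of it, with the step that fetches scores from the platform deliberately left out because it depends on your SDK version. The function names and the metric keys are hypothetical; the threshold values echo the ones used in the Braintrust example.

```python
import sys


def failing_metrics(scores: dict[str, float],
                    thresholds: dict[str, float]) -> list[str]:
    """Names of metrics that are missing or below their threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None or value < minimum:
            failures.append(name)
    return failures


def gate_exit_code(scores: dict[str, float],
                   thresholds: dict[str, float]) -> int:
    """Return 0 if every threshold is met, 1 otherwise, logging each failure."""
    failures = failing_metrics(scores, thresholds)
    for name in failures:
        print(f"FAIL {name}: {scores.get(name)} < {thresholds[name]}", file=sys.stderr)
    return 1 if failures else 0


THRESHOLDS = {"faithfulness": 0.72, "citation_accuracy": 0.70}
```

In the CI job, the script would end with `sys.exit(gate_exit_code(scores, THRESHOLDS))`, where `scores` holds the aggregated values pulled from the eval run. Treating a missing metric as a failure is a design choice: it stops a broken scorer from silently passing the gate.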
Arize requires the most setup for a CI gate but produces the richest experiment metadata in the CI artifact. The tradeoff is worth it for teams that need detailed regression reports, less so for teams that just need a green/red signal.
The latency of running 200 LLM-judge calls in CI is a real bottleneck regardless of platform. At typical hosted model speeds, a full eval run can take long enough to become a developer experience problem. Routing judge model calls through Groq reduces this substantially — the speed difference between Groq's LPU-based inference and standard hosted endpoints is significant enough that it changes whether developers actually wait for the eval result or skip it. That behavioral change has quality consequences.
Teams using Kestra for pipeline orchestration can separate the PR-gate eval (fast, 200 examples, judge via Groq) from a nightly comprehensive eval (all production traces from the past 24 hours, slower judge model, full RAGAS suite). Kestra's scheduling and dependency management handle this separation cleanly without custom cron infrastructure.
One thing none of these platforms solve automatically: eval dataset drift. As your RAG pipeline evolves to cover new document categories or question types, the 200-triple dataset becomes stale. That is a human process — someone has to review production failures, identify new failure modes, and add labeled examples. The platforms can surface candidates for labeling, but the labeling decision requires domain judgment.
For teams in regulated industries, self-hosting is not a preference — it is a requirement. Langfuse and Arize Phoenix both have documented self-host paths. Braintrust does not, which eliminates it from consideration for these teams before any other evaluation criteria matter.
Self-hosting Langfuse means operating Postgres, ClickHouse, and the Next.js application. At low trace volumes, this is manageable. At tens of millions of spans per month, ClickHouse becomes the dominant cost driver — both in cloud compute and in engineering attention. ClickHouse is not a database that runs itself. Teams without infrastructure engineers who have operated columnar stores should factor that operational burden into the decision.
Phoenix self-hosting is lighter. SQLite works for development; Postgres is the production recommendation. There is no ClickHouse dependency, which means the analytics capabilities are more limited but the operational surface is much smaller. Phoenix is a good fit for teams that want local observability during development and are willing to accept a cloud platform for production-scale analytics.
Teams managing cloud infrastructure with Humanitec can register Langfuse or Phoenix as internal developer platform services, giving individual ML teams a working eval backend without requiring each team to own its deployment. The internal platform team manages the infrastructure; the ML teams consume an API. This pattern scales well across organizations with multiple product teams running independent RAG pipelines.
The decision framework: if data residency is required, Langfuse or Phoenix. If annotation team throughput is the bottleneck, Arize Cloud. If developer iteration speed and CI integration are the priority and data residency is not a constraint, Braintrust.
Three team archetypes cover most of the decision space:
Archetype 1: Small ML engineering team, shipping fast, no dedicated QA staff. Braintrust. The experiment-centric model matches the workflow of a team where the same engineers who write prompts also review eval results. The CI integration requires minimal glue code. Accept the self-host limitation and the weaker annotation workflow — you are not using those features anyway.
Archetype 2: Mid-size team with domain expert annotators and a compliance requirement. Langfuse self-hosted, or Arize Cloud if annotation workflow richness justifies the cost. The MIT license and data residency control are non-negotiable for this profile. If your annotators need assignment queues and inter-annotator agreement tracking, Arize Cloud's annotation features are meaningfully better than Langfuse's. If your annotators can work with a simpler queue, Langfuse self-hosted keeps data on your infrastructure at lower cost.
Archetype 3: Platform team building eval infrastructure for multiple product teams. Langfuse or Arize, because both expose APIs that allow a platform team to build standardized eval pipelines that individual product teams consume without owning the infrastructure. Braintrust's experiment model is harder to abstract into a shared service that multiple teams use with different datasets and rubrics.
Before picking any platform, instrument your RAG pipeline to emit structured traces with retriever and generator spans separated. Every platform discussed here produces better signal from better-structured data — and that instrumentation work is fully portable if you switch tools later. The traces you emit today are the eval dataset you debug with tomorrow.
Data science practitioner and technical writer. Covers analytics, ML tooling, and the data infrastructure stack.
AI software insights, comparisons, and industry analysis from the TopReviewed team.