LLM Evaluation Tools Compared: Braintrust vs Langfuse vs Arize for Real RAG Pipelines

Most teams ship LLM features without a single automated quality gate — then wonder why production behavior drifts from evals. This comparison runs Braintrust, Langfuse, and Arize through a concrete RAG chatbot harness, measuring where each platform earns its keep and where it breaks down.

Most LLM teams treat evaluation as something that happens after deployment — a manual spot-check, a few thumbs-up from stakeholders, and a prayer. That is the wrong order. A faithfulness regression in your RAG pipeline is a bug, and it deserves the same automated detection as a null pointer exception. The three platforms compared here — Braintrust, Langfuse, and Arize Phoenix — all attempt to bring that discipline to LLM evaluation, but they make very different architectural bets about what matters most.

The comparison runs against a concrete test harness: a RAG chatbot over a 50K-document internal knowledge base, GPT-4o as the generator, FAISS as the retriever, and a 200-triple eval dataset covering factual recall, refusal behavior, and citation accuracy. Every claim about platform behavior below comes from running that harness, not from reading documentation.

Why Evals Deserve the Same Rigor as Unit Tests

A software engineer who merges a PR without running pytest gets flagged in code review. An ML engineer who ships a new prompt template without running an eval suite gets a production incident three days later. The analogy is close: an eval suite is a test suite, a drop in faithfulness score is a failing test, and a CI gate that blocks deployment on score regression is the equivalent of a red build.

The three dimensions that separate useful LLM evaluation tools from expensive dashboards are: (1) automated scoring quality, specifically how well LLM-as-judge rubrics correlate with human judgment; (2) CI/CD integration depth, meaning how little glue code you write to block a bad deploy; and (3) human annotation workflow, which determines whether domain experts can review model outputs at throughput. A fourth axis, self-host economics, becomes decisive for teams in regulated industries.

No platform is strongest on all four axes. Picking the wrong one does not just waste money: it shapes how your team thinks about quality, and that shapes what regressions you catch.

The Test Harness: A RAG Chatbot with Teeth

Dataset construction and labeling protocol

The 200-triple eval dataset is the asset. The platform is infrastructure around it. Each triple pairs a question and its one to five retrieved context chunks with a ground-truth answer, and also carries metadata fields for question type and source document category. Question types break down roughly into factoid lookups, multi-hop reasoning questions that require synthesizing two or more documents, and out-of-scope questions where the correct behavior is a refusal. Ground-truth answers were produced by two domain experts working independently, with disagreements resolved by a third reviewer.

The data schema every platform ingests looks like this:

{
  "question": "What is the maximum retention period for audit logs under policy v3.2?",
  "context_chunks": [
    "Section 4.1: Audit logs must be retained for a minimum of 90 days...",
    "Appendix B: Extended retention applies to logs flagged for legal hold..."
  ],
  "ground_truth_answer": "90 days standard; indefinite for legal hold cases.",
  "metadata": {
    "question_type": "factoid",
    "source_category": "compliance",
    "difficulty": "medium"
  }
}

Adapter differences between platforms are minor. Braintrust expects one record per example with input, expected, and metadata fields. Langfuse attaches scores to trace IDs. Arize ingests via an OpenTelemetry-compatible span format. The schema above maps to all three with fewer than twenty lines of transformation code.
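The adapters described above can be sketched as plain functions that reshape one eval triple. This is an illustration under stated assumptions, not the harness's actual code: the function names are hypothetical, the Braintrust and Langfuse field names follow the record shapes those SDKs document for datasets, and the span attribute keys are modeled on the OpenInference conventions, so check them against the versions you install.

```python
from typing import Any


def to_braintrust_record(triple: dict[str, Any]) -> dict[str, Any]:
    """Map one eval triple to an input/expected/metadata record."""
    return {
        "input": {
            "question": triple["question"],
            "context_chunks": triple["context_chunks"],
        },
        "expected": triple["ground_truth_answer"],
        "metadata": triple["metadata"],
    }


def to_langfuse_item(triple: dict[str, Any]) -> dict[str, Any]:
    """Map one eval triple to keyword arguments for a dataset item."""
    return {
        "input": {
            "question": triple["question"],
            "context_chunks": triple["context_chunks"],
        },
        "expected_output": triple["ground_truth_answer"],
        "metadata": triple["metadata"],
    }


def to_span_attributes(triple: dict[str, Any]) -> dict[str, str]:
    """Flatten one triple into OTEL-style span attributes (flat string keys)."""
    attributes = {"input.value": triple["question"]}
    for index, chunk in enumerate(triple["context_chunks"]):
        # Assumed key pattern, modeled on OpenInference retrieval attributes.
        attributes[f"retrieval.documents.{index}.document.content"] = chunk
    return attributes
```

Keeping the adapters as pure functions over the canonical schema means the dataset file stays platform-neutral, which matters if you later switch tools.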

Baseline metrics before any eval platform

Before touching any platform, baseline scores were computed by running the RAGAS library directly against the 200-triple dataset. The baseline gives a cross-check against whatever each platform reports.

Metric	Baseline (RAGAS direct)	Question types covered
Faithfulness	0.74	All 200
Answer Relevancy	0.81	All 200
Context Precision	0.68	Factoid + multi-hop (160)

Baseline RAGAS scores computed directly, before any platform instrumentation. These anchor the comparison — platform-reported numbers that diverge significantly from these warrant investigation into scorer implementation differences.

One practical note on iteration speed: running LLM-judge calls across 200 examples repeatedly gets expensive and slow if you use GPT-4o as the judge model. Groq's inference API, which runs on custom Language Processing Units, reduces judge latency dramatically. Groq serves open-weight models rather than GPT-4o, so routing judge calls through it also means changing the judge model; re-check the new judge's agreement with your human labels before trusting its scores. For tight eval loops during development, routing judge calls through Groq rather than OpenAI cuts wall-clock time for a full 200-example run from several minutes to under a minute. That difference matters when you are iterating on rubric definitions.

Honest caveat: 200 examples is enough to detect regressions reliably, but the confidence intervals on absolute score values are wide. A faithfulness score of 0.74 versus 0.71 is not a meaningful difference at this sample size. Use these evals to catch directional regressions, not to make fine-grained comparisons between prompt variants.
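To put a rough number on "wide", here is a small sketch of the normal-approximation confidence interval for a mean score. It assumes each example is scored 0 or 1; for graded scores in [0, 1] the variance can only be smaller, so the result is an upper bound on the half-width. The function name is illustrative.

```python
import math


def binary_score_ci_half_width(mean_score: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation CI for the mean of n 0/1 scores."""
    return z * math.sqrt(mean_score * (1.0 - mean_score) / n)


half_width = binary_score_ci_half_width(0.74, 200)
print(f"faithfulness 0.74 +/- {half_width:.3f}")  # prints: faithfulness 0.74 +/- 0.061
```

At 200 examples the 95% interval spans roughly 0.68 to 0.80, so a move from 0.74 to 0.71 sits well inside the noise. Comparing per-example scores on the same dataset (a paired comparison) narrows this, which is one reason per-example diffs are more informative than aggregate deltas.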

Braintrust: The Developer-First Eval Workbench

LLM-as-judge rubric configuration

Braintrust's core abstraction is the Experiment: a versioned run of your eval dataset against a specific prompt and model combination, with scores stored and automatically diffed against a named baseline. The diff view is genuinely useful — you see not just aggregate score changes but which specific examples regressed, which is where debugging actually happens.

Defining a custom rubric for citation accuracy looks like this:

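from braintrust import Eval
from autoevals import LLMClassifier

# load_eval_dataset and run_rag_pipeline are the harness's own helpers.
# Assumption: load_eval_dataset() returns records shaped like
# {"input": {"question": ..., "context_chunks": [...]}, "expected": ...}.

citation_accuracy = LLMClassifier(
    name="CitationAccuracy",
    # Template variables are filled from the scorer arguments (input, output,
    # expected), so the chunks are referenced through the input record.
    prompt_template="""
    Given the answer and the context chunks below, does the answer
    only make claims that are directly supported by the context?
    Answer YES or NO and explain briefly.
    Context: {{input.context_chunks}}
    Answer: {{output}}
    """,
    choice_scores={"YES": 1.0, "NO": 0.0},
    use_cot=True
)

Eval(
    "rag-citation-eval",
    data=load_eval_dataset,
    task=lambda input: run_rag_pipeline(input["question"]),
    scores=[citation_accuracy]
)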

The built-in autoevals library covers faithfulness, answer correctness, and several other standard rubrics out of the box. For RAG-specific dimensions like citation accuracy or retrieval attribution, you write a Python function returning a float between 0 and 1. The ergonomics are clean — this is the strongest part of Braintrust's product.
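A custom scorer of the kind described above might look like the following sketch. The heuristic (word overlap between each answer sentence and the retrieved chunks) is a deliberately crude stand-in chosen for illustration, not the rubric used in the harness, and both function names are hypothetical. The wrapper assumes scorers are called with input, output, and expected arguments and that the chunks live under input["context_chunks"].

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def supported_sentence_ratio(
    output: str, context_chunks: list[str], min_overlap: float = 0.6
) -> float:
    """Fraction of answer sentences whose words mostly appear in the context."""
    context_words = _words(" ".join(context_chunks))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", output.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _words(sentence)
        if words and len(words & context_words) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)


def citation_support(input, output, expected=None, **kwargs) -> float:
    """Scorer-shaped wrapper: returns a float between 0 and 1."""
    return supported_sentence_ratio(output, input["context_chunks"])
```

A lexical heuristic like this is cheap enough to run on every example and is useful as a sanity check next to the LLM judge; it will miss paraphrased but supported claims, so it should not replace the rubric.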

CI/CD gate integration via the SDK

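Braintrust's CLI exits non-zero when a named score drops below a threshold, making it drop-in compatible with GitHub Actions. The step below runs a TypeScript eval entry point; a Python eval like the one above runs through the pip-installed braintrust eval command instead. Confirm the threshold flags against the help output of the CLI version you install. A minimal workflow step looks like: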

- name: Run RAG eval suite
  run: |
    npx braintrust eval src/evals/rag_eval.ts \
      --threshold CitationAccuracy=0.70 \
      --threshold Faithfulness=0.72
  env:
    BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}

Human annotation in Braintrust is functional but clearly secondary. The UI supports label overrides and reviewer comments, but the workflow is optimized for developer iteration, not for routing tasks to a team of domain expert annotators. If your annotation process involves assignment queues, inter-annotator agreement tracking, or reconciliation workflows, Braintrust will frustrate you. The pricing model is usage-based on logged events, and self-hosting is not officially supported as of mid-2024 — a hard constraint for teams with data residency requirements.

Langfuse: Observability-First with Eval Bolted On

Tracing architecture and how evals attach to traces

Langfuse's primary abstraction is the Trace, not the Experiment. Every LLM call, retrieval step, and reranking operation is logged as a nested span. Evals are scores that attach to a trace or to a specific span, which means you can score the retrieval step independently from the generation step. For RAG debugging, this is architecturally important: a faithfulness failure might originate in retrieval (wrong chunks surfaced) or in generation (model hallucinated despite correct chunks). Span-level scoring lets you localize the failure.
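The localization logic can be sketched as a small triage function over two span-level scores. The function name and both thresholds are hypothetical, and the rule is a simplification: it blames retrieval first, on the reasoning that generation cannot be expected to produce a useful grounded answer from the wrong chunks.

```python
def localize_failure(
    context_precision: float,
    faithfulness: float,
    retrieval_floor: float = 0.5,
    generation_floor: float = 0.7,
) -> str:
    """Attribute a bad example to the retriever, the generator, or neither."""
    if context_precision < retrieval_floor:
        # Wrong chunks surfaced: fix the index or the query before the prompt.
        return "retrieval"
    if faithfulness < generation_floor:
        # Chunks were relevant but the answer strayed from them.
        return "generation"
    return "ok"
```

Running this over a failing eval batch and counting the labels gives a quick view of whether the next fix belongs in the index or in the prompt.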

Instrumenting the RAG pipeline with the Langfuse Python SDK:

from langfuse.decorators import observe, langfuse_context  # decorator API of the v2 Python SDK

# faiss_retriever, gpt4o_client, and compute_faithfulness are the harness's own helpers.

@observe(name="retriever")
def retrieve_chunks(question: str) -> list[str]:
    return faiss_retriever.query(question, top_k=5)

@observe(name="generator")
def generate_answer(question: str, chunks: list[str]) -> str:
    answer = gpt4o_client.complete(question, context=chunks)
    # Score inside the generator function so the score attaches to the
    # generator span rather than to the enclosing rag-pipeline observation.
    langfuse_context.score_current_observation(
        name="faithfulness",
        value=compute_faithfulness(answer, chunks)
    )
    return answer

@observe(name="rag-pipeline")
def rag_pipeline(question: str) -> str:
    chunks = retrieve_chunks(question)
    return generate_answer(question, chunks)

The waterfall view in the Langfuse UI renders retriever latency, generator latency, and token counts as a timeline — familiar to anyone who has used distributed tracing tools. This makes it straightforward to correlate a faithfulness drop with a retriever latency spike, which often signals index issues rather than prompt issues.

Self-host deployment and operational cost

Langfuse is MIT licensed and ships a Docker Compose stack that runs Postgres, ClickHouse for analytics, and the Next.js application. For a team logging tens of millions of spans per month, the ClickHouse instance becomes the dominant infrastructure cost — both in compute and in operational attention. ClickHouse is powerful but not trivial to operate, and teams without dedicated infrastructure engineers should budget time for it.

Teams that already use Kestra for workflow orchestration can treat Langfuse eval pipeline jobs as scheduled Kestra flows — nightly full eval runs against production traces, for example, fit naturally into Kestra's DAG model without custom cron infrastructure. Teams managing cloud deployments with Humanitec can register the Langfuse stack as an internal developer platform service, reducing the operational burden on individual ML teams who just need a working eval backend.

The honest criticism: Langfuse's built-in LLM-as-judge rubrics are less opinionated than Braintrust's. You get flexibility, but you write significantly more boilerplate to assemble an equivalent scoring pipeline. The dataset management UI is also noticeably rougher than Braintrust's — version diffing and experiment comparison require more manual work.

Arize Phoenix: The MLOps Veteran's Approach

Embedding-based drift detection vs. rubric scoring

Arize approaches LLM evaluation with a background in traditional ML monitoring, and it shows. Phoenix (the open-source product) focuses on trace observability and embedding drift detection. The cloud Arize platform adds production monitoring, A/B experiment tracking, and annotation queues with assignment routing.

The embedding drift feature is a genuinely different signal from rubric scores. Phoenix can cluster your RAG query embeddings and flag when production queries drift away from the distribution your eval dataset covers. This catches unknown unknowns — query types your eval dataset does not represent, which means your rubric scores look fine while real user queries are failing silently. No amount of LLM-as-judge scoring catches this; it requires looking at the embedding space.
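The underlying idea can be illustrated without any platform. The sketch below is a nearest-neighbour coverage check, not Phoenix's clustering implementation: it reports the fraction of production query embeddings that have no sufficiently similar neighbour in the eval dataset. The function names and the 0.8 similarity cut-off are assumptions for illustration, and the brute-force loop is only suitable for small samples.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def uncovered_fraction(
    production: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
    min_similarity: float = 0.8,
) -> float:
    """Share of production embeddings with no reference neighbour above the cut-off."""
    if not production:
        return 0.0
    uncovered = 0
    for query in production:
        best = max((cosine_similarity(query, ref) for ref in reference), default=0.0)
        if best < min_similarity:
            uncovered += 1
    return uncovered / len(production)
```

A rising uncovered fraction over successive days is the signal to go and label new examples, even while rubric scores on the existing dataset stay flat.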

Phoenix ingests traces via an OpenTelemetry-compatible SDK, which is a real advantage for teams already instrumenting services with standard OTEL tooling:

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Start a local Phoenix instance (UI plus trace collector).
px.launch_app()

# Register an OTEL tracer provider that exports spans to Phoenix.
tracer_provider = register()

# Auto-instrument OpenAI client calls; no Arize-specific decorators required.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

The OTEL compatibility means the instrumentation is portable. If you switch from Phoenix to a different backend, the instrumentation calls stay the same; only the exporter or tracer-provider configuration changes.

Human-in-the-loop annotation at scale

Arize's annotation workflow is the most mature of the three for teams with dedicated QA or domain expert reviewers. It supports assignment queues, inter-annotator agreement tracking, and label reconciliation — features that Braintrust and Langfuse treat as afterthoughts. If your quality process involves routing ambiguous outputs to specific subject matter experts and reconciling disagreements, this is where Arize Cloud justifies its cost.

Teams building voice or multimodal RAG pipelines using tools like Voiceflow for conversation design will find Arize's span-level scoring more adaptable to non-text modalities than Braintrust's experiment-centric model. A voice turn is a span; a document retrieval is a span; scoring can attach to either without restructuring the pipeline.

The honest criticism: the boundary between open-source Phoenix and paid Arize Cloud is not always clear in the documentation. Several features you would expect to find in the OSS tool — particularly around annotation workflow and experiment comparison at scale — require the cloud product. Budget time to map that boundary before committing to Phoenix as a self-hosted solution.

Head-to-Head: The Comparison Table

Dimension	Braintrust	Langfuse	Arize Phoenix (OSS)	Arize Cloud
LLM-as-judge rubric quality	Strong — opinionated defaults, autoevals library included	Adequate — flexible but requires boilerplate	Adequate — Phoenix evals library covers basics	Strong — adds rubric templates and managed scoring
Custom scorer ergonomics	Strong — Python function returning float, minimal ceremony	Adequate — SDK supports custom scores, less structured	Adequate — OTEL-compatible, more setup required	Strong — UI-assisted rubric builder
CI/CD gate support	Strong — CLI exits non-zero on threshold breach, one command	Adequate — requires custom Python script to implement pass/fail	Weak — no native CI gate; API polling required	Adequate — API available, still requires glue code
Trace/span observability depth	Adequate — experiment-level, not span-level	Strong — full span waterfall, retriever vs. generator attribution	Strong — OTEL-native, embedding cluster views	Strong — adds production drift alerting
Human annotation workflow	Weak — basic label override, no queue management	Adequate — human review queue exists, limited routing	Weak — minimal in OSS version	Strong — assignment queues, IAA tracking, reconciliation
Dataset versioning	Strong — first-class experiment diffing	Adequate — dataset management UI is rough	Adequate — dataset management via API	Strong — managed datasets with version history
Self-host availability	Weak — not officially supported	Strong — Docker Compose, MIT licensed	Strong — lightweight, SQLite or Postgres	Weak — cloud only
Open-source license	No	Yes (MIT)	Yes (Apache 2.0)	No
Pricing model	Usage-based on logged events	Free OSS; cloud tier available	Free OSS	Enterprise pricing
RAG-specific: retrieval span scoring	Weak — no native retriever span concept	Strong — score any span independently	Strong — OTEL spans map directly to retriever/generator	Strong — adds retriever performance dashboards
RAG-specific: context window visibility	Adequate — logged in experiment output	Strong — rendered in span detail view	Strong — chunk-level visibility in trace UI	Strong
RAG-specific: retriever vs. generator attribution	Weak — requires manual score separation	Strong — architectural default	Strong — architectural default	Strong

Table reflects platform behavior as tested against the 50K-document RAG harness described above. Ratings apply to RAG pipeline evaluation specifically, not general-purpose LLM applications.

CI/CD Integration in Practice: Where Each Platform Earns or Loses Points

A GitHub Actions workflow that runs the 200-example eval suite on every PR to the prompt configuration looks different across all three platforms, and the differences matter at 2am when a deploy is blocked and you need to understand why.

Braintrust has the cleanest integration. One CLI command, a threshold flag per metric, and the job exits non-zero on failure. The eval artifact in the CI log includes a URL to the full experiment diff in the Braintrust UI. Total glue code: zero lines beyond the eval definition itself.

Langfuse requires a short Python script that queries the Langfuse API after the eval run, computes pass/fail against your thresholds, and exits with the appropriate code. This is not burdensome, but it is glue code you own and maintain. The upside is that you have full control over the pass/fail logic — composite scoring rules, weighted thresholds, and exception handling for specific question types are all straightforward to implement.
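The pass/fail half of such a script can be sketched as follows. Fetching the scores is left out on purpose: the sketch assumes you have already aggregated per-metric means from the Langfuse API into a plain dict, and the function names are hypothetical.

```python
import sys


def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Describe every metric that is missing or below its threshold."""
    failures = []
    for name, floor in thresholds.items():
        value = scores.get(name)
        if value is None:
            # A metric that was never scored should block the deploy too.
            failures.append(f"{name}: missing score (required >= {floor:.2f})")
        elif value < floor:
            failures.append(f"{name}: {value:.2f} < {floor:.2f}")
    return failures


def gate_exit_code(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Return 0 when every threshold holds, 1 otherwise, logging the reasons."""
    failures = failing_metrics(scores, thresholds)
    for line in failures:
        print(f"EVAL GATE FAILED {line}", file=sys.stderr)
    return 1 if failures else 0


# In the CI job, after aggregating scores:
#     sys.exit(gate_exit_code(scores, thresholds))
```

Treating a missing metric as a failure is a design choice: it prevents a scorer that silently stopped running from turning the gate green.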

Arize requires the most setup for a CI gate but produces the richest experiment metadata in the CI artifact. The tradeoff is worth it for teams that need detailed regression reports, less so for teams that just need a green/red signal.

The latency of running 200 LLM-judge calls in CI is a real bottleneck regardless of platform. At typical hosted model speeds, a full eval run can take long enough to become a developer experience problem. Routing judge model calls through Groq reduces this substantially — the speed difference between Groq's LPU-based inference and standard hosted endpoints is significant enough that it changes whether developers actually wait for the eval result or skip it. That behavioral change has quality consequences.

Teams using Kestra for pipeline orchestration can separate the PR-gate eval (fast, 200 examples, judge via Groq) from a nightly comprehensive eval (all production traces from the past 24 hours, slower judge model, full RAGAS suite). Kestra's scheduling and dependency management handle this separation cleanly without custom cron infrastructure.

One thing none of these platforms solve automatically: eval dataset drift. As your RAG pipeline evolves to cover new document categories or question types, the 200-triple dataset becomes stale. That is a human process — someone has to review production failures, identify new failure modes, and add labeled examples. The platforms can surface candidates for labeling, but the labeling decision requires domain judgment.

Self-Host Economics and Data Residency

For teams in regulated industries, self-hosting is not a preference — it is a requirement. Langfuse and Arize Phoenix both have documented self-host paths. Braintrust does not, which eliminates it from consideration for these teams before any other evaluation criteria matter.

Self-hosting Langfuse means operating Postgres, ClickHouse, and the Next.js application. At low trace volumes, this is manageable. At tens of millions of spans per month, ClickHouse becomes the dominant cost driver — both in cloud compute and in engineering attention. ClickHouse is not a database that runs itself. Teams without infrastructure engineers who have operated columnar stores should factor that operational burden into the decision.

Phoenix self-hosting is lighter. SQLite works for development; Postgres is the production recommendation. There is no ClickHouse dependency, which means the analytics capabilities are more limited but the operational surface is much smaller. Phoenix is a good fit for teams that want local observability during development and are willing to accept a cloud platform for production-scale analytics.

Teams managing cloud infrastructure with Humanitec can register Langfuse or Phoenix as internal developer platform services, giving individual ML teams a working eval backend without requiring each team to own its deployment. The internal platform team manages the infrastructure; the ML teams consume an API. This pattern scales well across organizations with multiple product teams running independent RAG pipelines.

The decision framework for self-hosting: if data residency is required, Langfuse or Phoenix. If annotation team throughput is the bottleneck, Arize Cloud. If developer iteration speed and CI integration are the priority and data residency is not a constraint, Braintrust.

Which Platform Fits Which Team

Three team archetypes cover most of the decision space:

Archetype 1: Small ML engineering team, shipping fast, no dedicated QA staff. Braintrust. The experiment-centric model matches the workflow of a team where the same engineers who write prompts also review eval results. The CI integration requires minimal glue code. Accept the self-host limitation and the weaker annotation workflow — you are not using those features anyway.

Archetype 2: Mid-size team with domain expert annotators and a compliance requirement. Langfuse self-hosted by default. Data residency control is non-negotiable for this profile, and Langfuse's MIT license lets you run the whole stack on your own infrastructure. Arize Cloud is an option only if your compliance regime permits a vendor-hosted service; in that case, if your annotators need assignment queues and inter-annotator agreement tracking, its annotation features are meaningfully better than Langfuse's. If your annotators can work with a simpler queue, Langfuse self-hosted keeps data on your infrastructure at lower cost.

Archetype 3: Platform team building eval infrastructure for multiple product teams. Langfuse or Arize, because both expose APIs that allow a platform team to build standardized eval pipelines that individual product teams consume without owning the infrastructure. Braintrust's experiment model is harder to abstract into a shared service that multiple teams use with different datasets and rubrics.

Before picking any platform, instrument your RAG pipeline to emit structured traces with retriever and generator spans separated. Every platform discussed here produces better signal from better-structured data — and that instrumentation work is fully portable if you switch tools later. The traces you emit today are the eval dataset you debug with tomorrow.