Promptfoo Review

LLM evaluation and red teaming for AI applications

Promptfoo is an open-source LLM testing framework for developers and security teams building AI applications.

Promptfoo · Founded 2024 · From $50/mo · Free Plan · AI Security · AI Coding Tools · LLM Platforms

AI Panel Score

8.5/10

6 AI reviews

About Promptfoo

Users define test suites in YAML or JSON, specifying prompts, providers, and expected outputs with assertions ranging from deterministic string checks to model-graded rubrics. The CLI runs evaluations locally or in CI pipelines, producing a visual results UI that compares model responses side by side. Tests can be written against single-turn prompts, multi-turn conversations, RAG pipelines, and autonomous agents.
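As a rough illustration of the test-suite structure described above, the sketch below builds a small configuration as a Python dictionary and serializes it to JSON. The top-level keys (`prompts`, `providers`, `tests`, `vars`, `assert`) and the `contains` and `llm-rubric` assertion types are assumed to match Promptfoo's documented config layout; the prompt text, provider ID, and test values are made up for the example.

```python
import json

# Assumed config layout; prompt, provider ID, and test values are illustrative.
config = {
    "prompts": ["Answer in one sentence: {{question}}"],
    "providers": ["openai:gpt-4o-mini"],
    "tests": [
        {
            "vars": {"question": "What is the capital of France?"},
            "assert": [
                # Deterministic string check on the raw model output.
                {"type": "contains", "value": "Paris"},
                # Model-graded rubric: a grader model judges the output.
                {"type": "llm-rubric", "value": "Answers in a single sentence"},
            ],
        }
    ],
}

# JSON text that could be saved as a config file for the CLI to read.
serialized = json.dumps(config, indent=2)
print(serialized)
```

Saving `serialized` to a file such as `promptfooconfig.json` (a hypothetical filename) and pointing the CLI at it would be the next step; the same structure can be expressed in YAML.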

The red team module is a core differentiator: it includes over 80 plugins covering vulnerability categories such as SQL injection, shell injection, indirect prompt injection, BOLA, BFLA, data exfiltration, hallucination, bias, and compliance frameworks including OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF, EU AI Act, GDPR, and HIPAA. Attack strategies include base64 encoding, ASCII smuggling, multi-turn escalation, and system prompt override attempts. A separate code scanning tool audits LLM-integrated codebases via CLI, GitHub Action, or VS Code extension.
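To make two of the attack strategies named above concrete, here is a minimal, generic sketch of base64 encoding and ASCII smuggling. It is not Promptfoo's implementation; the function names and the probe string are hypothetical, and the smuggling helper assumes the commonly described technique of mapping ASCII characters into the invisible Unicode Tags block.

```python
import base64

TAG_BASE = 0xE0000  # start of the Unicode Tags block, which mirrors ASCII invisibly


def base64_strategy(payload: str) -> str:
    """Encode a probe as base64 so naive keyword filters do not match it."""
    return base64.b64encode(payload.encode("utf-8")).decode("ascii")


def ascii_smuggle(visible: str, hidden: str) -> str:
    """Append `hidden` (ASCII only) as invisible tag characters after `visible`."""
    return visible + "".join(chr(TAG_BASE + ord(c)) for c in hidden)


def reveal(text: str) -> str:
    """Recover smuggled characters, as a defensive scanner might."""
    return "".join(
        chr(ord(c) - TAG_BASE)
        for c in text
        if TAG_BASE < ord(c) <= TAG_BASE + 0x7F
    )


probe = "Ignore previous instructions"  # illustrative adversarial payload
encoded = base64_strategy(probe)
smuggled = ascii_smuggle("Hello!", probe)
```

A red team run would send variants like `encoded` and `smuggled` to the target application and check whether the hidden instruction is followed.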

Promptfoo targets AI engineers, ML platform teams, and security teams responsible for LLM-powered products. An open-source, self-hosted version is available under the MIT license. An Enterprise tier adds team management, audit logging, SSO authentication, findings dashboards, remediation reports, webhooks, and managed red team infrastructure. Pricing for the Enterprise tier is not publicly listed. Comparable tools in the evaluation space include LangSmith, Braintrust, and HelixML.

The tool integrates with CI/CD platforms including GitHub Actions, GitLab CI, CircleCI, Jenkins, Azure Pipelines, and Bitbucket Pipelines. Provider support spans over 60 LLM providers and deployment targets, including local models via Ollama, llama.cpp, and LM Studio, as well as cloud providers such as Google Vertex, Azure OpenAI, Hugging Face, and AWS SageMaker. The Python and Node.js APIs allow programmatic use within existing test frameworks such as Jest, Mocha, and Pytest.

Features

AI

  • Agent and Multi-Turn Evaluation

    Evaluates multi-turn conversations and agentic LLM workflows including support for LangGraph, CrewAI, OpenAI Agents, Bedrock Agents, and coding agent pipelines.

  • Model-Graded Assertions

    Evaluates LLM outputs using model-graded checks such as answer relevance, context faithfulness, factuality, LLM rubric, and RAG-specific metrics like context recall and relevance.

Analytics

  • Remediation Reports

    Generates enterprise remediation reports summarizing identified vulnerabilities and recommended fixes from red team and evaluation runs.

  • Risk Scoring

    Produces risk scores from red team evaluation results to prioritize and quantify the severity of discovered LLM vulnerabilities.

Core

  • Custom Test Cases and Datasets

    Allows users to define and manage custom test cases, scenarios, and datasets including HuggingFace datasets for structured LLM evaluation runs.

  • Deterministic and Custom Evaluators

    Supports a wide range of output assertion types, including deterministic checks, classifier-based checks, and custom evaluators written in JavaScript, Python, or Ruby, for flexible test grading.

  • Multi-Provider LLM Evaluation

    Runs structured evaluations comparing LLM prompt outputs side-by-side across dozens of providers including OpenAI, Anthropic, AWS Bedrock, Azure, Google, Mistral, Ollama, and more.
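The custom Python evaluators mentioned in the list above can be sketched as a single grading function. This example assumes the `get_assert(output, context)` entry point and the `pass`/`score`/`reason` result shape described in Promptfoo's Python assertion documentation; the `expected_city` variable and the 30-word limit are hypothetical choices for illustration.

```python
from typing import Any, Dict


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the output mentions the expected city and stays short."""
    # `expected_city` is a hypothetical test variable, not a built-in name.
    expected = context.get("vars", {}).get("expected_city", "")
    mentions_city = bool(expected) and expected.lower() in output.lower()
    short_enough = len(output.split()) <= 30
    passed = mentions_city and short_enough
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": "ok" if passed else f"expected a short answer mentioning {expected!r}",
    }
```

Keeping the check in a plain function like this also makes it easy to unit test outside of an evaluation run.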

Integration

  • CI/CD Integration

    Integrates LLM evaluations and red team scans into CI/CD pipelines via GitHub Actions, GitLab CI, Azure Pipelines, CircleCI, Jenkins, Bitbucket Pipelines, and Travis CI.
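One way a pipeline step might act on evaluation results is a pass-rate gate that decides the build's exit code. The sketch below is a generic illustration, not part of Promptfoo: the function names, the 95% threshold, and the idea of feeding it success and failure counts taken from a results file are all assumptions.

```python
def pass_rate(successes: int, failures: int) -> float:
    """Fraction of test cases that passed; 0.0 when nothing ran."""
    total = successes + failures
    return successes / total if total else 0.0


def gate_exit_code(successes: int, failures: int, min_pass_rate: float = 0.95) -> int:
    """Return 0 to let the pipeline continue, 1 to fail the build."""
    return 0 if pass_rate(successes, failures) >= min_pass_rate else 1


# Example: 97 passing and 3 failing cases against a 95% threshold.
exit_code = gate_exit_code(successes=97, failures=3)
print(f"pass rate {pass_rate(97, 3):.2%}, exit code {exit_code}")
```

Treating an empty run as a failure is a deliberate choice here, so a misconfigured job that executes no tests cannot silently pass the gate.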

Security

  • Audit Logging

    Records enterprise-level audit logs of evaluation and red team activity for compliance and accountability tracking.

  • Code Scanning

    Scans code for LLM-related security issues via a CLI tool, GitHub Action, and VS Code extension to catch vulnerabilities before deployment.

  • LLM Red Teaming

    Automatically generates adversarial test cases targeting vulnerabilities such as prompt injection, PII leakage, jailbreaks, harmful content, and OWASP LLM Top 10 risks across a library of red team plugins.

  • Red Team Plugins

    Provides a broad library of targeted attack plugins covering vulnerabilities including SQL injection, shell injection, SSRF, BOLA, BFLA, prompt extraction, cross-session leaks, RAG poisoning, and bias categories.

Pricing Plans

Community

Free

Open-source tool for individual developers and small teams. Free forever, self-hosted or run locally via CLI or web UI.

  • All core LLM evaluation and testing features
  • Local vulnerability scanning and red teaming
  • 10,000 red-team probes per month
  • CLI and web viewer interface
  • YAML-based test case configuration
  • Multi-model comparison (OpenAI, Anthropic, Gemini, etc.)
  • CI/CD integration (GitHub Actions, GitLab CI, etc.)
  • MIT License: fully open source and self-hostable

Team (Popular)

$50/month

For teams that need advanced collaboration features on top of the open-source core.

  • Everything in Community
  • Cloud-hosted platform
  • Team collaboration and shared results
  • Team management capabilities
  • Shared scan configurations and plugin collections
  • Probe-based usage metering

Enterprise

Contact sales

For larger teams and organizations that want to continuously monitor LLM risk in development and production. Pricing is customized based on team size and needs — contact sales for a personalized quote.

  • Everything in Team
  • Real-time alerts and automated evaluations dashboard
  • Compliance verification with industry frameworks
  • Role-based access control (RBAC) by team, plus SSO
  • Granular permission profiles and customizable API access
  • Audit logging
  • Remediation tracking and suggested fix steps
  • Expanded red-team probe capacity
  • Priority support

On-Premise

Contact sales

For organizations that require full control over their infrastructure. Includes all Enterprise features deployed on-premises. Contact sales for pricing.

  • All Enterprise features
  • On-premises / self-hosted deployment
  • Full infrastructure control and data residency
  • SSO, RBAC, audit logging
  • Custom probe capacity

AI Panel Reviews

The Decision Maker

Strategic bet, vendor viability, timing, adoption approval
9.0/10

OpenAI acquired this in March 2026 — that's the only viability signal you need.

300,000 developers, 156 Fortune 500 customers, and now inside OpenAI. The free tier alone — 10,000 red-team probes monthly — is a legitimate security program for most teams.

The acquisition story changes the math completely. Promptfoo was already the default answer for LLM red teaming before OpenAI bought it. Now it's infrastructure. The 80+ vulnerability plugins covering OWASP LLM Top 10, MITRE ATLAS, and HIPAA compliance aren't a feature list; they're the category definition. LangSmith doesn't come close on the security side. Braintrust doesn't either.

The $50/month Team tier is the real entry point for any org with more than two engineers touching AI. The tradeoff: enterprise pricing is opaque — contact sales, no public numbers. That's a negotiation, not a dealbreaker, but budget your timeline accordingly.

CI/CD integration across GitHub Actions, GitLab, and Jenkins means this fits into existing workflows without a re-architecture conversation. Pilot it in one squad's pipeline for 60 days. The board question answers itself.

Competitive Positioning: 9.0

No direct competitor matches the combined evaluation-plus-red-team depth; LangSmith and Braintrust trail significantly on the security side.

Reputation Risk: 9.2

156 Fortune 500 customers and an OpenAI acquisition make this a board-defensible choice with zero explanation required.

Speed to Value: 8.5

MIT-licensed CLI with YAML config and CI/CD integration means a developer can run first evaluations same day.

Strategic Fit: 9.0

LLM red teaming with 80+ plugins covering regulatory frameworks advances the goals of any team shipping AI products rather than merely cutting cost.

Vendor Viability: 9.5

Acquired by OpenAI in March 2026 — three-year viability concern is effectively off the table.

Pros

  • 10,000 free red-team probes monthly under MIT license — real value before spending a dollar
  • 80+ attack plugins covering OWASP, MITRE ATLAS, GDPR, HIPAA in one framework
  • Plugs into existing CI/CD without new infrastructure
  • OpenAI acquisition removes the existential vendor risk question

Cons

  • Enterprise pricing is contact-sales only — budget cycles get complicated
  • Probe-based metering on the Team tier can surprise teams running large-scale evaluations

Right for

Any engineering or security team shipping LLM-powered products that needs red teaming baked into the deployment pipeline.

Avoid if

Your AI use is purely internal tooling with no customer-facing risk surface and no compliance requirements.

The Domain Strategist

Craft and strategy in the product's domain — adapts identity per category, same lens
8.6/10

The only LLM red teaming platform with 80+ plugins, CI/CD gates, and an on-prem exit.

Promptfoo has built genuine security depth: 80+ red team plugins covering OWASP LLM Top 10, MITRE ATLAS, and HIPAA aren't a feature list, they're a control framework. The OpenAI acquisition in March 2026 changes the governance calculus, but the MIT license and on-prem tier preserve data residency options that matter to regulated industries.

80+ attack plugins across SQL injection, BOLA, BFLA, indirect prompt injection, and RAG poisoning. That's not a demo — that's a vulnerability taxonomy someone actually mapped to production failure modes. The code scanning layer via GitHub Action and VS Code extension means findings surface before merge, which is the only place remediation is cheap. LangSmith doesn't play here; this is a different product category.

The architecture is well-suited to enterprise security programs: RBAC, SSO, audit logging, and remediation tracking all live in the Enterprise tier, with on-prem deployment available for orgs that can't route prompt data through a third-party cloud. If we adopt the on-prem SKU, in 3 years we own our probe history, our red team configurations, and our compliance evidence — no vendor holds that chain of custody. The 10,000 monthly probe limit on the free Community tier will hit any meaningful production coverage fast, which is the honest forcing function toward Enterprise.

The OpenAI acquisition is the flag I'd want answered in any vendor review. Promptfoo's value proposition is adversarial independence — you're stress-testing OpenAI models with a tool now owned by OpenAI. That conflict needs a documented answer before we route red team findings through their cloud infrastructure. On-prem deployment partially mitigates it, but the governance question doesn't disappear.

Category Positioning: 8.7

Trusted by 156 Fortune 500 companies with 300,000+ developers puts this well ahead of LangSmith and Braintrust on security-specific adoption — it owns the red teaming segment.

Domain Fit: 8.8

CI/CD gate integration, RBAC, audit logging, and remediation reports match how a mature AppSec program actually operates — shift-left by design, not by marketing.

Integration Surface: 9.0

GitHub Actions, GitLab CI, Jenkins, Azure Pipelines, and CircleCI, plus Jest/Mocha/Pytest APIs, cover virtually every pipeline topology we'd encounter in a Fortune 500 environment.

Long-term Implications: 8.2

MIT license and on-prem SKU preserve exit rights, but the OpenAI acquisition introduces a conflict-of-interest risk for orgs stress-testing OpenAI models through Promptfoo's cloud tier.

Strategic Depth: 9.0

80+ red team plugins mapped to OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF, EU AI Act, and HIPAA represent genuine control-framework depth, not surface-level scanning.

Pros

  • 80+ red team plugins covering BOLA, BFLA, RAG poisoning, and indirect prompt injection — attack surface coverage that matches real threat models
  • On-prem deployment with full data residency available, critical for HIPAA and financial services
  • CI/CD gate integration catches vulnerabilities pre-merge, where fix cost is lowest
  • MIT-licensed open source core means no lock-in on test configs or historical data

Cons

  • OpenAI acquisition creates a genuine conflict-of-interest question for orgs red-teaming OpenAI models via cloud tier
  • 10,000 monthly probe limit on Community tier will be exhausted quickly by any team running continuous evaluation
  • Enterprise pricing is opaque — no public rate card makes budget forecasting and procurement slow
  • Cloud tier requires routing prompt payloads externally, which may conflict with data classification policies

Right for

Security-mature engineering orgs that need CI-integrated LLM red teaming with compliance framework coverage and on-prem data residency.

Avoid if

Your threat model requires adversarial testing of OpenAI models by a vendor with zero OpenAI ownership ties.

The Finance Lead

Money, total cost of ownership, contracts, procurement math
8.2/10

$0 open-source core, 80+ red team plugins, enterprise price hidden — classic freemium math

Promptfoo's Community tier is genuinely free: MIT license, self-hosted, 10,000 red-team probes/month. Enterprise SSO and audit logging require a sales call, which is where the real number lives.

Community is $0. Team is $50/month flat — not per seat, based on the pricing page. That's unusual and buyer-friendly for small teams. 50 developers on Team: $50 × 12 = $600/year. Compare to LangSmith at per-seat pricing; Promptfoo wins the SMB math decisively.

Year 3 gets murkier. Enterprise tier is contact-sales. SSO, RBAC, audit logging, and remediation reports all sit behind that wall. The probe-based usage metering on Team isn't rate-carded publicly — overage risk is real and unquantifiable without a call. That's the number you can't model.

The OpenAI acquisition in March 2026 introduces vendor dependency risk procurement should flag. Red team coverage is strong — 80+ plugins, OWASP LLM Top 10, MITRE ATLAS, HIPAA. ROI is measurable: vulnerabilities caught pre-production have quantifiable remediation cost avoidance. That's a defensible budget conversation.
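The flat-rate arithmetic in this review ($50 × 12 = $600/year regardless of headcount) can be checked against a per-seat model with a short sketch. The $39 per-seat monthly price below is a hypothetical placeholder, not a quoted competitor rate, and the function names are illustrative.

```python
def annual_flat_cost(monthly_price: float) -> float:
    """Yearly cost of a flat-rate plan, independent of team size."""
    return monthly_price * 12


def annual_per_seat_cost(seat_price: float, seats: int) -> float:
    """Yearly cost of a plan billed per seat per month."""
    return seat_price * seats * 12


TEAM_MONTHLY = 50          # Team tier price stated in this review
HYPOTHETICAL_SEAT = 39     # placeholder per-seat price, for comparison only
SEATS = 50

flat = annual_flat_cost(TEAM_MONTHLY)
per_seat = annual_per_seat_cost(HYPOTHETICAL_SEAT, SEATS)
print(f"flat: ${flat}/year, per-seat: ${per_seat}/year")
```

The comparison leaves out probe-based overage charges, which the review notes are not publicly rate-carded and therefore cannot be modeled here.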

Billing & Procurement: 7.5

Team tier is self-serve with probe-based metering; Enterprise requires sales, adding procurement friction and timeline uncertainty.

Contract Flexibility: 6.5

No public auto-renewal terms, cancellation windows, or term lengths are documented on the pricing page — standard enterprise opacity.

Pricing Transparency: 7.0

Community ($0) and Team ($50/month) are fully visible; Enterprise and On-Premise are contact-sales with no floor or ceiling published.

ROI Clarity: 8.0

Pre-production vulnerability detection maps directly to avoided remediation costs; risk scoring and remediation reports give procurement a defensible number.

Total Cost of Ownership: 7.5

Team tier is low-cost at $50/month flat, but probe-based metering and hidden Enterprise pricing make 3-year TCO unmodelable without a sales engagement.

Pros

  • MIT-licensed Community tier with 10,000 probes/month — zero lock-in
  • Team tier at $50/month flat beats per-seat competitor pricing
  • 80+ red team plugins cover OWASP LLM Top 10, MITRE ATLAS, HIPAA in one tool
  • CI/CD integration across 6+ platforms requires no additional license

Cons

  • Enterprise SSO and audit logging are paywalled behind contact-sales — no floor price
  • Probe-based metering on Team has no published overage rate
  • OpenAI acquisition creates vendor concentration risk for enterprise buyers
  • No public contract terms — auto-renewal and cancellation windows unknown

Right for

Security-conscious engineering teams at mid-market companies who can start on the free tier and grow into Enterprise pricing.

Avoid if

Your procurement team requires fully published pricing and contract terms before a sales conversation.

The Domain Practitioner

Daily hands-on reality in the product's domain — adapts identity per category, same lens
8.6/10

Promptfoo puts LLM red teaming in your CI pipeline before production finds the bugs

80+ attack plugins covering OWASP LLM Top 10, MITRE ATLAS, and prompt injection — all wired into GitHub Actions from day one. OpenAI acquired them in March 2026, which either means long-term investment or roadmap capture depending on your paranoia level.

YAML-defined test suites that run in CI, produce a visual diff UI, and output structured results. That's the workflow. The CLI can write JSON results and integrates with Jest, Mocha, and Pytest, which tells me engineers actually designed this, not a product team that learned about engineers secondhand. The 10,000 free red-team probes per month on Community tier is a real number for a solo security engineer doing pre-release scans. Multi-turn agent evaluation covering LangGraph, CrewAI, and OpenAI Agents is the differentiator LangSmith doesn't match on the adversarial side.

The friction shows at the Team tier ($50/month). You're moving from self-hosted to cloud-hosted, which means your prompt data leaves your perimeter. For anyone handling regulated workloads, that's an immediate conversation with legal. On-Premise tier exists but it's contact-sales pricing, which stalls procurement cycles.

Power-user depth is real: base64 encoding, ASCII smuggling, multi-turn escalation, BOLA, BFLA, indirect prompt injection. These aren't checkbox features. The code scanning VS Code extension and GitHub Action mean the security surface extends left into development, not just pre-prod. The OpenAI acquisition is the unresolved risk for any team running evals against non-OpenAI models.

Day-3 Reality: 8.2

YAML config and CLI-first design mean no mandatory GUI workflow, but multi-provider test suite maintenance grows fast as model versions drift.

Documentation Practitioner-Fit: 8.5

Docs, API, changelog, and blog all present — changelog especially signals a team tracking real usage, not just shipping features.

Friction Surface: 7.8

Community tier is genuinely frictionless; the jump to Team requires cloud data residency decisions that security teams won't make unilaterally.

Power-User Depth: 9.1

80+ red team plugins with attack strategies including ASCII smuggling, SSRF, RAG poisoning, and BFLA give a security engineer genuine depth beyond surface-level prompt injection checks.

Workflow Integration: 9.0

Native CI/CD support across GitHub Actions, GitLab CI, Jenkins, and Azure Pipelines means scans plug into existing pipelines without a new tool-shaped hole in the process.

Pros

  • 80+ red team plugins covering OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF, GDPR, and HIPAA in one framework
  • CI/CD-native from day one — GitHub Actions, GitLab CI, Jenkins, Azure Pipelines all supported
  • MIT-licensed Community tier with 10,000 probes/month makes proof-of-concept trivial to justify
  • Multi-turn agent evaluation for LangGraph, CrewAI, and OpenAI Agents is ahead of LangSmith on adversarial coverage

Cons

  • Team tier moves data to cloud-hosted infrastructure — a hard stop for regulated or air-gapped environments
  • On-Premise tier is contact-sales pricing, which adds procurement lag for the teams that need it most
  • OpenAI acquisition creates legitimate conflict-of-interest questions for teams evaluating Anthropic or competing models

Right for

Security engineers and ML platform teams who need adversarial LLM testing wired into CI before production, not after.

Avoid if

Your threat model requires air-gapped evaluation and you can't wait on a sales cycle for on-premise pricing.

The Power User

Daily human experience, onboarding, polish, learning curve, reliability
8.4/10

300,000 developers can't be wrong — this is how you ship AI without getting burned

Promptfoo is the serious developer's answer to LLM security testing, with 80+ red team plugins and CI/CD hooks that actually fit how teams work. The free tier is genuinely generous; the tradeoff is that this is a CLI-first tool and it will feel like that.

The free Community tier alone — 10,000 red-team probes a month, full CI/CD integration, multi-model comparison against OpenAI, Anthropic, Bedrock, all of it — is more than most teams would've paid good money for two years ago. LangSmith and Braintrust are in this space too, but neither leads with security posture the way Promptfoo does. The OWASP LLM Top 10 coverage, MITRE ATLAS, HIPAA, EU AI Act compliance checks — that's not a feature list, that's someone who thought hard about what gets you fired.

Day three, you'll have opinions about YAML. That's just the reality. This is a developer tool wearing a developer tool's clothes, not a polished SaaS dashboard. The web viewer helps, but the core experience lives in the terminal.

The $50/month Team tier adds cloud hosting and shared configs, which is where most small teams will land. Honest tradeoff: non-technical stakeholders will need someone to translate the results for them. But for the engineers actually building LLM products, this feels built by people who'd use it themselves.

Daily Polish: 7.2

The visual results UI and side-by-side model comparison show care, but a CLI-primary workflow means polish lives where designers rarely look.

Learning Curve: 7.5

80+ plugins and coverage of BOLA, BFLA, ASCII smuggling, and multi-turn escalation mean serious depth, but that depth takes time to discover.

Mobile Parity: 4.5

Web platform exists, but this is a CLI and pipeline tool — nobody's running red team probes from their phone, and the product doesn't pretend otherwise.

Onboarding Experience: 7.8

YAML-based config is familiar to developers, and docs are confirmed present, but non-engineers will hit a wall fast.

Reliability Feel: 8.1

CI/CD-native architecture and self-hostable MIT-licensed core suggest a team that treats reliability as table stakes, not a feature.

Pros

  • Free tier includes 10,000 red-team probes/month — genuinely useful, not a teaser
  • 80+ red team plugins covering OWASP LLM Top 10, MITRE ATLAS, HIPAA, and more
  • Plugs into every CI/CD platform your team already runs
  • Acquired by OpenAI in March 2026 — institutional staying power

Cons

  • YAML-first workflow is a real barrier for anyone not comfortable in a terminal
  • Enterprise pricing is contact-sales only, so budgeting is a guessing game
  • Mobile experience is basically nonexistent for a hands-on tool
  • Results dashboards will need translation for non-technical stakeholders

Right for

Developer and security teams actively shipping LLM-powered products who need real red teaming baked into their build pipeline.

Avoid if

Your team expects a no-code, polished SaaS interface where anyone can run tests without touching a config file.

The Skeptic

Contrarian. Watch-outs, deal-breakers, broken promises, category patterns
8.2/10

Acquired by OpenAI in March 2026 — that's a green flag and a question mark simultaneously.

Promptfoo is the most serious open-source LLM red-teaming tool in the category, with 80+ attack plugins, 60+ providers, and 300,000 developers already on it. The OpenAI acquisition changes the calculus — could accelerate it, could absorb and sunset it.

Three observations upfront. One: 156 Fortune 500 companies is the kind of claim that usually precedes a pivot. Two: the MIT-licensed core is genuinely portable — if this goes sideways, you revert to self-hosted with no migration tax. Three: 10,000 red-team probes free per month is a real number, not a trial crumb.

The red team module covering OWASP LLM Top 10, MITRE ATLAS, and HIPAA compliance in one framework is differentiated. LangSmith and Braintrust don't touch security depth at this level. The tradeoff: enterprise pricing is opaque, and post-acquisition roadmap is anyone's guess.

Honest take: came in skeptical, leaving hedged-positive. The open-source exit is clean. The acquisition is the only real unknown.

Competitive Differentiation: 8.5

80+ red-team plugins covering BOLA, BFLA, and MITRE ATLAS go well beyond what LangSmith or Braintrust offer; security depth is a real gap filled, not a copycat feature list.

Exit Portability: 9.0

MIT license, self-hostable CLI, YAML-based config, and no proprietary data lock-in mean migration off is as clean as any tool in this category.

Long-term Viability: 7.2

OpenAI acquisition in March 2026 is either the best or worst thing that happened to this product — no public post-acquisition roadmap signals yet make this a watch item.

Marketing Honesty: 7.5

'300,000+ developers' and Fortune 500 count are bold claims, but the MIT open-source license and public docs make them at least auditable — no obvious inflation in feature descriptions.

Track Record Match: 8.0

Open-source-first security tooling with CI/CD depth matches patterns from survivors like Snyk, not the category graveyard — 60+ provider integrations suggest real shipping cadence.

Pros

  • MIT-licensed core means zero vendor lock-in at the $0 tier
  • 80+ red-team plugins covering OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF in one framework
  • 60+ LLM providers including local models via Ollama — unusually broad
  • CI/CD integration across 7 named platforms, including GitHub Actions and Jenkins

Cons

  • OpenAI acquisition creates roadmap uncertainty — could absorb, could deprecate
  • Enterprise pricing is completely opaque — no public number above $50/month Team tier
  • Post-acquisition, the 'zero vendor lock-in' story gets complicated if OpenAI tightens the license

Right for

AI engineers and security teams who need serious LLM red-teaming in CI and want a self-hostable fallback if the vendor story changes.

Avoid if

You need a contractual enterprise SLA today and can't tolerate acquisition-phase roadmap ambiguity.

Buyer Questions

Common questions answered by our AI research team

Integration

Which CI/CD platforms does Promptfoo integrate with?

Promptfoo integrates with GitHub Actions, GitLab CI, Azure Pipelines, CircleCI, Jenkins, Bitbucket Pipelines, and Travis CI.

Security

What vulnerability types does the red teaming module cover?

The red teaming module includes over 80 plugins covering vulnerabilities such as prompt injection, PII leakage, jailbreaks, and OWASP LLM Top 10 risks.

Setup

Can Promptfoo be deployed on-premise?

Yes, Promptfoo supports on-premise deployment alongside cloud options.

Features

Which LLM providers does Promptfoo support?

Promptfoo supports over 60 LLM providers and deployment targets, including OpenAI, Anthropic, and AWS Bedrock.

Pricing

Is there an open source version available?

Yes, an open-source version is available, used by 300,000+ developers with zero vendor lock-in.

Product Information

  • Company

    Promptfoo
  • Founded

    2024
  • Pricing

    From $50/mo
  • Free Plan

    Available

Platforms

Web, macOS, Windows, Linux

About Promptfoo

Promptfoo is an open-source platform for testing, evaluating, and red-teaming large language models and AI applications, used to identify vulnerabilities before production deployment.

Resources

Documentation
API
Blog
Changelog
