LLM evaluation and red teaming for AI applications
Promptfoo is an open-source LLM testing framework for developers and security teams building AI applications.
6 AI reviews
Users define test suites in YAML or JSON, specifying prompts, providers, and expected outputs with assertions ranging from deterministic string checks to model-graded rubrics. The CLI runs evaluations locally or in CI pipelines, producing a visual results UI that compares model responses side by side. Tests can be written against single-turn prompts, multi-turn conversations, RAG pipelines, and autonomous agents.
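As a rough illustration of the suite structure described above, the sketch below builds a small configuration as a Python dictionary and prints it as JSON. The key names (`prompts`, `providers`, `tests`, `vars`, `assert`), the assertion types (`contains`, `llm-rubric`), and the provider ID are assumptions modelled on Promptfoo's documented config layout and should be checked against the current docs. The prompt text and test values are hypothetical.

```python
import json


def build_config() -> dict:
    """Return a minimal, illustrative test-suite definition.

    All key names and the provider ID are assumptions; verify them
    against the Promptfoo documentation before relying on them.
    """
    return {
        "description": "Support-bot smoke test (hypothetical)",
        # One prompt template with a {{question}} variable.
        "prompts": ["Answer the customer politely: {{question}}"],
        # Assumed provider identifier format.
        "providers": ["openai:gpt-4o-mini"],
        "tests": [
            {
                "vars": {"question": "How do I reset my password?"},
                "assert": [
                    # Deterministic string check.
                    {"type": "contains", "value": "password"},
                    # Model-graded rubric.
                    {
                        "type": "llm-rubric",
                        "value": "Is polite and gives concrete steps",
                    },
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Print the suite as JSON; saving it to a config file is left to the reader.
    print(json.dumps(build_config(), indent=2))
```

Pairing a deterministic check with a rubric in the same test case mirrors the range of assertions the description mentions: the cheap string check catches obvious regressions, while the rubric covers qualities a string match cannot express.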
The red team module is a core differentiator: it includes over 80 plugins covering vulnerability categories such as SQL injection, shell injection, indirect prompt injection, BOLA, BFLA, data exfiltration, hallucination, bias, and compliance frameworks including OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF, EU AI Act, GDPR, and HIPAA. Attack strategies include base64 encoding, ASCII smuggling, multi-turn escalation, and system prompt override attempts. A separate code scanning tool audits LLM-integrated codebases via CLI, GitHub Action, or VS Code extension.
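To make two of the attack strategies named above concrete, the sketch below shows what base64 encoding and ASCII smuggling do to a probe string. It illustrates the general techniques only and is not Promptfoo's implementation. The function names and the sample probe are hypothetical, and ASCII smuggling is assumed here to mean mapping ASCII characters into the invisible Unicode Tags block (U+E0000 to U+E007F).

```python
import base64


def base64_variant(probe: str) -> str:
    """Encode a probe as base64, a common way to slip text past keyword filters."""
    return base64.b64encode(probe.encode("utf-8")).decode("ascii")


def tag_smuggle(probe: str) -> str:
    """Map each ASCII character to its invisible Unicode Tags counterpart.

    Assumption: this is the encoding meant by "ASCII smuggling".
    """
    if not probe.isascii():
        raise ValueError("only ASCII input can be mapped to the Tags block")
    return "".join(chr(0xE0000 + ord(ch)) for ch in probe)


def tag_reveal(hidden: str) -> str:
    """Reverse tag_smuggle so a reviewer can inspect a hidden payload."""
    return "".join(chr(ord(ch) - 0xE0000) for ch in hidden)


if __name__ == "__main__":
    probe = "Ignore previous instructions."  # benign sample probe
    print(base64_variant(probe))
    hidden = tag_smuggle(probe)
    print(len(hidden), "invisible characters")
    print(tag_reveal(hidden))
```

The point of both transformations is the same: the payload's meaning is preserved for a model that can decode it, while naive input filters looking for plain-text patterns see nothing suspicious.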
Promptfoo targets AI engineers, ML platform teams, and security teams responsible for LLM-powered products. An open-source self-hosted version is available under a permissive license. An Enterprise tier adds team management, audit logging, SSO authentication, findings dashboards, remediation reports, webhooks, and managed red team infrastructure. Pricing for the enterprise tier is not publicly listed. Comparable tools in the evaluation space include LangSmith, Braintrust, and HelixML.
The tool integrates with CI/CD platforms including GitHub Actions, GitLab CI, CircleCI, Jenkins, Azure Pipelines, and Bitbucket Pipelines. Provider support spans over 60 LLM providers and deployment targets, including local models via Ollama, llama.cpp, and LM Studio, as well as cloud providers such as Google Vertex, Azure OpenAI, Hugging Face, and AWS SageMaker. The Python and Node.js APIs allow programmatic use within existing test frameworks such as Jest, Mocha, and Pytest.
Evaluates multi-turn conversations and agentic LLM workflows including support for LangGraph, CrewAI, OpenAI Agents, Bedrock Agents, and coding agent pipelines.
Evaluates LLM outputs using model-graded checks such as answer relevance, context faithfulness, factuality, LLM rubric, and RAG-specific metrics like context recall and relevance.
Generates enterprise remediation reports summarizing identified vulnerabilities and recommended fixes from red team and evaluation runs.
Produces risk scores from red team evaluation results to prioritize and quantify the severity of discovered LLM vulnerabilities.
Allows users to define and manage custom test cases, scenarios, and datasets, including Hugging Face datasets, for structured LLM evaluation runs.
Supports a wide range of output assertion types, including deterministic checks, classifier-based checks, and custom evaluators written in JavaScript, Python, or Ruby, for flexible test grading.
Runs structured evaluations comparing LLM prompt outputs side-by-side across dozens of providers including OpenAI, Anthropic, AWS Bedrock, Azure, Google, Mistral, Ollama, and more.
Integrates LLM evaluations and red team scans into CI/CD pipelines via GitHub Actions, GitLab CI, Azure Pipelines, CircleCI, Jenkins, Bitbucket Pipelines, and Travis CI.
Records enterprise-level audit logs of evaluation and red team activity for compliance and accountability tracking.
Scans code for LLM-related security issues via a CLI tool, GitHub Action, and VS Code extension to catch vulnerabilities before deployment.
Automatically generates adversarial test cases targeting vulnerabilities such as prompt injection, PII leakage, jailbreaks, harmful content, and OWASP LLM Top 10 risks across a library of red team plugins.
Provides a broad library of targeted attack plugins covering vulnerabilities including SQL injection, shell injection, SSRF, BOLA, BFLA, prompt extraction, cross-session leaks, RAG poisoning, and bias categories.
Community: Open-source tool for individual developers and small teams. Free forever, self-hosted or run locally via CLI or web UI.
Team: For teams that need advanced collaboration features on top of the open-source core.
Enterprise: For larger teams and organizations that want to continuously monitor LLM risk in development and production. Pricing is customized based on team size and needs; contact sales for a personalized quote.
On-Premise: For organizations that require full control over their infrastructure. Includes all Enterprise features deployed on-premises. Contact sales for pricing.
OpenAI acquired this in March 2026 — that's the only viability signal you need.
“300,000 developers, 156 Fortune 500 customers, and now inside OpenAI. The free tier alone — 10,000 red-team probes monthly — is a legitimate security program for most teams.”
The acquisition story changes the math completely. Promptfoo was already the default answer for LLM red teaming before OpenAI bought it. Now it's infrastructure. The 80+ vulnerability plugins covering OWASP LLM Top 10, MITRE ATLAS, and HIPAA compliance aren't a feature list; they're the category definition. LangSmith doesn't come close on the security side. Braintrust doesn't either.
The $50/month Team tier is the real entry point for any org with more than two engineers touching AI. The tradeoff: enterprise pricing is opaque — contact sales, no public numbers. That's a negotiation, not a dealbreaker, but budget your timeline accordingly.
CI/CD integration across GitHub Actions, GitLab, and Jenkins means this fits into existing workflows without a re-architecture conversation. Pilot it in one squad's pipeline for 60 days. The board question answers itself.
No direct competitor matches the combined evaluation-plus-red-team depth; LangSmith and Braintrust trail significantly on the security side.
156 Fortune 500 customers and an OpenAI acquisition make this a board-defensible choice with zero explanation required.
MIT-licensed CLI with YAML config and CI/CD integration means a developer can run first evaluations the same day.
LLM red teaming with 80+ plugins covering regulatory frameworks advances any team shipping AI products rather than merely cutting cost.
Acquired by OpenAI in March 2026 — three-year viability concern is effectively off the table.
Any engineering or security team shipping LLM-powered products that needs red teaming baked into the deployment pipeline.
Your AI use is purely internal tooling with no customer-facing risk surface and no compliance requirements.
The only LLM red teaming platform with 80+ plugins, CI/CD gates, and an on-prem exit.
“Promptfoo has built genuine security depth: 80+ red team plugins covering OWASP LLM Top 10, MITRE ATLAS, and HIPAA aren't a feature list, they're a control framework. The OpenAI acquisition in March 2026 changes the governance calculus, but the MIT license and on-prem tier preserve data residency options that matter to regulated industries.”
80+ attack plugins across SQL injection, BOLA, BFLA, indirect prompt injection, and RAG poisoning. That's not a demo — that's a vulnerability taxonomy someone actually mapped to production failure modes. The code scanning layer via GitHub Action and VS Code extension means findings surface before merge, which is the only place remediation is cheap. LangSmith doesn't play here; this is a different product category.
The architecture is well-suited to enterprise security programs: RBAC, SSO, audit logging, and remediation tracking all live in the Enterprise tier, with on-prem deployment available for orgs that can't route prompt data through a third-party cloud. If we adopt the on-prem SKU, in 3 years we own our probe history, our red team configurations, and our compliance evidence, with no vendor holding that chain of custody. Any meaningful production coverage will exhaust the free Community tier's 10,000 monthly probe limit quickly, which is the honest forcing function toward Enterprise.
The OpenAI acquisition is the flag I'd want answered in any vendor review. Promptfoo's value proposition is adversarial independence — you're stress-testing OpenAI models with a tool now owned by OpenAI. That conflict needs a documented answer before we route red team findings through their cloud infrastructure. On-prem deployment partially mitigates it, but the governance question doesn't disappear.
Being trusted by 156 Fortune 500 companies and 300,000+ developers puts this well ahead of LangSmith and Braintrust on security-specific adoption; it owns the red teaming segment.
CI/CD gate integration, RBAC, audit logging, and remediation reports match how a mature AppSec program actually operates — shift-left by design, not by marketing.
GitHub Actions, GitLab CI, Jenkins, Azure Pipelines, CircleCI, plus Jest/Mocha/Pytest APIs cover virtually every pipeline topology we'd encounter in a Fortune 500 environment.
MIT license and on-prem SKU preserve exit rights, but the OpenAI acquisition introduces a conflict-of-interest risk for orgs stress-testing OpenAI models through Promptfoo's cloud tier.
80+ red team plugins mapped to OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF, EU AI Act, and HIPAA represent genuine control-framework depth, not surface-level scanning.
Security-mature engineering orgs that need CI-integrated LLM red teaming with compliance framework coverage and on-prem data residency.
Your threat model requires adversarial testing of OpenAI models by a vendor with zero OpenAI ownership ties.
$0 open-source core, 80+ red team plugins, enterprise price hidden — classic freemium math
“Promptfoo's Community tier is genuinely free: MIT license, self-hosted, 10,000 red-team probes/month. Enterprise SSO and audit logging require a sales call, which is where the real number lives.”
Community is $0. Team is $50/month flat — not per seat, based on the pricing page. That's unusual and buyer-friendly for small teams. 50 developers on Team: $50 × 12 = $600/year. Compare to LangSmith at per-seat pricing; Promptfoo wins the SMB math decisively.
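The flat-versus-per-seat comparison above can be sketched as a short calculation. The $50/month flat fee and the 50-developer headcount come from the review itself. The per-seat price is a hypothetical placeholder, not LangSmith's actual rate, and the function names are illustrative only.

```python
def flat_annual_cost(monthly_fee: int) -> int:
    """Annual cost of a flat-rate plan, independent of headcount."""
    return monthly_fee * 12


def per_seat_annual_cost(seats: int, monthly_fee_per_seat: int) -> int:
    """Annual cost of a per-seat plan, which scales with headcount."""
    return seats * monthly_fee_per_seat * 12


if __name__ == "__main__":
    seats = 50
    team_flat = flat_annual_cost(50)  # $50/month flat, per the review
    # Placeholder per-seat price for comparison; NOT a real vendor quote.
    hypothetical_per_seat = per_seat_annual_cost(seats, 40)
    print(f"Flat plan: ${team_flat}/year")
    print(f"Per-seat plan at an assumed $40/seat: ${hypothetical_per_seat}/year")
```

Neither figure accounts for probe-based usage metering or overages, so the flat number is a floor rather than a full cost estimate.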
Year 3 gets murkier. Enterprise tier is contact-sales. SSO, RBAC, audit logging, and remediation reports all sit behind that wall. The probe-based usage metering on Team isn't rate-carded publicly — overage risk is real and unquantifiable without a call. That's the number you can't model.
The OpenAI acquisition in March 2026 introduces vendor dependency risk procurement should flag. Red team coverage is strong — 80+ plugins, OWASP LLM Top 10, MITRE ATLAS, HIPAA. ROI is measurable: vulnerabilities caught pre-production have quantifiable remediation cost avoidance. That's a defensible budget conversation.
Team tier is self-serve with probe-based metering; Enterprise requires sales, adding procurement friction and timeline uncertainty.
No public auto-renewal terms, cancellation windows, or term lengths are documented on the pricing page — standard enterprise opacity.
Community ($0) and Team ($50/month) are fully visible; Enterprise and On-Premise are contact-sales with no floor or ceiling published.
Pre-production vulnerability detection maps directly to avoided remediation costs; risk scoring and remediation reports give procurement a defensible number.
Team tier is low-cost at $50/month flat, but probe-based metering and hidden Enterprise pricing make 3-year TCO unmodelable without a sales engagement.
Security-conscious engineering teams at mid-market companies who can start on the free tier and grow into Enterprise pricing.
Your procurement team requires fully published pricing and contract terms before a sales conversation.
Promptfoo puts LLM red teaming in your CI pipeline before production finds the bugs
“80+ attack plugins covering OWASP LLM Top 10, MITRE ATLAS, and prompt injection — all wired into GitHub Actions from day one. OpenAI acquired them in March 2026, which either means long-term investment or roadmap capture depending on your paranoia level.”
YAML-defined test suites that run in CI, produce a visual diff UI, and output structured results. That's the workflow. CLI ships with `--json` output and integrates with Jest, Mocha, and Pytest — tells me engineers actually designed this, not a product team that learned about engineers secondhand. The 10,000 free red-team probes per month on Community tier is a real number for a solo security engineer doing pre-release scans. Multi-turn agent evaluation covering LangGraph, CrewAI, and OpenAI Agents is the differentiator LangSmith doesn't match on the adversarial side.
The friction shows at the Team tier ($50/month). You're moving from self-hosted to cloud-hosted, which means your prompt data leaves your perimeter. For anyone handling regulated workloads, that's an immediate conversation with legal. An On-Premise tier exists, but its contact-sales pricing stalls procurement cycles.
Power-user depth is real: base64 encoding, ASCII smuggling, multi-turn escalation, BOLA, BFLA, and indirect prompt injection aren't checkbox features. The code scanning VS Code extension and GitHub Action mean the security surface extends left into development, not just pre-prod. The OpenAI acquisition is the unresolved risk for any team running evals against non-OpenAI models.
YAML config and CLI-first design mean no mandatory GUI workflow, but multi-provider test suite maintenance grows fast as model versions drift.
Docs, API, changelog, and blog all present — changelog especially signals a team tracking real usage, not just shipping features.
Community tier is genuinely frictionless; the jump to Team requires cloud data residency decisions that security teams won't make unilaterally.
80+ red team plugins covering SSRF, RAG poisoning, and BFLA, plus attack strategies such as ASCII smuggling, give a security engineer genuine depth beyond surface-level prompt injection checks.
Native CI/CD support across GitHub Actions, GitLab CI, Jenkins, and Azure Pipelines means scans plug into existing pipelines without a new tool-shaped hole in the process.
Security engineers and ML platform teams who need adversarial LLM testing wired into CI before production, not after.
Your threat model requires air-gapped evaluation and you can't wait on a sales cycle for on-premise pricing.
300,000 developers can't be wrong — this is how you ship AI without getting burned
“Promptfoo is the serious developer's answer to LLM security testing, with 80+ red team plugins and CI/CD hooks that actually fit how teams work. The free tier is genuinely generous; the tradeoff is that this is a CLI-first tool and it will feel like that.”
The free Community tier alone — 10,000 red-team probes a month, full CI/CD integration, multi-model comparison against OpenAI, Anthropic, Bedrock, all of it — is more than most teams would've paid good money for two years ago. LangSmith and Braintrust are in this space too, but neither leads with security posture the way Promptfoo does. The OWASP LLM Top 10 coverage, MITRE ATLAS, HIPAA, EU AI Act compliance checks — that's not a feature list, that's someone who thought hard about what gets you fired.
Day three, you'll have opinions about YAML. That's just the reality. This is a developer tool wearing a developer tool's clothes, not a polished SaaS dashboard. The web viewer helps, but the core experience lives in the terminal.
The $50/month Team tier adds cloud hosting and shared configs, which is where most small teams will land. Honest tradeoff: non-technical stakeholders will need someone to translate the results for them. But for the engineers actually building LLM products, this feels built by people who'd use it themselves.
The visual results UI and side-by-side model comparison show care, but a CLI-primary workflow means polish lives where designers rarely look.
80+ plugins and coverage of BOLA, BFLA, ASCII smuggling, and multi-turn escalation mean serious depth, but that depth takes time to discover.
Web platform exists, but this is a CLI and pipeline tool — nobody's running red team probes from their phone, and the product doesn't pretend otherwise.
YAML-based config is familiar to developers, and docs are confirmed present, but non-engineers will hit a wall fast.
CI/CD-native architecture and self-hostable MIT-licensed core suggest a team that treats reliability as table stakes, not a feature.
Developer and security teams actively shipping LLM-powered products who need real red teaming baked into their build pipeline.
Your team expects a no-code, polished SaaS interface where anyone can run tests without touching a config file.
Acquired by OpenAI in March 2026 — that's a green flag and a question mark simultaneously.
“Promptfoo is the most serious open-source LLM red-teaming tool in the category, with 80+ attack plugins, 60+ providers, and 300,000 developers already on it. The OpenAI acquisition changes the calculus — could accelerate it, could absorb and sunset it.”
Three observations upfront. One: 156 Fortune 500 companies is the kind of claim that usually precedes a pivot. Two: the MIT-licensed core is genuinely portable — if this goes sideways, you revert to self-hosted with no migration tax. Three: 10,000 red-team probes free per month is a real number, not a trial crumb.
The red team module covering OWASP LLM Top 10, MITRE ATLAS, and HIPAA compliance in one framework is differentiated. LangSmith and Braintrust don't touch security depth at this level. The tradeoff: enterprise pricing is opaque, and post-acquisition roadmap is anyone's guess.
Honest take: came in skeptical, leaving hedged-positive. The open-source exit is clean. The acquisition is the only real unknown.
80+ red-team plugins covering BOLA, BFLA, and MITRE ATLAS go well beyond what LangSmith or Braintrust offer. Security depth is a real gap filled, not a copycat feature list.
MIT license, self-hostable CLI, YAML-based config, and no proprietary data lock-in mean migration off is as clean as any tool in this category.
The OpenAI acquisition in March 2026 is either the best or worst thing that happened to this product. The absence of any public post-acquisition roadmap signals makes this a watch item.
'300,000+ developers' and Fortune 500 count are bold claims, but the MIT open-source license and public docs make them at least auditable — no obvious inflation in feature descriptions.
Open-source-first security tooling with CI/CD depth matches patterns from survivors like Snyk, not the category graveyard — 60+ provider integrations suggest real shipping cadence.
AI engineers and security teams who need serious LLM red-teaming in CI and want a self-hostable fallback if the vendor story changes.
You need a contractual enterprise SLA today and can't tolerate acquisition-phase roadmap ambiguity.
Common questions answered by our AI research team
Promptfoo integrates with GitHub, GitLab, and Jenkins, among other CI/CD platforms.
The red teaming module covers 50+ vulnerability types, including prompt injection, PII leakage, jailbreaks, and OWASP LLM Top 10 risks.
Yes, Promptfoo supports on-premise deployment alongside cloud options.
Promptfoo supports OpenAI, Anthropic, AWS Bedrock, and dozens more LLM providers.
Yes, an open-source version is available, used by 300,000+ developers with zero vendor lock-in.
Company: Promptfoo
Founded: 2024
Pricing: From $50/mo
Free Plan: Available
Promptfoo is an open-source platform for testing, evaluating, and red-teaming large language models and AI applications, used to identify vulnerabilities before production deployment.