When the Panel Splits 4 Points: Stripe, Datadog, Figma & Perplexity

May 10, 20265 min readMethodology

Six AI personas, one product, and a 4.5-point gap between the highest and lowest score. The disagreement isn't a bug — it's the most important signal on the page. Four case studies in why.

The Domain Strategist gave Stripe a 9.2. The Skeptic gave it a 4.5. Same product, same week, same review cycle.

People keep asking us if the panel is broken. It isn't. The 4.7-point spread is the review.

The disagreement is the signal

Across the products our panel has rated — 330 of them so far — eight have a panel disagreement of four points or more. Nothing about the products is incoherent: Stripe is Stripe, Figma is Figma, Datadog is Datadog. The disagreement isn't about whether the product works. It's about who the product works for.

When you read it that way, the spread becomes the most useful number on the review page. Four cases worth walking through.

Case 1 — Stripe (spread 4.7, avg 7.6)

Persona	Score
The Domain Strategist	9.2
The Power User	8.5
The Decision Maker	8.5
The Domain Practitioner	7.8
The Finance Lead	7.2
The Skeptic	4.5

The Domain Strategist scored 9.2. Stripe is, structurally, the cleanest API design in fintech. The docs are reference-quality. The webhooks just work. The Strategist is right.

The Skeptic scored 4.5. Stripe takes 2.9% + $0.30 of every transaction. The dispute system is opaque. Once you're in deep, the migration cost is brutal. The Skeptic is also right.

Both reviews are correct because they're asked different questions. The Strategist asks is this well-built? The Skeptic asks what does this cost you when it goes wrong? If you read only one, you'd get half of Stripe.

The buying decision lives in the gap. If you're a high-velocity team launching now, follow the Strategist. If you're processing $50M ARR and re-evaluating, follow the Skeptic. The averaged 7.6 is meaningless to either of you.

Case 2 — Datadog (spread 4.7, avg 6.3)

The Datadog spread isn't about the product — it's about the bill. Three reviewers scored above 7. Three scored 6.5 or below, including a 3.5 from a developer-perspective reviewer who's seen one too many surprise invoices.

The lower scores aren't wrong; they're future-tense. Datadog scores beautifully when your usage is small and ugly when you're locked into a year of reserved capacity at scale. If you're early-stage, you'd score it 8. If you've been on it for two years and just got a renewal quote with a 70% increase, you'd score it 4. The panel captured both timestamps simultaneously.

Case 3 — Figma (spread 4.2, avg 7.7)

Persona	Score
The Domain Strategist	8.7
The Decision Maker	8.5
The Domain Practitioner	8.5
The Power User	8.2
The Finance Lead	7.5
The Skeptic	4.5

Figma is the rare case where five out of six reviewers cluster within a single point. The Skeptic dragged the average down with a 4.5, which on Figma is almost certainly a comment on the Adobe acquisition timeline more than the product itself.

Read the disagreement: most of the panel agrees Figma is excellent. One reviewer is pricing in regulatory and product-direction risk. If you don't share that risk thesis — if you just need a design tool for the next 18 months — the spread doesn't apply to you. Take the 8.5 average among the other five and move on.

This is the case for reading per-persona scores instead of relying on the headline number. The headline says 7.7. The room actually said 8.5 with one dissent.

Case 4 — Perplexity AI (spread 4.0, avg 7.5)

Perplexity is the most interesting case because the spread isn't about price or strategy — it's about epistemic trust.

The Power User and Finance Lead both scored 8.5. The Skeptic scored 4.5. The Skeptic's review is, almost word-for-word, "the citations look authoritative but the sourcing is sometimes wrong, and a wrong answer that looks sourced is worse than no answer at all."

You can't average that disagreement out. The Power User is using Perplexity to answer 50 quick questions a day faster. The Skeptic is worrying about the one question whose wrong-but-sourced answer ends up in a brief filed with regulators. Both are reviewing the same product. They're using it to do entirely different jobs.

If your job looks like the Power User's, Perplexity is a 8.5. If it looks like the Skeptic's, it's a 4.5. The 7.5 average is a fiction averaged from two real numbers.

How to read a split panel

When you see a panel score of 7.5 with a tight standard deviation, the product is probably what it looks like.

When you see a panel score of 7.5 with a 4-point spread, read each review individually. Find the persona whose objection looks most like your situation. That's the one you should pay attention to.

Spread is signal. Average is just the place we put the dot.

The panel was built to disagree out loud. The summary score is the scaffold. The disagreement is the wood.

panel reviewsmethodologydisagreementsoftware evaluationstripefigmadatadogperplexity

When the Panel Splits 4 Points: Stripe, Datadog, Figma & Perplexity

The disagreement is the signal

Case 1 — Stripe (spread 4.7, avg 7.6)

Case 2 — Datadog (spread 4.7, avg 6.3)

Case 3 — Figma (spread 4.2, avg 7.7)

Case 4 — Perplexity AI (spread 4.0, avg 7.5)

How to read a split panel

Discussion

Author

Recent Posts

The Free-Tier Premium: Why Our Highest-Scored AI Tools Cost the Least

Why Hidden-Pricing Software Hits an 8.15 Ceiling on Our Review Panel

More from the Blog