In healthcare, finance, and legal, that is a liability, and standard eval tools miss it. Contradish names the failure, measures it, and ships the fix in one command. The more you run it, the harder your model is to break.
test any model in minutes · CLI, Python API, GitHub Action
A CAI failure isn't a hallucination. The facts check out in each response. They just contradict each other when the framing shifts. The only signal is the contradiction itself.
Every run produces a structured grid of where your model holds and breaks. Most tools aggregate it to a number. Contradish mines the grid for findings: one specific, surprising sentence about your model. You don't have to know what to look for.
Five detectors mine every run: rigidity, root cause, stability reframe, severity concentration, type concentration. Each fires only when the evidence supports it; the design contract is no false findings. Re-mine any saved result with contradish findings results/gpt-4o.json.
Other tools find the contradiction and stop. Contradish keeps going. You get the cause, the prompt patch, and a fine-tuning pair you can drop straight into your training pipeline.
A benchmark you can download is not a moat. Anyone can copy it. The benchmark that contradish grows from your own traffic is a moat. When your model contradicts itself in production, contradish reconciles the transcript against your suite, turns the break into a new adversarial case, and re-runs the repair loop against it. The suite gets harder every week, shaped by the failures your users actually hit.
Two teams running contradish for six months do not end up with the same benchmark. Each holds a private corpus of the exact ways its own model breaks, plus the fine-tuning pairs that fixed each one. That corpus lives in the loop, which is why the model gets measurably harder to break the longer the loop runs, and why leaving means giving the corpus up.
A score is a snapshot. The real question is whether a model keeps its commitments over a long conversation, across sessions, and as it changes. Contradish keeps the record: every commitment a model makes and every contradiction it produces, in order, written to a hash-chained ledger. Alter or drop any past entry and the chain breaks, so an outside party can verify the record was not quietly rewritten.
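To make the hash-chain idea concrete, here is a minimal sketch of how such a ledger could work. It is an illustration under stated assumptions, not contradish's actual format: the entry fields (`prev`, `payload`, `hash`), the `GENESIS` value, and the helper names are all hypothetical, and SHA-256 over canonical JSON is simply one reasonable choice.

```python
import hashlib
import json

# Hypothetical placeholder used as the "previous hash" of the first entry.
GENESIS = "0" * 64


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash a payload together with the hash of the entry before it."""
    # Canonical JSON so the same payload always serialises the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger: list, payload: dict) -> None:
    """Append a payload, linking it to the current tail of the chain."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({
        "prev": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    })


def verify(ledger: list) -> bool:
    """Recompute every link; an altered or removed past entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash covers the previous one, editing or removing an entry in the middle invalidates every later link. Truncating the tail is only detectable if the verifier already holds the latest hash, so in a scheme like this the head hash would need to be published or shared with the outside party.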
Built for stateful agents. A balance going from $100 to $50 is a withdrawal, not a contradiction, so contradish separates durable facts (a policy, a rule) from volatile state (a balance, a status) and flags only a model that breaks a commitment it should have kept. The chain proves the integrity of the record, not the truthfulness of the model.
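The durable-versus-volatile split can be sketched as follows. This is an illustrative assumption about the logic, not contradish's implementation: the `Claim` type, the `kind` labels, and `find_broken_commitments` are hypothetical names, and real claims would need to be extracted from model output first.

```python
from dataclasses import dataclass

DURABLE = "durable"    # e.g. a policy or a rule: should not change silently
VOLATILE = "volatile"  # e.g. a balance or a status: expected to change


@dataclass(frozen=True)
class Claim:
    key: str    # what the claim is about, e.g. "refund_window"
    value: str  # what the model said, e.g. "30 days"
    kind: str   # DURABLE or VOLATILE


def find_broken_commitments(claims):
    """Return (key, old, new) for each durable claim whose value changed."""
    seen = {}
    broken = []
    for claim in claims:
        previous = seen.get(claim.key)
        changed = previous is not None and previous != claim.value
        # A changed volatile value is a state update, not a contradiction.
        if changed and claim.kind == DURABLE:
            broken.append((claim.key, previous, claim.value))
        seen[claim.key] = claim.value
    return broken
```

With a transcript where the balance moves from $100 to $50 and the refund window moves from 30 days to 14 days, only the refund window is flagged.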
Every consistency benchmark treats all output divergence as failure. But a model that never moves isn't the goal. It's a lookup table. On a genuinely tensioned question, a model that flatly takes one side is failing, no matter how consistently. Judgment Strain is two-sided: drift counts against a model where it should have held firm, rigidity where it should have moved. Every case is typed (adversarial, real-world tension, or representational) and scored against what the correct response looks like. Equivalence is audited per case rather than asserted, with an expert pass rolling out per domain. The headline number reflects model failure, not the benchmark designer's framing.
A CAI failure is a contradiction between two paraphrases of the same question. ML literature calls this drift. We define it formally and score it.
Judgment Strain is the headline score: 0–1, lower is better. On adversarial cases it punishes drift (the model should hold firm). On real-world tension cases it punishes rigidity (the model should name both sides). On representational cases it punishes inheriting a bad premise (the model should reframe). A model can't game it by becoming inflexible.
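A rough sketch of the two-sided idea, to show why inflexibility does not lower the score. The exact formula is not given here, so everything below is an assumption made for illustration: the per-case failure rates, the field names, and the unweighted mean are hypothetical.

```python
from statistics import mean

# Which failure each case type penalises, following the description above.
PENALISED_FAILURE = {
    "adversarial": "drift",                   # should have held firm
    "tension": "rigidity",                    # should have named both sides
    "representational": "inherited_premise",  # should have reframed
}


def case_strain(case_type: str, failure_rates: dict) -> float:
    """Penalty for one case: the rate of the failure its type punishes."""
    rate = failure_rates.get(PENALISED_FAILURE[case_type], 0.0)
    return min(1.0, max(0.0, rate))  # clamp to the 0-1 range


def judgment_strain(cases: list) -> float:
    """Unweighted mean of per-case penalties (the averaging is an assumption)."""
    return mean(case_strain(c["type"], c["rates"]) for c in cases)
```

Under these assumptions, a model that never drifts but always takes one side scores 0.0 on an adversarial case and 1.0 on a tension case, so its overall strain is 0.5 rather than 0.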
CAI Strain is the consistency-only component, reported alongside: headline_strain over expert-confirmed equivalences (EQ ≥ 0.80), contested_strain where annotators disagreed. rigidity_strain isolates the tension cases: the failure mode pure consistency scoring is blind to.
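One way to picture how the three reported numbers partition the cases is sketched below. Only the EQ ≥ 0.80 threshold comes from the text above. The case fields, the use of a low EQ score as a stand-in for annotator disagreement, and the choice to keep tension cases out of the consistency buckets are all assumptions made for the example.

```python
from statistics import mean

EQ_THRESHOLD = 0.80  # from the text: expert-confirmed equivalence is EQ >= 0.80


def _mean_or_none(values):
    """Mean of a list, or None when the bucket is empty."""
    return mean(values) if values else None


def cai_strain_report(cases: list) -> dict:
    """Split per-case strain into the three reported buckets (hypothetical schema)."""
    tension = [c["strain"] for c in cases if c["type"] == "tension"]
    others = [c for c in cases if c["type"] != "tension"]
    confirmed = [c["strain"] for c in others if c["eq"] >= EQ_THRESHOLD]
    # Assumption: a low EQ score stands in for "annotators disagreed".
    contested = [c["strain"] for c in others if c["eq"] < EQ_THRESHOLD]
    return {
        "headline_strain": _mean_or_none(confirmed),
        "contested_strain": _mean_or_none(contested),
        "rigidity_strain": _mean_or_none(tension),
    }
```

Reporting the contested bucket separately keeps disputed paraphrase pairs from inflating or deflating the headline figure.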
The benchmark is public. 2,160 strain tests. 20 high-stakes domains. Independently judged. Per-case equivalence audit in progress. Open submissions.
Most tools measure one thing: does the model contradict itself across paraphrases. Contradish measures three, and each one catches a failure the others are blind to.
The eval space is crowded and the good tools are genuinely good. Here is the honest map of where each one stops, and why the gap is structural, not a missing feature.
The defensible position is not owning a loop. It is owning the definition of what a correct, stable answer is, the benchmark the field measures against, and the verifiable record of how a model behaved over time.
Shipping AI in production and want hosted runs, a shared dashboard, CI gates, or compliance reports? contradish Cloud is in development. Join the waitlist.
Two minutes to a score. One more command to a repaired model.
contradish improve runs the benchmark, rewrites your prompt, re-runs, and returns the diff. The artifact is an improved prompt, ready for your config. Anthropic models work the same way: export ANTHROPIC_API_KEY=sk-ant-... && contradish improve --policy medication --model claude-sonnet-4-6