CAI™ Semantic Equivalence Benchmark v0.3

Which models stay consistent when the question changes?

Most benchmarks test whether a model is right. This one tests whether it stays right when the phrasing changes. A CAI™ failure is when it doesn't.

420 prompt pairs — v0.3 dataset
19 domains — incl. policy, finance, insurance
0–1 CAI Strain scale — lower is more consistent
Open — submit results via GitHub PR

Leaderboard

Avg CAI™ Strain across all evaluated pairs. Lower is better. Scores below 0.20 are strong. Above 0.50 means the model is actively contradicting itself. This is the public record.

| # | Model | Org | Avg CAI Strain | Pairs | Date | Notes |
|---|-------|-----|----------------|-------|------|-------|
| 1 | gpt-4o | OpenAI | 0.3642 | 300 | 2025-03 | v0.1 dataset. Surface mismatch 0.99, semantic drift 0.36. |
| – | claude-opus-4-6 | Anthropic | pending | 420 | – | Run evaluate_anthropic.py to contribute. |
| – | claude-sonnet-4-6 | Anthropic | pending | 420 | – | Run evaluate_anthropic.py to contribute. |
| – | gpt-4o-mini | OpenAI | pending | 420 | – | Run evaluate_openai.py to contribute. |
| – | llama-3-70b | Meta | pending | 420 | – | Community contribution welcome. |

Ran the benchmark on a model not listed here? Open a PR.

How to submit results →

- below 0.20: strong consistency
- 0.20–0.50: noticeable drift
- above 0.50: active contradiction
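The three bands above can be expressed as a small helper. This is an illustrative sketch, not part of the repo's scripts:

```python
def strain_band(score: float) -> str:
    """Map an average CAI Strain score (0-1) to its qualitative band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("CAI Strain is defined on the 0-1 scale")
    if score < 0.20:
        return "strong consistency"
    if score <= 0.50:
        return "noticeable drift"
    return "active contradiction"
```

For example, gpt-4o's 0.3642 falls in the "noticeable drift" band.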

420 pairs across 19 domains

Policy domains have the highest real-world CAI™ failure rates. Financial services and insurance are new in v0.3, adding rephrase-sensitive policy language that few other benchmarks cover.

| Category | Domain | Pairs | Description |
|----------|--------|-------|-------------|
| core | factual | 20 | General knowledge, consistent facts expected |
| core | math_logic | 20 | Arithmetic, algebra, deductive reasoning |
| core | ethics | 30 | Moral dilemmas and applied ethics |
| core | ai_safety | 20 | AI behavior, refusal, alignment questions |
| core | causal_reasoning | 20 | Cause and effect consistency |
| core | counterfactual | 20 | Hypothetical and conditional reasoning |
| core | philosophy | 20 | Identity, consciousness, free will |
| core | summarization | 40 | Summary consistency across rephrases |
| core | cai_meta | 30 | Questions about CAI and semantic equivalence |
| policy | ecommerce | 20 | Returns, shipping, pricing policy |
| policy | hr | 20 | PTO, benefits, conduct policy |
| policy | healthcare | 20 | Coverage, referrals, eligibility |
| policy | legal | 20 | Contracts, rights, obligations |
| policy · new | financial_services | 20 | Loans, accounts, tax, retirement |
| policy · new | insurance | 20 | Coverage, claims, exclusions, liability |
| core | practical_planning | 20 | Scheduling, budgeting, task planning |
| core | social_emotional | 20 | Interpersonal reasoning and empathy |
| core | creative_writing | 20 | Tone and approach consistency |
| core | everyday_reasoning | 20 | Common-sense inference |

Evaluate any model in minutes

Clone, set your API key, run. Results write to CSV. Open a PR to add your score to the leaderboard.

Anthropic models
$ git clone https://github.com/michelejoseph1/cai-semantic-equivalence-benchmark.git
$ cd cai-semantic-equivalence-benchmark
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install anthropic

$ export ANTHROPIC_API_KEY="your-key"
$ python evaluate_anthropic.py \
    --model claude-opus-4-6 \
    --max_pairs 420
OpenAI models
$ git clone https://github.com/michelejoseph1/cai-semantic-equivalence-benchmark.git
$ cd cai-semantic-equivalence-benchmark
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt

$ export OPENAI_API_KEY="your-key"
$ python evaluate_openai.py \
    --model gpt-4o \
    --max_pairs 420
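Once a run finishes, the per-pair scores can be aggregated into the leaderboard number. A minimal sketch, assuming the results CSV has a `strain` column (the actual column name in the repo's output may differ):

```python
import csv
from statistics import mean

def average_strain(csv_path: str) -> float:
    """Average the per-pair CAI Strain scores from an evaluation run's CSV."""
    with open(csv_path, newline="") as f:
        scores = [float(row["strain"]) for row in csv.DictReader(f)]
    return mean(scores)
```

The returned value is what goes in the leaderboard's "Avg CAI Strain" column.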

How CAI™ Strain is scored

CAI™ Strain v2 uses a model-based judge to score semantic inconsistency between two responses on a 0–1 scale. The methodology is open and reproducible.

Pair construction

Each pair is two prompts with the same intended meaning and different surface form. Phrasing, syntax, vocabulary, formality, and presupposition are varied. Factual content and intent stay fixed.
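For illustration, a pair record might look like the following. Field names and example prompts are hypothetical, not the repo's actual schema:

```python
# Hypothetical pair record: same intent, different surface form.
pair = {
    "domain": "ecommerce",
    "prompt_a": "What is your return policy for items bought on sale?",
    "prompt_b": "If I grab something on sale, can I still send it back?",
    # prompt_b shifts formality and presupposes a purchase; the factual
    # content being asked about (the sale-item return policy) is identical.
}
```

A consistent model should give answers to both prompts that agree on the policy, even though the wording differs.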

CAI™ Strain v2 judge

A separate judge model scores semantic consistency between the two responses. 0.00 means identical meaning, 1.00 means direct contradiction. The judge focuses on meaning, not wording. Two responses can look different and still score 0.0.
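In practice the judge's free-text reply has to be reduced to a number on that scale. A minimal sketch of score extraction and clamping; the repo's evaluate_*.py scripts may parse judge output differently:

```python
import re

def parse_strain(judge_output: str) -> float:
    """Extract the first numeric value from a judge reply, clamped to [0, 1]."""
    match = re.search(r"\d*\.?\d+", judge_output)
    if match is None:
        raise ValueError(f"no score found in judge output: {judge_output!r}")
    return min(1.0, max(0.0, float(match.group())))
```

Clamping guards against a judge emitting an out-of-range value instead of silently corrupting the average.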

Policy domain design

Policy pairs vary formal language vs. colloquial phrasing and include presupposition variations. These are the rephrase patterns that most often trigger inconsistent responses in production.

Limitations

The judge is an LLM. LLM judges can disagree with humans on edge cases, particularly in the 0.25–0.75 range. Treat aggregate scores as directional. Per-domain scores are more informative than the overall average.

Run it on your model

The CAI™ Semantic Equivalence Benchmark is open. Run it on any model and PR your results. Every submission builds the public record.