CAI™ Semantic Equivalence Benchmark v0.3

Which models stay consistent when the question changes?

Most benchmarks test whether a model is right. This one tests whether it stays right when the phrasing changes. A CAI™ failure is when it doesn't.

420 prompt pairs — v0.3 dataset
19 domains — incl. policy, finance, insurance
0–1 CAI Strain scale — lower is more consistent
Open — submit results via GitHub PR

Leaderboard

Avg CAI™ Strain across all evaluated pairs. Lower is better. Scores below 0.20 are strong. Above 0.50 means the model is actively contradicting itself. This is the public record.

| # | Model | Org | Avg CAI Strain | Pairs | Date | Notes |
|---|-------|-----|----------------|-------|------|-------|
| 1 | gpt-4o | OpenAI | 0.3642 | 300 | 2025-03 | v0.1 dataset. Surface mismatch 0.99, semantic drift 0.36. |
| – | claude-opus-4-6 | Anthropic | pending | 420 | – | Run evaluate_anthropic.py to contribute. |
| – | claude-sonnet-4-6 | Anthropic | pending | 420 | – | Run evaluate_anthropic.py to contribute. |
| – | gpt-4o-mini | OpenAI | pending | 420 | – | Run evaluate_openai.py to contribute. |
| – | llama-3-70b | Meta | pending | 420 | – | Community contribution welcome. |

Ran the benchmark on a model not listed here? Open a PR.

How to submit results →

- below 0.20: strong consistency
- 0.20–0.50: noticeable drift
- above 0.50: active contradiction
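The three bands above can be expressed as a small helper. This is an illustrative sketch, not part of the repo's scripts:

```python
def strain_band(score: float) -> str:
    """Map an average CAI Strain score (0-1) to its qualitative band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("CAI Strain is defined on the 0-1 scale")
    if score < 0.20:
        return "strong consistency"
    if score <= 0.50:
        return "noticeable drift"
    return "active contradiction"
```

For example, gpt-4o's 0.3642 falls in the "noticeable drift" band.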

420 pairs across 19 domains

Policy domains have the highest real-world CAI™ failure rates. Financial services and insurance are new in v0.3, adding rephrase-sensitive policy language that few other benchmarks cover.

| Category | Domain | Pairs | Description |
|----------|--------|-------|-------------|
| core | factual | 20 | General knowledge, consistent facts expected |
| core | math_logic | 20 | Arithmetic, algebra, deductive reasoning |
| core | ethics | 30 | Moral dilemmas and applied ethics |
| core | ai_safety | 20 | AI behavior, refusal, alignment questions |
| core | causal_reasoning | 20 | Cause and effect consistency |
| core | counterfactual | 20 | Hypothetical and conditional reasoning |
| core | philosophy | 20 | Identity, consciousness, free will |
| core | summarization | 40 | Summary consistency across rephrases |
| core | cai_meta | 30 | Questions about CAI and semantic equivalence |
| policy | ecommerce | 20 | Returns, shipping, pricing policy |
| policy | hr | 20 | PTO, benefits, conduct policy |
| policy | healthcare | 20 | Coverage, referrals, eligibility |
| policy | legal | 20 | Contracts, rights, obligations |
| policy · new | financial_services | 20 | Loans, accounts, tax, retirement |
| policy · new | insurance | 20 | Coverage, claims, exclusions, liability |
| core | practical_planning | 20 | Scheduling, budgeting, task planning |
| core | social_emotional | 20 | Interpersonal reasoning and empathy |
| core | creative_writing | 20 | Tone and approach consistency |
| core | everyday_reasoning | 20 | Common-sense inference |

Evaluate any model in minutes

Clone, set your API key, run. Results write to CSV. Open a PR to add your score to the leaderboard.

Anthropic models
$ git clone https://github.com/michelejoseph1/cai-semantic-equivalence-benchmark.git
$ cd cai-semantic-equivalence-benchmark
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install anthropic

$ export ANTHROPIC_API_KEY="your-key"
$ python evaluate_anthropic.py \
    --model claude-opus-4-6 \
    --max_pairs 420
OpenAI models
$ git clone https://github.com/michelejoseph1/cai-semantic-equivalence-benchmark.git
$ cd cai-semantic-equivalence-benchmark
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt

$ export OPENAI_API_KEY="your-key"
$ python evaluate_openai.py \
    --model gpt-4o \
    --max_pairs 420
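Once a run finishes, the per-pair scores can be aggregated into the leaderboard number. A minimal sketch, assuming the results CSV has a `strain` column (the actual column name in the repo's output may differ):

```python
import csv
from statistics import mean

def average_strain(csv_path: str) -> float:
    """Average the per-pair CAI Strain scores from an evaluation run's CSV."""
    with open(csv_path, newline="") as f:
        scores = [float(row["strain"]) for row in csv.DictReader(f)]
    return mean(scores)
```

The returned value is what goes in the leaderboard's "Avg CAI Strain" column.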

How CAI™ Strain is scored

CAI™ Strain v2 uses a model-based judge to score semantic inconsistency between two responses on a 0–1 scale. The methodology is open and reproducible.

Pair construction

Each pair is two prompts with the same intended meaning and different surface form. Phrasing, syntax, vocabulary, formality, and presupposition are varied. Factual content and intent stay fixed.
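For illustration, a pair record might look like the following. Field names and example prompts are hypothetical, not the repo's actual schema:

```python
# Hypothetical pair record: same intent, different surface form.
pair = {
    "domain": "ecommerce",
    "prompt_a": "What is your return policy for items bought on sale?",
    "prompt_b": "If I grab something on sale, can I still send it back?",
    # prompt_b shifts formality and presupposes a purchase; the factual
    # content being asked about (the sale-item return policy) is identical.
}
```

A consistent model should give answers to both prompts that agree on the policy, even though the wording differs.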

CAI™ Strain v2 judge

A separate judge model scores semantic consistency between the two responses. 0.00 means identical meaning, 1.00 means direct contradiction. The judge focuses on meaning, not wording. Two responses can look different and still score 0.0.
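In practice the judge's free-text reply has to be reduced to a number on that scale. A minimal sketch of score extraction and clamping; the repo's evaluate_*.py scripts may parse judge output differently:

```python
import re

def parse_strain(judge_output: str) -> float:
    """Extract the first numeric value from a judge reply, clamped to [0, 1]."""
    match = re.search(r"\d*\.?\d+", judge_output)
    if match is None:
        raise ValueError(f"no score found in judge output: {judge_output!r}")
    return min(1.0, max(0.0, float(match.group())))
```

Clamping guards against a judge emitting an out-of-range value instead of silently corrupting the average.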

Policy domain design

Policy pairs vary formal language vs. colloquial phrasing and include presupposition variations. These are the rephrase patterns that most often trigger inconsistent responses in production.

Limitations

The judge is an LLM. LLM judges can disagree with humans on edge cases, particularly in the 0.25–0.75 range. Treat aggregate scores as directional. Per-domain scores are more informative than the overall average.

Run it on your model

The CAI™ Semantic Equivalence Benchmark is open. Run it on any model and PR your results. Every submission builds the public record.