Claude Model Performance Expectations Drop 20 Points on Humanity's Last Exam Benchmark

What Happened

A prediction market tracking whether any Anthropic Claude model will score at least 45% on Humanity's Last Exam experienced a significant 20.5 percentage point decline in win odds, falling from 93% to 72.5%. The market moved on elevated volume of $106,380, indicating substantial trader activity and conviction behind the price movement. The shift occurred without public announcement of official test results, suggesting traders are reacting to information about either Claude's actual performance or the difficulty profile of the examination.

Why It Matters

Humanity's Last Exam is a rigorous benchmark designed to test AI systems on complex, open-ended problems requiring deep reasoning across multiple domains. A drop of this magnitude in a high-confidence market reflects meaningful reassessment of Claude's capabilities relative to the test's difficulty. The move from 93% odds—indicating near-certainty of clearing the 45% threshold—to 72.5% suggests traders now view the outcome as genuinely uncertain, a substantial shift in expectations for a model from a leading AI company.

Market Context

The 45% performance threshold represents a meaningful but not elite level of achievement on the benchmark. That Claude traders moved from pricing near-certainty to odds closer to a coin flip implies either new information about the test's difficulty or evidence of Claude's performance gap on specific question categories. The high trading volume indicates this was not a thin-market artifact but rather reflected coordinated reassessment across multiple traders.

Outlook

Markets will likely remain volatile until official Humanity's Last Exam leaderboard results become available. The June 30, 2026 resolution deadline provides time for results to materialize. If Claude achieves at least 45% when scores are posted, the market would likely swing sharply back toward higher probabilities. The current 72.5% odds suggest traders view the threshold as achievable but no longer nearly certain, positioning the market to provide meaningful information about Claude's performance on complex reasoning tasks when results are published.