AI Benchmark Tracker

Displacement Signal: Elevated

AI exceeds human performance on 9+ tracked benchmarks. SWE-bench Verified stands at 93.9% (Claude Mythos Preview), and Opus 4.7 set a new verified SOTA at 87.6% (Apr 16). GPQA Diamond is at 94.5% (Claude Mythos Preview), nearly 30 points above PhD experts, with Opus 4.7 at 94.2%. METR shows AI can work 14.5+ hours autonomously (honest eval). FrontierMath crossed 50%, with GPT-5.4 Pro at 50.0%. ARC-AGI-2 is at 77.1%, about 8 points short of human level (85%). ARC-AGI-3 launched March 25 as a total format reset: the best agent scores 12.58%, frontier LLMs score under 1%, and humans score 100%. GDPval is at 83% (GPT-5.4). On SWE-bench Pro, Claude Opus 4.7 now leads at 64.3% (Apr 16). On LMSYS, Opus 4.7 Thinking and Gemini 3.1 Pro are tied at 1505 Elo.

AWG Signal — Dr. Alex Wissner-Gross, The Innermost Loop

AWG's core thesis: 40x year-over-year cost deflation of intelligence is the most critical signal. April 17 issue: "The Singularity now ships on a schedule." Opus 4.7 was released Apr 16 as the decimal between Opus 4.6 and Mythos Preview. Nearly a third of Anthropic staff expect Mythos to replace entry-level engineers and researchers within 3 months. The White House OMB is routing Mythos into federal agencies "in coming weeks." UK AISI: Mythos solved 73% of expert-level CTF tasks and completed a 32-step corporate network attack in 3 of 10 attempts, the first model to do so. OpenAI unveiled GPT-Rosalind (biology/drug discovery). GPT-5.4 Pro solved Erdős Problem #1196 with a proof Lichtman called "from The Book." Cerebras is filing for a $35B+ IPO. April 16: federal agencies are quietly sidestepping the Mythos ban for cyber defense. Hyperscaler capex has surpassed the inflation-adjusted Apollo Program, Interstate System, and Marshall Plan combined. NIST is restructuring CVE handling after an AI-driven 263% spike in vulnerability reports since 2020.
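
To make the 40x year-over-year figure concrete, here is a minimal arithmetic sketch. It assumes the deflation rate is constant and compounds smoothly, which the source does not claim; the function names and the sample starting cost are hypothetical and chosen only for illustration.

```python
def cost_after(initial_cost: float, years: float, annual_factor: float = 40.0) -> float:
    """Cost remaining after `years` if cost falls by `annual_factor` every year.

    Assumes a constant, smoothly compounding rate (an illustrative assumption).
    """
    return initial_cost / (annual_factor ** years)


def equivalent_monthly_factor(annual_factor: float = 40.0) -> float:
    """Monthly deflation factor that compounds to `annual_factor` over 12 months."""
    return annual_factor ** (1.0 / 12.0)


if __name__ == "__main__":
    start = 1600.0  # hypothetical starting cost, arbitrary units
    for years in (0, 1, 2):
        print(f"after {years} year(s): {cost_after(start, years):.2f}")
    print(f"equivalent monthly factor: {equivalent_monthly_factor():.3f}")
```

Under these assumptions, a 40x annual decline works out to roughly a 1.36x decline per month, and two years of it compounds to a 1,600x decline.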

AWG's Signal Benchmarks: The Innermost Loop

AGI Progress: Moonshots Favorites

Coding & Math: White-Collar Displacement

Reasoning & Agents: Autonomous Work

Job Displacement: Economic Impact
