Claude Mythos: 16-Hour METR Horizon, 3-Week Palo Alto Validation
Two independent data points sharpen the capability picture for Claude Mythos Preview. METR evaluated an early snapshot in March 2026 and measured a 50%-time-horizon of at least 16 hours—over 2× the next-best model. Palo Alto Networks separately reported three weeks of Mythos-assisted penetration testing matched a full year of manual analysis with broader coverage. Both results went public May 8–9, giving the model's security and autonomy claims documented third-party backing.
What the Source Actually Says
METR's assessment used an early Mythos Preview snapshot Anthropic provided for risk evaluation during a limited window. The "50%-time-horizon" (the task duration at which the model succeeds half the time) came in at 16 hours or more, with a 95% confidence interval of 8.5 to 55 hours. METR noted this places the model "at the upper end of what we can measure without new tasks," signaling that existing task suites are themselves becoming the constraint. Anthropic's Alex Albert framed the result as 2× the next-best model's horizon on METR's 80% success-rate benchmark.
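For intuition on the metric: a time-horizon is typically read off a curve of success probability against task length, fit on a log scale, and the 50% horizon is where that curve crosses even odds. The sketch below is a minimal, hypothetical illustration, not METR's actual pipeline; the task records and the choice of scikit-learn are assumptions made purely for demonstration.

```python
# Minimal sketch: estimating a 50%-time-horizon from per-task results.
# Assumption: each record pairs a task's human-expert completion time (hours)
# with whether the model succeeded. The data below is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (human_hours, model_succeeded) -- hypothetical evaluation records
records = [
    (0.25, 1), (0.5, 1), (1, 1), (2, 1), (4, 1), (8, 1),
    (8, 0), (16, 1), (16, 0), (32, 0), (64, 0), (64, 0),
]
hours = np.array([h for h, _ in records], dtype=float)
success = np.array([s for _, s in records])

# Fit success probability against log2(task length), the usual horizon-style setup.
X = np.log2(hours).reshape(-1, 1)
clf = LogisticRegression(C=1e6)  # near-unregularized so the fit tracks the data
clf.fit(X, success)

# The 50% horizon is where the fitted probability crosses 0.5:
# sigmoid(coef * log2(t) + intercept) = 0.5  =>  log2(t) = -intercept / coef
log2_horizon = -clf.intercept_[0] / clf.coef_[0][0]
print(f"Estimated 50%-time-horizon: {2 ** log2_horizon:.1f} hours")
```

On this toy data the crossover lands near the point where successes give way to failures; with real evaluations the same fit is bootstrapped over tasks to produce the kind of 8.5–55 hour confidence interval METR reports.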
On the security side, Palo Alto Networks published results showing three weeks of Mythos-assisted analysis produced output equivalent to a full year of manual penetration testing, with broader coverage. The claim was independently amplified by @alexalbert__, @techsnif, and @emollick in two separate observer batches on May 8–9. Ethan Mollick (Wharton) offered a precise framing: the outsider claim that "Mythos couldn't find zero-day exploits" was factually wrong, while the insider critique that it wasn't "a magical step-change in AI ability" likely holds. Gary Marcus added that on EpochAI Research's ECI benchmark, Mythos is "not hugely better than other models": stronger at bug-finding, but with hallucinations and reasoning errors still present.
Strategic Take
The 16-hour 50%-time-horizon puts tasks roughly two human workdays long within the model's autonomous reach, though at even odds of success rather than reliably; that still matters for teams evaluating long-running security audits, code review pipelines, or research workflows. The Palo Alto efficiency ratio of roughly 17× (three weeks of Mythos-assisted work matching a year of manual pen testing) is a documented ROI claim, not benchmark marketing. Domain-specific strength at security and code does not generalize automatically; scope agentic deployments accordingly.

