Research · Breaking
Anthropic/MATS/Redwood: Weak Models Can Correct AI Sandbagging
A paper from Anthropic, MATS, and Redwood reports that training with a weak supervisor stops capable AI models from sandbagging on tasks humans can't evaluate, a scalable oversight milestone with direct implications for AI safety.
May 6, 2026 · 1 min read