1 article

#sandbagging

Anthropic/MATS/Redwood: Weak Models Can Correct AI Sandbagging

A paper from Anthropic, MATS, and Redwood reports that training with a weak supervisor can stop a capable AI from sandbagging on tasks humans can't evaluate, a scalable-oversight milestone with direct implications for AI safety.

May 6, 2026 · 1 min read
