OpenAI Publishes GPT-5.1 'Goblin' Personality Artifact Post-Mortem
OpenAI has published a transparent post-mortem on the goblin-like behavior that appeared alongside the GPT-5.1 launch. The root cause was a "nerdy personality" training configuration in Codex that over-rewarded mentions of goblins and other magical content, and the effect was reinforced across successive model generations. The fix removes the affine reward signal and filters out training data in which creature-themed text appeared in irrelevant contexts; it has been incorporated into the training of future models.
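To make the data-filtering half of the fix concrete, the sketch below shows one simple way such a filter could work. It is purely illustrative and is not OpenAI's pipeline. The creature term list, the "fantasy cue" list, the idea of judging relevance from the prompt alone, and the function names `is_off_context` and `filter_samples` are all hypothetical assumptions.

```python
import re

# Hypothetical creature vocabulary; NOT taken from OpenAI's post-mortem.
CREATURE_TERMS = re.compile(
    r"\b(goblins?|trolls?|gremlins?|wizards?|spells?)\b", re.IGNORECASE
)
# Hypothetical cues suggesting creature talk is relevant to the request.
FANTASY_CUES = re.compile(
    r"\b(fantasy|story|dungeon|fairy tale|game)\b", re.IGNORECASE
)


def is_off_context(prompt: str, response: str) -> bool:
    """Return True if the response mentions creatures but the prompt
    contains none of the (assumed) fantasy cues."""
    return bool(CREATURE_TERMS.search(response)) and not FANTASY_CUES.search(prompt)


def filter_samples(samples: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only (prompt, response) pairs that are not off-context
    creature mentions."""
    return [(p, r) for p, r in samples if not is_off_context(p, r)]


if __name__ == "__main__":
    data = [
        ("Fix this Python bug", "A goblin lurks in your loop: use range(n)."),
        ("Write a fantasy story", "The goblin crept into the cave."),
        ("Fix this Python bug", "Use range(n) instead."),
    ]
    for prompt, response in filter_samples(data):
        print(prompt, "->", response)
```

A keyword filter like this is crude: it misses paraphrases and can drop legitimate text. A production pipeline would more plausibly use a trained classifier to judge relevance, but the keep/drop logic would have the same shape.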
Why It Matters
This level of public training-artifact disclosure is rare; it sets a precedent for model-behavior transparency and offers a concrete case study for alignment practitioners. The behavior can reportedly still be triggered inside Codex.