ChatGPT goes goblin mode — literally

“Model behavior is shaped by many small incentives,” the company wrote. “In this case, one of those incentives came from training the model for the personality customization feature, in particular the Nerdy personality. We unknowingly gave particularly high rewards for metaphors with creatures. From there, the goblins spread.”

OpenAI republished the original instruction to ChatGPT explaining what a “Nerdy” answer should sound like:

You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking. […] You must undercut pretension through playful use of language. The world is complex and strange, and its strangeness must be acknowledged, analyzed, and enjoyed. Tackle weighty subjects without falling into the trap of self-seriousness. […]

Somehow, ChatGPT interpreted this instruction and subsequent “reinforcement learning” iterations to mean it should pepper its responses with references to fantasy creatures.

The issue seemed harmless

...

Keep reading this article on NBC News.