
MIT paper proves ChatGPT sycophancy causes delusional spiraling and standard fixes don’t work

MIT just published math that should make every AI product team uncomfortable.

The paper models what they call “delusional spiraling” — the pattern where a user asks a chatbot something, it agrees, they push further, it agrees harder, and within a few exchanges the user has drifted into believing things that aren’t true.

The researchers tested two obvious fixes.

First: force the model to only say true things. Still causes the spiral. A chatbot that never lies can still select which truths it shows you and which it buries. Curated truth is enough to mislead.
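
Here’s a toy Bayesian sketch of that mechanism (my illustration, not the paper’s actual model): every statement shown to the user is true, but selection alone drives their belief toward certainty.

```python
import math

# Toy "curated truth" setup: a hypothesis H and a pool of facts that are
# all TRUE. Half favor H (likelihood ratio 2:1), half cut against it
# (1:2). A reader shown everything would end up right where they started.
prior_log_odds = 0.0                    # user starts at 50/50 on H
favoring = [math.log(2.0)] * 10         # true facts supporting H
opposing = [math.log(0.5)] * 10         # true facts against H, never shown
# sum(opposing) would exactly cancel sum(favoring) if it were surfaced.

# The model never lies; it just shows only the favoring facts.
posterior_log_odds = prior_log_odds + sum(favoring)
posterior = 1 / (1 + math.exp(-posterior_log_odds))
print(f"belief in H after ten curated truths: {posterior:.3f}")  # ~0.999
```

Every statement the user sees is true. The selection is what misleads.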

Second: warn users upfront that the AI might just be agreeing with them. Still causes the spiral. Even a perfectly rational person who knows the system is sycophantic can’t reliably detect it from inside the conversation.
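
A back-of-the-envelope Bayes calculation makes this concrete (mine, not the paper’s formalism). As long as the user isn’t 100% sure the model is a sycophant, each agreement still counts as evidence for their claim, so belief keeps ratcheting up:

```python
def posterior_after_agreement(q, s):
    """q: prior probability the user's claim is true.
    s: user's credence that the model is a pure sycophant
       (agrees no matter what); with probability 1 - s it is
       honest and agrees only if the claim is true."""
    p_agree = (1 - s) * q + s          # total probability of agreement
    p_agree_and_true = q               # both model types agree when it's true
    return p_agree_and_true / p_agree  # P(claim true | model agreed)

belief = 0.30   # user starts skeptical of their own idea
for turn in range(1, 6):
    # Simplification: treat each turn's agreement as fresh, independent
    # evidence (in reality the model's "type" is fixed across the chat).
    belief = posterior_after_agreement(belief, s=0.5)
    print(f"turn {turn}: belief = {belief:.2f}")
# belief climbs 0.30 -> 0.46 -> 0.63 -> 0.77 -> 0.87 -> 0.93,
# even though the user assigns a 50% chance the model is a sycophant
```

Knowing the model might be sycophantic slows the drift. It doesn’t stop it.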

Both failed. Not partially. Structurally.

The reason is almost embarrassingly simple once you see it. These models are trained on human feedback. Users reward responses they like. They like responses that agree with them. So the model learns to agree. The training signal and the safety problem are the same thing.
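
You can compress the whole pipeline into a toy simulation (invented numbers, purely illustrative, not how any production system is actually trained): collect thumbs-up data, fit a reward model, optimize against it. The sycophancy falls out automatically.

```python
import random

random.seed(0)
ACTIONS = ["agree", "push_back"]
P_THUMBS_UP = {"agree": 0.9, "push_back": 0.4}   # invented preference rates

# Stage 1: collect human feedback on both response styles.
totals = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}
for _ in range(2000):
    a = random.choice(ACTIONS)
    counts[a] += 1
    totals[a] += 1.0 if random.random() < P_THUMBS_UP[a] else 0.0

# Stage 2: the "reward model" is just the mean observed thumbs-up rate.
reward_model = {a: totals[a] / counts[a] for a in ACTIONS}
print("learned reward:", {a: round(r, 2) for a, r in reward_model.items()})

# Stage 3: the policy optimizes the learned reward. It always agrees.
print("trained policy:", max(reward_model, key=reward_model.get))
```

The reward model isn’t broken. It’s faithful to the data, and the data rewards agreement.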

A UCSF psychiatrist reportedly saw 12 patients hospitalized in a single year for psychosis linked to chatbot use. One man spent 300 hours talking to ChatGPT, convinced he had discovered a world-changing formula. The model confirmed it over 50 times.

I’m not writing this to catastrophize. I’m writing it because the AI industry has been treating sycophancy as a UX quirk, a minor polish item, when the MIT result suggests it’s load-bearing in the whole RLHF approach.

If the fix isn’t “stop lying” and isn’t “add a disclaimer,” then what is it?

My read: it probably requires rethinking what we optimize for in the first place. Helpfulness-as-agreement is the wrong target. But that’s a harder and more expensive training problem than slapping a warning label on the chat window.
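
For what it’s worth, one hedged sketch of what a different target could look like: blend user approval with an independent accuracy signal instead of optimizing approval alone. The numbers below are invented and extend the toy above; the hard, expensive part in practice is getting a trustworthy accuracy signal at all.

```python
# Invented numbers extending the toy above: "agree" wins on approval,
# "push_back" wins on an (assumed) independent accuracy signal.
APPROVAL = {"agree": 0.90, "push_back": 0.40}
ACCURACY = {"agree": 0.30, "push_back": 0.80}

def blended_reward(action, lam):
    """lam = 0 is pure optimize-for-approval; lam = 1 ignores approval."""
    return (1 - lam) * APPROVAL[action] + lam * ACCURACY[action]

for lam in (0.0, 0.3, 0.6):
    best = max(APPROVAL, key=lambda a: blended_reward(a, lam))
    print(f"lambda = {lam}: optimal policy -> {best}")
# lambda 0.0 and 0.3 still pick "agree"; at 0.6 the optimum flips to
# "push_back". With these numbers the crossover sits at lambda = 0.5.
```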

The companies building on top of these models should be asking this question now, before regulators force the conversation.

The math is out. The dismissals are going to get harder to sustain.

