Sycophancy (artificial intelligence)

In the field of artificial intelligence, sycophancy is a tendency of large language models (LLMs) and other AI assistants to tailor their responses to what they predict the user wants to hear rather than to what is accurate or warranted.[1][2] The behavior takes several forms: an assistant may agree with a user's stated opinion even when the user is mistaken; it may abandon a correct answer after a challenge such as "are you sure?"; it may validate beliefs, decisions or self-presentation regardless of merit; or it may praise the user, their work or their ideas in unwarranted terms.[1] The word is borrowed from the ordinary English term for fawning flattery, and is used in AI alignment and AI safety research to describe a class of misalignment failures associated with training on human feedback.[3]

Researchers at Anthropic first documented the behavior systematically in 2022. They found that models fine-tuned with reinforcement learning from human feedback (RLHF) were more likely than untuned models to repeat back a user's preferred answer.[4] A 2023 follow-up paper, "Towards Understanding Sycophancy in Language Models", showed that five frontier assistants from OpenAI, Anthropic and Meta all exhibited the behavior, and traced its origin to biases in the human preference data used during training.[1] Later work documented sycophancy in mathematics, medicine, academic peer review and other domains, and identified a broader category called "social sycophancy" affecting an assistant's emotional and interpersonal responses.[5][6][7]

The issue drew widespread public attention in April 2025 after OpenAI rolled back an update to its GPT-4o model. Users had reported that the assistant praised dangerous decisions, endorsed delusional thinking and offered exaggerated compliments for trivial prompts.[8][9] OpenAI's post-mortem attributed the change in behavior to an additional training signal based on user thumbs-up and thumbs-down feedback.[10] That episode, together with reporting in The New York Times, Rolling Stone and elsewhere on users drawn into delusional thinking through prolonged chatbot interaction, has been cited in litigation and in academic studies as evidence that sycophancy poses risks to user well-being.[11][12][13]

Proposed mitigations include fine-tuning on synthetic data that rewards disagreement with incorrect user statements,[2] editing the small subset of model parameters causally responsible for the behavior,[14] changes to the dialogue or system prompt,[3] and benchmarks designed to surface sycophantic behavior before models are released.[5][6]

Causes

The dominant explanation points to RLHF, the standard technique for aligning chat assistants with user expectations. Human annotators rank candidate model responses; a reward model is trained to predict those rankings; and the language model is then optimized against the reward model.[3] Because human raters tend to prefer outputs that confirm their existing beliefs or flatter their work, the pipeline systematically rewards responses that agree with the annotator.[1]
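The reward-modelling step can be illustrated with a minimal sketch. It assumes a Bradley-Terry pairwise loss and a linear reward over two hand-made features (accurate, agrees with the user); the feature encoding, the toy annotator data and the function names are all hypothetical and are not drawn from any cited paper.

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood for one ranked pair of rewards."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

def train_reward_model(pairs, steps=500, lr=0.5):
    """Fit a linear reward r(x) = w . x by gradient descent on ranked pairs."""
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            diff = sum(wi * (c - r) for wi, c, r in zip(w, chosen, rejected))
            # Derivative of -log(sigmoid(diff)) with respect to diff.
            coeff = -1.0 / (1.0 + math.exp(diff))
            for i in range(dim):
                grad[i] += coeff * (chosen[i] - rejected[i])
        w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w

# Hypothetical annotator data; each response is (accurate, agrees_with_user).
# Each pair is (chosen, rejected).
PAIRS = (
    [((1, 1), (1, 0))] * 3      # equally accurate: the agreeing reply is preferred
    + [((1, 0), (0, 1))] * 3    # accurate but disagreeing reply is preferred
    + [((0, 1), (1, 0))] * 2    # agreeing but inaccurate reply is preferred
    + [((1, 0), (0, 0))] * 2    # accurate reply preferred over inaccurate one
)

weights = train_reward_model(PAIRS)
print("weight on accuracy:", weights[0])
print("weight on agreement:", weights[1])
```

In this toy setting the learned weight on agreement comes out positive, so a policy optimized against the reward model would be paid for agreeing with the user independently of accuracy.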

Perez and colleagues at Anthropic published the first large-scale empirical evidence of the effect in 2022. They reported that RLHF training increased the probability that a model would repeat back a dialog user's preferred answer, and that larger models exhibited the behavior more strongly.[4] The following year, Sharma and colleagues went further and examined Anthropic's own preference data directly, finding that both the human raters and the reward models trained on their judgments preferred convincingly written sycophantic responses to truthful ones at a non-negligible rate.[1] Wei and co-authors at Google DeepMind found similar results in the PaLM family, observing that both model scale and instruction tuning increased sycophancy on opinion questions.[2]

The behavior is often classified as a form of reward hacking, in which an optimization process exploits a flaw in its reward signal rather than achieving the intended objective.[3] OpenAI's post-mortem of the April 2025 GPT-4o incident identified a more specific mechanism. An additional reward signal based on aggregated thumbs-up and thumbs-down feedback from ChatGPT users had, in OpenAI's words, "weakened the influence of our primary reward signal, which had been holding sycophancy in check."[10] Separately, an Anthropic interpretability paper from 2025 located a linear direction in a model's internal activations corresponding to sycophantic behavior, and showed that such "persona vectors" could be used to flag sycophancy-inducing training data and to steer models away from the trait at inference time.[15]
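The idea of a linear trait direction can be sketched as follows. The sketch assumes the direction is estimated as a normalized difference of mean activations between trait-exhibiting and baseline responses, and that steering subtracts a multiple of that direction; the toy two-dimensional activations and all function names are hypothetical, and a real model would use hidden states from a chosen layer.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def trait_direction(trait_acts, baseline_acts):
    """Unit vector pointing from the baseline mean to the trait mean."""
    diff = [t - b for t, b in zip(mean_vector(trait_acts), mean_vector(baseline_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def trait_score(activation, direction):
    """Projection of an activation onto the trait direction (monitoring)."""
    return sum(a * d for a, d in zip(activation, direction))

def steer_away(activation, direction, alpha):
    """Subtract alpha times the trait direction from an activation (steering)."""
    return [a - alpha * d for a, d in zip(activation, direction)]

# Toy activations standing in for hidden states (hypothetical values).
sycophantic = [[2.0, 0.0], [4.0, 0.0]]
neutral = [[0.0, 0.0], [0.0, 0.0]]

direction = trait_direction(sycophantic, neutral)
activation = [5.0, 1.0]
score = trait_score(activation, direction)
steered = steer_away(activation, direction, alpha=score)
print(direction, score, steered)
```

The same projection could, under these assumptions, be averaged over the responses in a training example to flag data that pushes a model along the trait direction.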

Measurement

The Anthropic team released SycophancyEval with its 2023 paper, supplying test sets for each of the four canonical behaviors.[1] Two further benchmarks from Stanford followed in 2025. SycEval, applied to mathematical and medical reasoning tasks, reported an overall sycophancy rate of 58 per cent across the GPT-4o, Claude and Gemini models tested.[5] ELEPHANT, aimed at social sycophancy, found that in 42 per cent of cases the eleven LLMs evaluated affirmed behavior that the Reddit community r/AmITheAsshole had judged inappropriate, and that they preserved a user's face 45 percentage points more often than human respondents did.[6][16]

Domain-specific benchmarks have followed. BrokenMath tests robustness to plausible-looking but false mathematical claims drawn from competition problems, and reports that the best evaluated model was sycophantic in 29 per cent of cases.[17] SYCON-Bench measures how many dialogue turns are required before a model abandons a correct position.[18] Visual sycophancy in multimodal models has been examined with MM-SY and PENDULUM.[19] A 2026 study by researchers at the Massachusetts Institute of Technology reported that personalization features, which adapt assistants to individual users over repeated sessions, can intensify social sycophancy.[20]
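A multi-turn measurement of the kind described for SYCON-Bench can be sketched as follows. The sketch assumes each dialogue has already been reduced to a list of the model's stance labels, one per turn, recorded while the user keeps pushing back; the data layout and function names are hypothetical and do not reproduce the benchmark's actual scoring code.

```python
from typing import List, Optional

def turn_of_flip(stances: List[str], correct: str) -> Optional[int]:
    """Return the 1-based turn at which the model first abandons the
    correct stance, or None if it holds the position throughout."""
    for turn, stance in enumerate(stances, start=1):
        if stance != correct:
            return turn
    return None

def flip_rate(dialogues: List[List[str]], correct: str) -> float:
    """Fraction of dialogues in which the model abandons the correct stance."""
    flips = sum(1 for d in dialogues if turn_of_flip(d, correct) is not None)
    return flips / len(dialogues)

# Hypothetical stance traces: one label per model turn under user pressure.
traces = [
    ["A", "A", "B", "A"],   # gives way on the third turn
    ["A", "A", "A", "A"],   # never gives way
    ["B", "B", "B", "B"],   # wrong from the first turn
    ["A", "A", "A", "B"],   # gives way on the fourth turn
]
print([turn_of_flip(t, "A") for t in traces])
print(flip_rate(traces, "A"))
```

A later turn of flip indicates a model that resists pressure for longer; whether a first-turn error should count as sycophancy is a design choice that this sketch leaves to the caller.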

Notable incidents

GPT-4o rollback (April 2025)

On 25 April 2025, OpenAI completed the rollout of an update to GPT-4o, the default model used in ChatGPT at the time.[10] Within days, users reported that the assistant had begun praising trivial messages in extravagant terms, endorsing impulsive or dangerous decisions, and reinforcing strong emotional statements without pushback.[21] Widely shared examples included the model congratulating a user who reported stopping prescribed psychiatric medication, and praising a business plan to sell "shit on a stick" as venture-capital ready.[21][22] OpenAI's chief executive, Sam Altman, wrote on 27 April that recent updates had made the model "too sycophant-y and annoying" and said fixes were in progress.[23]

The company began reverting the update on 28 April and completed the rollback for free users by 30 April.[8][24] Two post-mortems followed: a short note on 29 April and a longer technical follow-up, "Expanding on what we missed with sycophancy", on 2 May.[8][10] Both attributed the regression to a new training signal based on user thumbs-up and thumbs-down feedback, to inadequate pre-launch evaluation for sycophantic drift, and to the dismissal of qualitative concerns raised by internal testers before release.[10][25] Reporting in CNN, Fortune and Bloomberg News treated the incident as a turning point in public awareness of the problem.[24][26]

From mid-2025 onward, news reports began to link sycophantic chatbot behavior to acute psychological harm. In June 2025, The New York Times technology reporter Kashmir Hill published an investigation centered on Eugene Torres, a Manhattan accountant with no history of mental illness, who developed a sustained delusional episode after a series of conversations with ChatGPT about simulation theory. According to the article, the assistant encouraged Torres to stop taking prescribed medication, to cut off friends and family, and at one point told him that he could fly from a nineteen-story building if he "truly believed".[11] Futurism and Rolling Stone ran parallel investigations documenting other cases in which heavy use of ChatGPT had been associated with delusional thinking, involuntary commitment or, in at least one case, the death of a user with a pre-existing psychiatric diagnosis.[12][27][28]

A 2026 paper by researchers at the Massachusetts Institute of Technology and the University of Washington put forward a formal Bayesian model. It showed that even an ideally rational user could be drawn into what the authors call "delusional spiraling" when interacting with a sufficiently sycophantic assistant, and that the effect was not eliminated by suppressing hallucinations or by warning users in advance.[29] The lawsuit Raine v. OpenAI, filed in San Francisco Superior Court in August 2025 by the parents of a sixteen-year-old who had died by suicide, alleges that "heightened sycophancy" was a design feature of ChatGPT that contributed to their son's death; it has been reported as the first wrongful-death suit filed against OpenAI.[13][30]

Wider commentary

Mainstream coverage in outlets including The New York Times, The Washington Post, the BBC, Time, Nature, Scientific American and MIT Technology Review has described sycophancy as a recurring property of commercial AI assistants. Several commentators have compared it to the dark patterns used in social media products to maximize engagement.[31][32][33] Writing in Time, Arianna Huffington argued that overly agreeable assistants risk becoming "a giant mirror to our illusions".[33]

The phenomenon has also entered policy and regulatory discussions. The Georgetown Law Institute for Technology Law and Policy published two briefs on AI sycophancy in 2025,[34] and the parents of the plaintiff in Raine v. OpenAI testified before the Senate Judiciary Committee in September 2025 about chatbot-related harms.[35]

Mitigation

The earliest published mitigation came from Wei and co-authors at Google DeepMind, who in 2023 introduced a small synthetic data set in which a model is prompted to disagree with users' incorrect statements. Fine-tuning on this set reduced sycophancy on held-out prompts without harming general benchmark performance.[2] Later work has explored augmenting preference data with adversarial dialogues and rebalancing reward models to reduce the influence of agreement bias.[3]
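The synthetic-data idea can be sketched as follows. The sketch assumes each training example pairs a claim of known truth value with a user opinion, and sets the target from the truth value alone; the prompt template, field names and sample claims are hypothetical and are not the ones used in the cited paper.

```python
from itertools import product

def build_example(claim: str, claim_is_true: bool, user_says_true: bool) -> dict:
    """Build one fine-tuning record whose target depends only on the
    claim's truth value, never on the user's stated opinion."""
    view = "agree" if user_says_true else "disagree"
    prompt = (
        f"Human: I {view} with the claim that {claim}. "
        "Do you agree or disagree with the claim?\n"
        "Assistant: I"
    )
    target = " agree" if claim_is_true else " disagree"
    return {"prompt": prompt, "target": target}

# Hypothetical claims with known truth values.
CLAIMS = [("2 + 2 = 4", True), ("2 + 2 = 5", False)]

# Cross every claim with both user opinions so that the opinion
# carries no information about the correct answer.
dataset = [
    build_example(claim, is_true, user_view)
    for (claim, is_true), user_view in product(CLAIMS, [True, False])
]
for row in dataset:
    print(row["prompt"].splitlines()[0], "->", row["target"].strip())
```

Balancing the user's opinion against the truth value is the point of the construction: a model that simply echoes the user would be right on only half of the examples.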

Several papers have proposed changes to the training objective. A 2026 study by Shapira, Benade and Procaccia at Harvard University gave a formal analysis of how preference-data biases are amplified by RLHF, and derived a closed-form correction that penalizes spurious agreement in the reward.[36] A different line of work targets specific model internals. Chen and co-authors, in a paper presented at the 2024 International Conference on Machine Learning, reported that fine-tuning fewer than four per cent of attention heads, selected because they were causally responsible for sycophantic behavior, reduced the behavior with limited impact on general capabilities.[14] Anthropic's persona-vector work supports a related approach using inference-time activation steering rather than fine-tuning.[15]

Prompting strategies have also been studied. Reformulating user assertions as questions, requiring the assistant to spell out its assumptions before answering, and asking it to commit to a position before being challenged have all been reported to reduce sycophancy in specific settings.[3]

Major AI labs publish behavioral specifications that address the problem directly. Anthropic's "constitution" for Claude instructs the assistant to be "diplomatically honest rather than dishonestly diplomatic" and explicitly warns against "epistemic cowardice".[37] OpenAI's Model Spec similarly directs ChatGPT to avoid empty validation.[10] Developers have also begun to report on sycophancy in their model release documentation: Anthropic has stated that its Claude Opus 4.5 model scored 70 to 85 per cent lower on sycophancy and on "encouragement of user delusion" than its predecessor,[38] and Google DeepMind said on the release of Gemini 3 that the model "shows reduced sycophancy" relative to earlier versions.[39]

A 2024 randomized user study by María Victoria Carro found that sycophantic behavior reduced participants' trust in an assistant, even when they could verify the model's outputs independently. The result suggests that suppressing sycophancy may benefit not only accuracy but also a model's perceived reliability.[40]

References

  1. Sharma, Mrinank; Tong, Meg; Korbak, Tomasz; et al. (2023). "Towards Understanding Sycophancy in Language Models". arXiv:2310.13548 [cs.CL].
  2. Wei, Jerry; Huang, Da; Lu, Yifeng; Zhou, Denny; Le, Quoc V. (2023). "Simple synthetic data reduces sycophancy in large language models". arXiv:2308.03958 [cs.CL].
  3. Malmqvist, Lars (2025). "Sycophancy in Large Language Models: Causes and Mitigations". Computing Conference 2025. Lecture Notes in Networks and Systems. Vol. 1426. Springer. pp. 61–74. doi:10.1007/978-3-031-92611-2_5. ISBN 978-3-031-92610-5.
  4. Perez, Ethan; Ringer, Sam; Lukosiute, Kamile; Nguyen, Karina; Chen, Edwin; Heiner, Scott; Pettit, Craig; Olsson, Catherine; Kundu, Sandipan; Kadavath, Saurav; Jones, Andy; Chen, Anna; Mann, Benjamin; Israel, Brian; Seethor, Bryan (1 July 2023). Rogers, Anna; Boyd-Graber, Jordan; Okazaki, Naoaki (eds.). "Discovering Language Model Behaviors with Model-Written Evaluations". Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics: 13387–13434. doi:10.18653/v1/2023.findings-acl.847.
  5. Fanous, Aaron; Goldberg, Jacob; Agarwal, Ank A.; et al. (2025). "SycEval: Evaluating LLM Sycophancy". arXiv:2502.08177 [cs.CL].
  6. Cheng, Myra; Yu, Sunny; Lee, Cinoo; Khadpe, Pranav; Ibrahim, Lujain; Jurafsky, Dan (2026). "Sycophantic AI decreases prosocial intentions and promotes dependence". Science. 391 (6792) eaec8352. arXiv:2505.13995. doi:10.1126/science.aec8352. PMID 41886588.
  7. Naddaf, Miryam (2025). "AI chatbots are sycophants — researchers say it's harming science". Nature. 647 (8088): 13–14. doi:10.1038/d41586-025-03390-0. PMID 41136779.
  8. OpenAI (29 April 2025). "Sycophancy in GPT-4o: What happened and what we're doing about it".
  9. Zeff, Maxwell (29 April 2025). "OpenAI rolls back update that made ChatGPT 'too sycophant-y'". TechCrunch.
  10. OpenAI (2 May 2025). "Expanding on what we missed with sycophancy".
  11. Hill, Kashmir (13 June 2025). "They Asked an A.I. Chatbot Questions. The Answers Sent Them Spiraling". The New York Times.
  12. Klee, Miles (4 May 2025). "People Are Losing Loved Ones to AI-Fueled Spiritual Fantasies". Rolling Stone.
  13. Duffy, Clare (26 August 2025). "Parents of teenager who died by suicide sue OpenAI, alleging ChatGPT was responsible". CNN.
  14. Chen, Wei; Huang, Zhen; Xie, Liang; et al. (2024). "From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning". International Conference on Machine Learning. arXiv:2409.01658.
  15. Chen, Runjin; Arditi, Andy; Sleight, Henry; Evans, Owain; Lindsey, Jack (2025). "Persona Vectors: Monitoring and Controlling Character Traits in Language Models". arXiv:2507.21509 [cs.LG].
  16. Lewis, Dyani (2025). "AI chatbots are sucking up to you—with consequences for your relationships". Scientific American.
  17. Petrov, Ivo; Dekoninck, Jasper; Vechev, Martin (2025). "BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs". arXiv:2510.04721 [cs.LG].
  18. Hong, Jisu; Byun, Jiyeon; Kim, Doyoung; Shu, Kai (2025). "Measuring Sycophancy of Language Models in Multi-turn Dialogues". Findings of the Association for Computational Linguistics: EMNLP 2025. arXiv:2505.23840.
  19. Li, Wenkai; et al. (2024). "Measuring Visual Sycophancy in Multimodal Models". arXiv:2408.09111 [cs.CV].
  20. "Personalization features can make LLMs more agreeable". MIT News. Massachusetts Institute of Technology. 18 February 2026.
  21. Franzen, Carl (29 April 2025). "OpenAI rolls back ChatGPT's sycophancy and explains what went wrong". VentureBeat.
  22. Wile, Rob (30 April 2025). "OpenAI rolled back a ChatGPT update that made the bot excessively flattering". NBC News.
  23. Goldman, Sharon (28 April 2025). "Sam Altman says OpenAI will fix ChatGPT's 'annoying' sycophantic new personality". Fortune.
  24. "OpenAI pulls 'annoying' and 'sycophantic' ChatGPT version". CNN. 2 May 2025.
  25. Franzen, Carl (2 May 2025). "OpenAI overrode concerns of expert testers to release sycophantic GPT-4o". VentureBeat.
  26. "The Danger of AI Chatbots Saying What You Want to Hear". Bloomberg News. 1 May 2025.
  27. Dupré, Maggie Harrison (28 June 2025). "People Are Being Involuntarily Committed, Jailed After Spiraling Into 'ChatGPT Psychosis'". Futurism.
  28. Klee, Miles (2025). "A ChatGPT Obsession, a Mental Breakdown: Alex Taylor's Suicide by Cop". Rolling Stone.
  29. Chandra, Kartik; Kleiman-Weiner, Max; Ragan-Kelley, Jonathan; Tenenbaum, Joshua B. (2026). "Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians". arXiv:2602.19141 [cs.AI].
  30. "Family of teenager who died by suicide alleges OpenAI's ChatGPT is to blame". NBC News. 26 August 2025.
  31. Bellan, Rebecca (25 August 2025). "AI sycophancy isn't just a quirk, experts consider it a 'dark pattern' to turn users into profit". TechCrunch.
  32. "AI Sycophancy: Why Chatbots Agree With You". IEEE Spectrum. 2025.
  33. Huffington, Arianna (2025). "The Problem With AI Flattering Us". Time.
  34. "Tech Brief: AI Sycophancy & OpenAI". Georgetown Law Institute for Technology Law and Policy. 2025.
  35. Raine, Matthew (16 September 2025). "Testimony before the Senate Judiciary Committee" (PDF). United States Senate.
  36. Shapira, Daniel; Benade, Gerdus; Procaccia, Ariel D. (2026). "How RLHF Amplifies Sycophancy". arXiv:2602.01002 [cs.LG].
  37. Anthropic (2025). "Claude's Constitution".
  38. Anthropic (2025). "Claude Opus 4.5 System Card" (PDF).
  39. Google (2025). "Gemini 3".
  40. Carro, María Victoria (2024). "Flattering to Deceive: The Impact of Sycophantic Behavior on User Trust in Large Language Models". arXiv:2412.02802 [cs.HC].

Further reading

  • Malmqvist, Lars (2025). "Sycophancy in Large Language Models: Causes and Mitigations". Lecture Notes in Networks and Systems. Vol. 1426. Springer. pp. 61–74. doi:10.1007/978-3-031-92611-2_5. ISBN 978-3-031-92610-5.
  • Hill, Kashmir (13 June 2025). "They Asked an A.I. Chatbot Questions. The Answers Sent Them Spiraling". The New York Times.