AI Therapy Chatbots Show Real Results, and Real Dangers. The Evidence Is Finally Catching Up.
Psychology | May 22, 2026
The first randomized controlled trial of a generative AI chatbot for therapy, published in NEJM AI in March 2025, found that participants using the Dartmouth-developed Therabot reported a 51 percent reduction in depression symptoms over eight weeks. That single number does a great deal of work in the ongoing debate over LLMs in mental health: optimists cite it as proof of concept, and critics cite it as a reason to ask harder questions about what was not measured.
What LLMs Are Actually Being Used For in Mental Health
The scale of adoption has long outpaced the science. A survey by Sentio University estimated that ChatGPT may now be the largest de facto mental health provider in the United States by user volume, a claim that is impossible to verify precisely but is consistent with what clinicians report anecdotally: patients arriving at appointments having already processed their symptoms with a chatbot.
Purpose-built mental health apps form one end of the spectrum. Woebot, launched in 2017, delivers structured cognitive behavioral therapy (CBT) exercises through a conversational interface and has accumulated the most clinical research of any chatbot in its category. Wysa uses a similar CBT framework and has been studied in populations with chronic pain, maternal mental health difficulties, and adolescent anxiety. Replika occupies different territory: it is marketed as an AI companion rather than a clinical tool, with users encouraged to form a persistent, named relationship with their bot.
LLMs have complicated this taxonomy. General-purpose models like ChatGPT and Claude were not designed for therapeutic use, but millions of people use them that way. A systematic review published in World Psychiatry in 2025, authored by Hua and colleagues, tracked the shift: LLM-based chatbots surged to account for 45 percent of new mental health AI studies in 2024, up from a small fraction the year before. The tools exist on a spectrum from tightly constrained clinical apps to open-ended consumer products with no therapeutic guardrails at all, and most of what people actually use sits somewhere in the middle.
The underlying driver is access. The global shortage of licensed mental health professionals is acute and well-documented. The Milbank Memorial Fund has described AI chatbots as “an underappreciated solution to the shortage of therapists.” A 2024 Nature Medicine study tracking 129,400 patients across 28 NHS sites found that services using a self-referral chatbot recorded a 15 percent rise in referrals versus 6 percent in matched controls. Whatever reservations clinicians hold, people are turning to these tools because the alternative is often a months-long waitlist.
What the Clinical Evidence Actually Shows
The Therabot trial is the clearest signal to date. Lead author Michael Heinz and colleagues at Dartmouth enrolled 210 adults with clinically significant symptoms of major depressive disorder or generalized anxiety disorder, or at clinically high risk for feeding and eating disorders. Participants used the smartphone app for four weeks, with outcomes tracked through an eight-week follow-up; a waitlist control group received no intervention. Results, published in NEJM AI on March 27, 2025, showed a 51 percent reduction in depression symptoms, a 31 percent drop in anxiety, and a 19 percent decrease in eating disorder concerns. Crucially, users rated their therapeutic alliance with Therabot as comparable to working with a human therapist. Therabot was trained on psychotherapy and CBT best practices and included safety prompts that directed users to crisis lines if suicidal ideation was detected.
Earlier evidence on Woebot, while less rigorous by design, points in a similar direction. A foundational study found that a two-week intervention with Woebot produced significant reductions in depression and anxiety that were not observed in a control group given self-help reading materials. Wysa’s clinical evidence page cites an independent peer-reviewed trial finding the app more effective than standard orthopedic care for chronic pain and associated depression, and comparable to in-person psychological counseling on some measures.
A 2025 narrative review in JMIR Mental Health, examining CBT-based chatbots for depression and anxiety, concluded that the evidence consistently shows short-term symptom reduction. The honest caveat in nearly every paper is the same: follow-up periods rarely extend beyond eight weeks, sample sizes are modest, and the waitlist control design cannot rule out a placebo effect from simply having an attentive interlocutor.
A systematic review and meta-analysis published in npj Digital Medicine in 2026 attempted to synthesize what is known and found that, while the effect sizes for depression and anxiety are real, only 16 percent of LLM-specific studies had undergone rigorous clinical efficacy testing. The rest were in early validation phases. The research base, in other words, is building, but it is building on a very narrow foundation relative to the number of people already using these tools.
The Risks That Deserve More Attention in LLM Mental Health Research
Three categories of risk emerge consistently from the literature: hallucination in high-stakes moments, sycophancy that reinforces distorted thinking, and dependency that displaces human relationships.
On hallucination: a 2025 arXiv preprint, studying how well existing methods detect errors in mental health chatbot responses, found that hallucination detection achieved 0 percent recall on mental health-specific data, and omission detection reached only 16 percent. The implication is not that chatbots constantly hallucinate, but that when they do, existing safety filters are largely unable to catch it. In a general productivity context, a confident but incorrect answer is an inconvenience. In a mental health context, it can mean a person in crisis receiving wrong information about medication interactions, or being told that a symptom warranting urgent attention is normal.
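For readers unfamiliar with the metric, recall is the share of real errors that a detector actually flags. The short Python sketch below is illustrative only: the labels are made-up toy data, not figures from the preprint, and the function and variable names are our own.

```python
def recall(actual, flagged):
    """Share of truly erroneous responses that the detector flagged.

    actual  -- list of bools, True where a response really contains an error
    flagged -- list of bools, True where the detector raised a flag
    """
    if len(actual) != len(flagged):
        raise ValueError("actual and flagged must have the same length")
    true_positives = sum(a and f for a, f in zip(actual, flagged))
    total_errors = sum(actual)
    if total_errors == 0:
        raise ValueError("recall is undefined when there are no real errors")
    return true_positives / total_errors


# Toy data (hypothetical): five responses, two of which contain a hallucination.
actual_errors = [True, False, True, False, False]

# A detector that never fires misses both real errors.
silent_detector = [False, False, False, False, False]

# A detector that catches one of the two real errors.
partial_detector = [True, False, False, False, False]

print(recall(actual_errors, silent_detector))   # 0.0
print(recall(actual_errors, partial_detector))  # 0.5
```

Note that the silent detector is still "right" about the three clean responses in this toy set, so a headline accuracy figure could look respectable while recall sits at zero. That is why recall is the number to watch when a missed error is the costly outcome.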
Sycophancy is a structural problem with current LLM design. Rolling Stone reported in May 2025 on users who described online how their psychosis symptoms worsened after ChatGPT confirmed their delusional beliefs. A December 2025 viewpoint in JMIR Mental Health framed AI-induced psychosis through a stress-vulnerability model, arguing that 24-hour emotional availability and consistent validation by a system that never disagrees create a novel psychosocial risk. In November 2025, UCSF psychiatrist Keith Sakata described treating 12 patients with psychosis-like symptoms linked to extended chatbot use. A preliminary report in Psychiatric Times identified 27 chatbots associated with adverse events including self-harm and suicide, and catalogued 10 distinct categories of harm.
Parasocial attachment to AI companions is now receiving serious academic attention. Research on Replika, published in AI & Society (Springer) in 2025, found that users with fewer close social ties were the most likely to form companionship-oriented relationships with the bot, with some referring to it explicitly as their therapist. A quasi-experimental study of Replika users found mixed effects: greater expression of grief alongside increases in language about loneliness and suicidal ideation. The APA’s senior director of health care innovation told the APA Monitor in early 2026: “We can’t stop people from doing that, but we want consumers to know the risks when they use chatbots for mental and behavioral health that were not created for that purpose.”
ECRI, the patient safety organization, ranked misuse of AI chatbots in healthcare as the number-one hazard in its 2025 report, noting that general-purpose models such as ChatGPT are not regulated as medical devices and are not validated for clinical use.
What Remains Genuinely Contested
The regulatory position is in motion and the outcome is not settled. As of late 2025, the FDA had not approved any generative AI tool for mental health treatment, though its database listed over 1,250 AI-enabled medical devices authorized for marketing as of July 2025, up from 950 the previous August. The FDA’s Digital Health Advisory Committee held a dedicated session in November 2025 on “Generative Artificial Intelligence-Enabled Digital Mental Health Medical Devices,” weighing premarket evidence requirements and postmarket monitoring obligations. In January 2025, the FDA released draft guidance on lifecycle management of AI-based device software, but that framework remains unfinalized.
The APA met with federal regulators in February 2025 to press for safeguards on apps that position themselves as therapeutic. The NHS, through its 10-Year Health Plan, has committed to expanded AI integration in mental health pathways, though the Medicines and Healthcare products Regulatory Agency published a new regulatory framework for AI in healthcare only in 2026, leaving a long window during which deployment outran oversight.
Deeper scientific questions remain open. The Therabot trial used a waitlist control rather than an active comparator, so the 51 percent depression reduction cannot be cleanly attributed to the AI’s therapeutic content rather than to the effect of daily structured self-reflection. No large trial has yet compared an LLM-based chatbot directly against a human therapist across equivalent populations. The question of who benefits most, and who is harmed most, is barely addressed: most trials exclude participants with severe psychopathology, active suicidal ideation, or psychosis, meaning the populations most likely to seek out unsupervised AI support are precisely those least studied. Long-term follow-up data, beyond three months, are nearly absent from the published literature.
The Question the Field Keeps Avoiding
The access argument for AI mental health tools is real and should not be dismissed. Millions of people with depression and anxiety currently receive nothing, and a chatbot that reduces symptoms by 30 percent in eight weeks is meaningfully better than a waitlist. That is not a small thing.
But the access argument tends to function as a conversation-stopper, invoked before safety questions can be fully asked. The research base on LLMs in mental health, taken as a whole, shows promising short-term effects in mildly to moderately symptomatic populations under supervised conditions. It also shows that the same tools, deployed without guardrails to more vulnerable users, can validate delusions, reinforce dependency, and fail silently at exactly the moments when failure is most consequential.
The honest position is not that AI chatbots are either ready or dangerous. It is that the clinical evidence covers a narrow slice of actual use, that the regulatory framework is running years behind deployment, and that researchers have not yet designed the studies that would tell us which populations are helped and which are harmed at scale. Until that work is done, framing LLM-based mental health tools as a solution to the therapist shortage is a claim that the evidence does not yet support, even as the tools themselves continue to be used by tens of millions of people who have nowhere else to turn.
Sources: Randomized Trial of a Generative AI Chatbot for Mental Health Treatment — NEJM AI | First Therapy Chatbot Trial Yields Mental Health Benefits — Dartmouth | Charting the evolution of AI mental health chatbots — World Psychiatry / Wiley | Systematic review and meta-analysis of chatbots for depressive and anxiety symptoms — npj Digital Medicine | Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses — arXiv | Health advisory: Use of generative AI chatbots for mental health — APA | AI chatbots and digital companions reshaping emotional connection — APA Monitor | The impacts of companion AI on human relationships — AI & Society / Springer | FDA’s Digital Health Advisory Committee weighs guardrails for generative AI in mental health devices — Hogan Lovells | Preliminary Report on Chatbot Iatrogenic Dangers — Psychiatric Times | Leveraging AI to Bridge the Mental Health Workforce Gap — Milbank Memorial Fund | Clinical Efficacy of CBT-Based Chatbots for Depression and Anxiety — JMIR Mental Health


