Can Small Language Models Generate Therapist-Like Responses? A Lightweight Study of Therapist Imitation in Mental Health Support

  • Yifan Zhang, Teachers College, Columbia University, United States of America
  • Zhongwen Zhou, University of California, Berkeley, United States of America
Keywords: Empathetic Dialogue, Mental Health Support, Small Language Models, Therapist Imitation

Abstract

Therapist-like response generation is increasingly discussed in digital mental health, yet most studies either focus on large pretrained systems or show illustrative outputs without a full lightweight benchmark. This paper asks whether small, non-pretrained language models can imitate therapist-style discourse in a reproducible setting. We used EmpatheticDialogues, an empathy-oriented dialogue corpus of roughly 25,000 conversations (Rashkin et al., 2019). Its widely used utterance-level split contains 76,673 training, 12,030 validation, and 10,943 test records; the parsed conversation release used here contained 19,532/2,769/2,546 dialogues (training/validation/test) and yielded 40,252/5,736/5,257 listener-turn targets after we restricted supervision to supportive listener responses. We evaluated six lightweight systems on the full validation and test sets: an emotion template, TF-IDF retrieval, retrieval with micro-skill bias, an emotion-conditioned bigram language model, an emotion-conditioned trigram language model, and a trigram model with therapist-style biasing. All reported numbers are measured empirical results. The best overall system, Emotion-TrigramLM+Bias, achieved a BLEU-4 of 0.0191 on validation and 0.0183 on test, a ROUGE-L of 0.1652/0.1633 (validation/test), and a therapist imitation score (TIS) of 0.6500/0.6487. Retrieval remained the most diverse model, reaching a test Distinct-2 of 0.2551, but its therapist-style density was low at TIS = 0.2005. Adding therapist micro-skill bias improved retrieval by +0.0042 BLEU-4 and +0.3603 TIS on the test set, and improved the trigram model by +0.0059 BLEU-4 and +0.3056 TIS. Performance was strongest on negative-emotion turns, where acknowledgments and follow-up questions aligned closely with the references. The findings show that very small models can imitate the surface form of therapeutic language surprisingly well, but they do so mainly by compressing support into generic scripts. Lightweight therapist imitation is therefore feasible for low-risk acknowledgment support, but it is not a replacement for licensed mental health care.
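
The Distinct-2 diversity figure reported above follows the distinct n-gram ratio of Li et al. (2016): the number of unique n-grams divided by the total number of n-grams in the generated responses. The sketch below is only an illustration of that ratio, not the authors' evaluation code. The function name distinct_n, the lowercasing, the whitespace tokenization, and the pooling of n-grams over the whole response set are all assumptions, since the abstract does not state these details.

```python
from typing import Iterable


def distinct_n(responses: Iterable[str], n: int = 2) -> float:
    """Corpus-level Distinct-n: unique n-grams / total n-grams.

    Assumptions (not taken from the paper): lowercased text,
    whitespace tokenization, n-grams pooled over all responses.
    """
    total = 0
    unique = set()
    for text in responses:
        tokens = text.lower().split()
        # Sliding window of n consecutive tokens; empty if len(tokens) < n.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    # Avoid division by zero when no response has at least n tokens.
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    generic = ["i am so sorry to hear that", "i am so sorry about that"]
    print(f"Distinct-2 = {distinct_n(generic, n=2):.4f}")
```

Under these assumptions, responses that reuse the same opening script share most of their bigrams and therefore score lower, which is the pattern the abstract describes when it contrasts retrieval with the biased trigram model.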

References

Davis, M. H. (1983). Measuring individual differences in empathy: Evidence for a multidimensional approach. Journal of Personality and Social Psychology, 44(1), 113-126. https://doi.org/10.1037/0022-3514.44.1.113

Decety, J., & Jackson, P. L. (2004). The functional architecture of human empathy. Behavioral and Cognitive Neuroscience Reviews, 3(2), 71-100. https://doi.org/10.1177/1534582304267187

Dinan, E., Roller, S., Shuster, K., Fan, A., Auli, M., & Weston, J. (2019). Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the 7th International Conference on Learning Representations.

Elliott, R., Bohart, A. C., Watson, J. C., & Greenberg, L. S. (2011). Empathy. Psychotherapy, 48(1), 43-49. https://doi.org/10.1037/a0022187

Fitzpatrick, K. K., Darcy, A., & Vierhile, M. (2017). Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. JMIR Mental Health, 4(2), e19. https://doi.org/10.2196/mental.7785

Fulmer, R., Joerin, A., Gentile, B., Lakerink, L., & Rauws, M. (2018). Using psychological artificial intelligence (Tess) to relieve symptoms of depression and anxiety: Randomized controlled trial. JMIR Mental Health, 5(4), e64. https://doi.org/10.2196/mental.9782

Hill, C. E. (2014). Helping skills: Facilitating exploration, insight, and action (4th ed.). American Psychological Association. https://doi.org/10.1037/14345-000

Hojat, M., Mangione, S., Nasca, T. J., Cohen, M. J., Gonnella, J. S., Erdmann, J. B., Veloski, J., & Magee, M. (2001). The Jefferson Scale of Physician Empathy: Development and preliminary psychometric data. Educational and Psychological Measurement, 61(2), 349-365. https://doi.org/10.1177/00131640121971158

Inkster, B., Sarda, S., & Subramanian, V. (2018). An empathy-driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: Real-world data evaluation. JMIR mHealth and uHealth, 6(11), e12106. https://doi.org/10.2196/12106

Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., & Jurafsky, D. (2016). A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 110-119). https://doi.org/10.18653/v1/N16-1014

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop (pp. 74-81).

Majumder, N., Hong, P., Peng, S., Lu, J., Ghosal, D., Gelbukh, A., Mihalcea, R., & Poria, S. (2020). MIME: MIMicking emotions for empathetic response generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/2020.emnlp-main.721

Miller, A. H., Feng, W., Fisch, A., Lu, J., Batra, D., Bordes, A., Parikh, D., & Weston, J. (2017). ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. https://doi.org/10.18653/v1/D17-2014

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318). https://doi.org/10.3115/1073083.1073135

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (pp. 1532-1543). https://doi.org/10.3115/v1/D14-1162

Rashkin, H., Smith, E. M., Li, M., & Boureau, Y.-L. (2019). Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 5370-5381). https://doi.org/10.18653/v1/P19-1534

Rogers, C. R. (1957). The necessary and sufficient conditions of therapeutic personality change. Journal of Consulting Psychology, 21(2), 95-103. https://doi.org/10.1037/h0045357

Shum, H.-Y., He, X.-D., & Li, D. (2018). From Eliza to XiaoIce: Challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering, 19(1), 10-26. https://doi.org/10.1631/FITEE.1700826

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (pp. 3104-3112).

Vaidyam, A. N., Wisniewski, H., Halamka, J. D., Kashavan, M. S., & Torous, J. B. (2019). Chatbots and conversational agents in mental health: A review of the psychiatric landscape. Canadian Journal of Psychiatry, 64(7), 456-464. https://doi.org/10.1177/0706743719828977

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30 (pp. 5998-6008).

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., et al. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 38-45). https://doi.org/10.18653/v1/2020.emnlp-demos.6

Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., & Weston, J. (2018). Personalizing dialogue agents: I have a dog; do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (pp. 2204-2213). https://doi.org/10.18653/v1/P18-1205

Published
2026-06-30
How to Cite
Zhang, Y., & Zhou, Z. (2026). Can Small Language Models Generate Therapist-Like Responses? A Lightweight Study of Therapist Imitation in Mental Health Support. Bulletin of Counseling and Psychotherapy, 8(2). https://doi.org/10.51214/002026081913000