AI Struggles to Mimic Human Tone in Social Media Interactions, Study Finds

The next time you see an extraordinarily polite response on social media, it might be worth taking a second look. It could be an artificial intelligence model trying, albeit unsuccessfully, to fit in.

On Wednesday, a study published by researchers from the University of Zurich, University of Amsterdam, Duke University, and New York University highlighted that AI models remain readily distinguishable from humans in social media interactions, with an excessively friendly emotional tone serving as the most common giveaway of AI-generated content. The researchers tested nine open-weight models against posts from Twitter/X, Bluesky, and Reddit and found that their classifiers could spot AI-generated replies with 70 to 80 percent accuracy.

The study introduces a “computational Turing test” to assess how closely AI models can approximate human language. Unlike approaches that rely on human judges to rate authenticity, this framework uses automated classifiers and linguistic analysis to identify the specific features that distinguish machine-generated content from human-authored text.
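The article doesn't reproduce the researchers' classifier code, but the core idea is straightforward to sketch. The snippet below is a minimal illustration only, not the study's implementation: it fits an off-the-shelf logistic regression on TF-IDF n-gram features to separate human replies from AI-generated ones, with invented placeholder texts standing in for the study's data and its richer linguistic features.

```python
# Minimal sketch of a "computational Turing test"-style classifier.
# The replies below are invented placeholders, not data from the study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training replies: label 1 = AI-generated, 0 = human-written.
texts = [
    "That's a wonderful perspective, thank you so much for sharing!",
    "I completely understand your concern and appreciate your honesty.",
    "lol no that's not how any of this works",
    "this take is so bad it looped back around to funny",
]
labels = [1, 1, 0, 0]

# Word n-grams stand in for the richer linguistic features
# (emotional tone, structure) that the study's classifiers analyze.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

new_reply = "What a thoughtful comment! I'm so glad you brought this up."
prob_ai = clf.predict_proba([new_reply])[0][1]
print(f"estimated probability reply is AI-generated: {prob_ai:.2f}")
```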

“Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression,” wrote the research team, led by Nicolò Pagan at the University of Zurich. The team explored various optimization strategies, from simple prompting to fine-tuning, and found that deeper emotional cues persist as reliable signs that a given piece of digital interaction was authored by an AI chatbot rather than a human.

The Toxicity Tell

As part of the research, nine large language models were tested: Llama 3.1 8B, Llama 3.1 8B Instruct, Llama 3.1 70B, Mistral 7B v0.1, Mistral 7B Instruct v0.2, Qwen 2.5 7B Instruct, Gemma 3 4B Instruct, DeepSeek-R1-Distill-Llama-8B, and Apertus-8B-2509.

When these models were tasked with generating replies to real posts from actual social media users, they struggled to match the casual negativity and spontaneous emotional expression common in human posts. Across all three platforms, AI-generated replies scored consistently lower on toxicity than genuine human responses.
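A comparison of that kind can be approximated with an off-the-shelf toxicity scorer. The sketch below uses the open-source Detoxify library purely as a stand-in; the article doesn't name the scorer the researchers actually used, and the example replies are invented.

```python
# Illustrative toxicity comparison using the open-source Detoxify library.
# Assumption: the study's actual toxicity scorer isn't named in the article.
# pip install detoxify
from detoxify import Detoxify

human_replies = [
    "that's the dumbest take I've read all week",
    "touch grass, seriously",
]
ai_replies = [
    "I see where you're coming from, but I respectfully disagree!",
    "Thanks for sharing, this is such an interesting discussion.",
]

model = Detoxify("original")

def mean_toxicity(texts):
    """Average toxicity score (0 to 1) across a list of texts."""
    scores = model.predict(texts)["toxicity"]
    return sum(scores) / len(scores)

print("human mean toxicity:", round(mean_toxicity(human_replies), 3))
print("AI mean toxicity:   ", round(mean_toxicity(ai_replies), 3))
```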

To address this shortcoming, the researchers tried optimization techniques, including supplying example writings and retrieving contextual posts, to minimize structural differences such as sentence length and word count. Even so, variations in emotional tone remained evident (a schematic example of the prompting approach appears below). “Our comprehensive calibration tests challenge the assumption that more sophisticated optimization necessarily yields more human-like output,” the researchers concluded.
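For a sense of what “supplying example writings” looks like in practice, here is a schematic few-shot prompt builder. The exact prompts and retrieval setup from the paper aren't reproduced in the article, so the wording and example posts below are invented.

```python
# Schematic few-shot "calibration" prompt of the kind described above.
# The example replies and instructions are invented, not from the paper.
EXAMPLE_REPLIES = [
    "nah that's just reddit being reddit again",
    "hot take but the sequel was better and I'll die on this hill",
    "ok this is actually kinda hilarious",
]

def build_prompt(post: str) -> str:
    """Assemble a prompt that shows the model real-looking replies
    before asking it to respond in the same register."""
    examples = "\n".join(f"- {r}" for r in EXAMPLE_REPLIES)
    return (
        "Here are replies this user has written before:\n"
        f"{examples}\n\n"
        "Reply to the following post in the same style and length:\n"
        f"{post}"
    )

print(build_prompt("Just saw the new trailer. Thoughts?"))
```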
