The next time you come across an exceptionally polite response on social media, take a closer look. It might just be an AI algorithm struggling to blend in.
On Wednesday, researchers from the University of Zurich, University of Amsterdam, Duke University, and New York University published a study indicating that AI models are still clearly identifiable in online conversations. The primary giveaway is their consistently friendly tone. Testing nine different AI models on Twitter/X, Bluesky, and Reddit, the researchers found that automated classifiers correctly identified AI-generated replies with 70 to 80 percent accuracy.
The study introduced a "computational Turing test" that evaluates how well AI mimics human language. Instead of relying on human opinion about text authenticity, this framework uses automated classifiers and linguistic analysis to pinpoint features that differentiate machine-generated content from human input.
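In outline, that framework is a supervised text-classification problem: collect replies labeled human or AI, extract linguistic features, and train a model to tell the two apart. Here is a minimal sketch in Python, assuming scikit-learn; the toy texts and the TF-IDF plus logistic-regression pipeline are illustrative assumptions, not the study's actual setup.

```python
# Sketch of a "computational Turing test" style classifier:
# a binary model that separates human-written from AI-written replies.
# Texts, features, and model choice are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy labeled data: 1 = AI-generated reply, 0 = human-written reply.
texts = [
    "What a wonderful perspective! Thank you so much for sharing this.",
    "lol no. this take gets worse every time i see it",
    "That's a great point, and I really appreciate the thoughtful framing!",
    "eh, source? because that's not what the article says at all",
]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

# Word n-grams are a common stand-in for the "linguistic features"
# such classifiers rely on.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

With realistic volumes of labeled social media data in place of the toy examples, a pipeline along these lines is the kind of tool that could produce the 70 to 80 percent accuracy figures the researchers report.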
"Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression," the researchers, led by Nicolò Pagan from the University of Zurich, noted. Despite trying various optimization strategies, including simple prompting and fine-tuning, emotional cues remained strong indicators that an AI, rather than a human, authored a text interaction.
The Toxicity Tell
The study tested nine large language models: Llama 3.1 8B, Llama 3.1 8B Instruct, Llama 3.1 70B, Mistral 7B v0.1, Mistral 7B Instruct v0.2, Qwen 2.5 7B Instruct, Gemma 3 4B Instruct, DeepSeek-R1-Distill-Llama-8B, and Apertus-8B-2509.
Tasked with generating responses to real social media posts, these models struggled to replicate the typical casual negativity and spontaneous emotional expressions seen in human-authored posts, consistently scoring lower in toxicity across all three platforms examined.
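Toxicity in studies like this is a machine-scored quantity rather than a human judgment. The scorer the paper used isn't specified here; as one hedged example, the open-source Detoxify library (pip install detoxify) assigns each text a toxicity probability between 0 and 1.

```python
# Hedged example: scoring replies for toxicity with the open-source
# Detoxify model. This is one common choice of scorer, not necessarily
# the one used in the study.
from detoxify import Detoxify

replies = [
    "What a wonderful perspective! Thank you so much for sharing this.",  # AI-like
    "lol no. this take gets worse every time i see it",                   # human-like
]

scores = Detoxify("original").predict(replies)
for text, tox in zip(replies, scores["toxicity"]):
    print(f"{tox:.3f}  {text}")

# The study's finding, in these terms: model-generated replies cluster
# at the low end of the toxicity scale, while real posts show more
# casual negativity.
```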
To address these shortcomings, the researchers experimented with optimization strategies, such as providing writing examples and context retrieval, which helped reduce structural differences like sentence length or word count. However, variations in emotional tone remained. The researchers concluded, "Our comprehensive calibration tests challenge the assumption that more sophisticated optimization necessarily yields more human-like output."
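The structural side of that calibration is straightforward to picture: metrics such as word count and average sentence length can be computed directly and compared across human and model replies. A short sketch follows; the feature definitions are assumptions for illustration, not the paper's exact metrics.

```python
# Sketch of the structural features the optimizations were able to align:
# per-reply word count and mean sentence length. Definitions here are
# illustrative assumptions, not the study's exact measures.
import re
from statistics import mean

def structural_features(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    return {
        "word_count": len(words),
        "mean_sentence_len": mean(len(s.split()) for s in sentences)
        if sentences
        else 0,
    }

human = "eh, source? because that's not what the article says at all"
model = "That's a great point, and I really appreciate the thoughtful framing!"
print("human:", structural_features(human))
print("model:", structural_features(model))
```

Features like these are exactly what prompting and fine-tuning can be tuned to match, which is why the remaining gap in emotional tone, rather than in structure, is the study's central result.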