
"To conduct the yet-to-be-peer-reviewed study, the researches applied what they call a "computational Turing test" to posts LLMs on X-formerly-Twitter, Reddit, and Bluesky. They found that posts generated by AI bots - all open-weight models, ranging from DeepSeek to Qwen - were all "readily distinguishable" from ones by human users with a 70-80 percent accuracy rate, which is "well above [the threshold for] chance.""
"One of the major reasons for this, the scholars posit, is that AI can only mimic a human's emotional depth, what we might call the "heat of the moment" vitriol of a typical flame war. When we get into it, we really get into it, with a level of both "toxicity" and "sentiment" that remain unmistakably human. "Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression," the team wrote."
Researchers from Switzerland, the Netherlands, and the US applied a "computational Turing test" to LLM-generated posts on X (formerly Twitter), Reddit, and Bluesky. AI-generated posts from open-weight models such as DeepSeek and Qwen were distinguishable from human posts with roughly 70–80% accuracy, well above chance. The primary shortcoming identified is affective realism: models struggle to reproduce the emotional intensity, "heat of the moment" vitriol, toxicity, and sentiment that characterize human flame wars. Model scale did not guarantee more realistic vitriol; larger models like Llama-3.1-70B performed on par with or below smaller models.
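At its core, the "computational Turing test" is a binary classifier trained to separate AI-generated posts from human ones, with held-out accuracy as the headline metric. Below is a minimal sketch of that idea; the feature choice (TF-IDF), the model (logistic regression), and the toy posts are illustrative assumptions, not the study's actual pipeline.

```python
# Sketch of a "computational Turing test": train a binary classifier
# to separate AI-generated posts from human posts, then report
# held-out accuracy. TF-IDF + logistic regression are assumptions
# for illustration, not the study's actual method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical toy corpus; the study used real posts from X, Reddit,
# and Bluesky alongside replies generated by open-weight LLMs.
human_posts = [
    "lol no way they shipped that broken build again",
    "this take is so bad I don't even know where to start",
]
llm_posts = [
    "That is an interesting perspective worth considering.",
    "Thank you for sharing; I see valid points on both sides.",
]

texts = human_posts + llm_posts
labels = [0] * len(human_posts) + [1] * len(llm_posts)  # 0 = human, 1 = LLM

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0
)

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression()
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_test))
print(f"held-out accuracy: {accuracy_score(y_test, preds):.2f}")
# The study reports roughly 70-80% accuracy on real data, well above
# the 50% chance baseline for a balanced binary task.
```

Anything meaningfully above that 50% baseline means the detector is picking up systematic differences; per the study, the most discriminative signals were affective ones like toxicity and sentiment rather than topic or fluency.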
Read at Futurism