
"A research team based in China used the Claude 2.0 large language model (LLM), created by Anthropic, an AI company in San Francisco, California, to generate peer-review reports and other types of documentation for 20 published cancer-biology papers from the journal eLife. The journal's publisher makes papers freely available online as 'reviewed preprints', and publishes them alongside their referee reports and the original unedited manuscripts. The authors fed the original versions into Claude and prompted it to generate referee reports."
"The AI-written reviews "looked professional, but had no specific, deep feedback", says Lingxuan Zhu, an oncologist at the Southern Medical University in Lianyungang, China, and a co-author of the study. "This made us realize that there was a serious problem." The study found that Claude could write plausible citation requests (suggesting papers that authors could add to their reference lists) and convincing rejection recommendations (made when reviewers think a journal should reject a submitted paper). The latter capability raises the risk of journals rejecting good papers, says Zhu."
"The study also found that the majority of the AI reports fooled the detection tools: ZeroGPT erroneously classified 60% as written by a human, and GPTzero concluded this for more than 80%. Differing opinions A growing challenge for journals is the fact that LLMs could be used in many ways to produce a referee report. What is deemed an 'acceptable' use of AI also differs depending on whom you ask."
Claude 2.0 generated referee reports for 20 published eLife cancer-biology papers, using the original unedited manuscripts as prompts. The AI reviews appeared professional but lacked specific, deep feedback. Claude produced plausible citation requests and persuasive rejection recommendations that could sway editors who are not subject-matter experts. Common AI-detection tools frequently misclassified the AI reports as human-written: ZeroGPT labeled 60% as human and GPTZero labeled more than 80% as human. Because LLMs can be used in so many ways to produce a referee report, journals face hard judgments about what counts as acceptable AI use in peer review.
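The article does not publish the team's actual prompts or pipeline, but the workflow it describes, feeding an unedited manuscript to Claude and asking it to act as a referee, can be sketched with Anthropic's Python SDK. The model id, prompt wording, and manuscript.txt path below are illustrative assumptions, not details from the study.

```python
# Minimal sketch of the reported workflow: prompt Claude with a manuscript
# and ask for a referee report. Not the study's actual code; the prompt,
# file path, and model id are assumptions for demonstration.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# Hypothetical input: the original unedited manuscript as plain text.
with open("manuscript.txt", encoding="utf-8") as f:
    manuscript = f.read()

response = client.messages.create(
    model="claude-2.0",  # the legacy model named in the article
    max_tokens=1500,
    messages=[
        {
            "role": "user",
            "content": (
                "Act as a peer reviewer for a cancer-biology journal. "
                "Write a referee report covering strengths, weaknesses, "
                "and a recommendation for the manuscript below.\n\n"
                + manuscript
            ),
        }
    ],
)
print(response.content[0].text)
```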
Read at Nature