News publishers are blocking the Internet Archive's Wayback Machine

"The Internet Archive has preserved more than one trillion web pages since 1996. Courts cite it. Journalists use it to prove articles were edited after publication. Historians treat it as a primary source."

"AI companies training large language models need vast quantities of high-quality text. Archived news content is exactly that: structured, dated, attributed, high-quality writing accumulated over decades."

"In total, 241 news sites across nine countries explicitly disallow at least one of the Archive's four crawling bots. USA Today Co., the largest newspaper publisher in the US, accounts for a large share of the blocked sites."

"The New York Times implemented what Wayback Machine director Mark Graham described as a 'hard block' starting in late 2025."

A total of 241 news sites across nine countries now disallow at least one of the Internet Archive's four crawlers, limiting its ability to preserve their pages. The Archive has preserved more than one trillion web pages since 1996 and is relied on by courts, journalists, and historians. Publishers are concerned about AI companies using archived news content to train language models without their consent. Sites owned by USA Today Co., the largest newspaper publisher in the US, account for a large share of the blocks, and The New York Times implemented what Wayback Machine director Mark Graham described as a "hard block" starting in late 2025.

#internet-archive #news-organizations #ai-training #web-preservation #crawlers

Read at TNW | Media
