#ai-training-data

from Search Engine Roundtable

5 days ago

Your smart TV may be crawling the web for AI

Bright Data offers streaming services an ad-free monetization alternative by converting smart TVs into residential proxies that collect web data for resale to AI companies.

6 days ago

Anthropic Updates Its Crawler Documentation: ClaudeBot, Claude-User & Claude-SearchBot

ClaudeBot helps enhance the utility and safety of our generative AI models by collecting web content that could potentially contribute to their training. When a site restricts ClaudeBot access, it signals that the site's future materials should be excluded from our AI model training datasets.

Privacy technologies

from Entrepreneur

1 week ago

Most Founders Don't Realize They're Giving Away Their Influence - Here's How to Take It Back

Every search, purchase, loyalty swipe, location ping and scroll feeds systems that now shape pricing, product decisions, hiring and marketing strategies. Most founders understand this in theory, but few grasp the practical consequence: whether they intend to or not, they and their customers are already casting votes with their data. And those votes? They're usually cast passively, on someone else's terms.

Data science

#copyright

from IPWatchdog.com | Patents & Intellectual Property Law

Intellectual property law

Plaintiffs Propose Plan for Landmark $1.5 Billion Copyright Settlement Process with Anthropic

Artificial intelligence

Music publishers sue Anthropic for $3 billion over 'flagrant piracy'

Intellectual property law

New York Times reporter files lawsuit against AI companies

from LawSites

Intellectual property law

Thomson Reuters Tells Appeals Court: ROSS's Copying Was 'Theft, Not Innovation'

Germany news

ChatGPT violated copyright law by harvesting musicians' lyrics, German court rules

Munich court ruled ChatGPT violated German copyright by using protected song lyrics to train its models, ordering damages and allowing OpenAI the right to appeal.

from Business Matters

from IPWatchdog.com | Patents & Intellectual Property Law

AI firm Stability AI wins High Court case against Getty Images over copyright claims

High Court ruled Stability AI did not infringe copyright by training its model on Getty Images' photos, weakening copyright owners' control.

Social media marketing

Reddit INSIDER sends major vote of confidence after earnings

from TheStreet

Artificial intelligence

Reddit INSIDER sends major vote of confidence after earnings

from Social Media Today

Artificial intelligence

Reddit Launches Legal Action to Block AI Companies from Scraping its Data

Tech industry

Reddit sues Perplexity and others for allegedly scraping millions of user comments

from Macon Telegraph

Daily Tech Insider Maps the AI Arms Race From Silicon Valley to the Moon

Major tech companies are committing massive AI infrastructure spending, accelerating deployment, concentrating control, and driving job and market disruptions.

from PetaPixel

Amazon May Launch Marketplace for Publishers to Sell Content to AI Firms

Amazon is exploring a content marketplace enabling publishers to license articles and data directly to AI companies to replace web scraping and monetize content.

#content-licensing

Artificial intelligence

Amazon may launch a marketplace where media sites can sell their content to AI companies | TechCrunch

from CNET

Artificial intelligence

Online Media Brands Hope a New Protocol Will Stop Unwanted AI Crawlers

#web-scraping

Business

Increase of AI bots on the Internet sparks arms race

Artificial intelligence

Anthropic and OpenAI are crawling the web even more and not giving much back

Artificial intelligence

Anthropic Knew the Public Would Be Disgusted by How It Was Destroying Physical Books, Secret Documents Reveal

Intellectual property law

Judge puts Anthropic's $1.5 billion book piracy settlement on hold

Video game company stock prices dip after Google introduces an AI world-generation tool

The stock prices of some major video game companies, including Take-Two Interactive, Roblox, and Unity, had notable declines on Friday, just a day after Google announced its Project Genie tool that lets users prompt AI to generate interactive experiences, Reuters reports. Take-Two's stock price closed at $220.30 (down 7.93 percent from yesterday), Roblox's closed at $65.76 (down 13.17 percent), and Unity's closed at $29.10 (down 24.22 percent).

Video games

#copyright-infringement

from Entrepreneur

Intellectual property law

Anthropic Is Being Sued for $3 Billion Over Music Piracy

Intellectual property law

YouTubers sue Snap for alleged copyright infringement in training its AI models | TechCrunch

Intellectual property law

John Carreyrou and other authors bring new lawsuit against six major AI companies | TechCrunch

from IPWatchdog.com | Patents & Intellectual Property Law

Intellectual property law

Studio Ghibli, Bandai Namco, Square Enix demand OpenAI stop using their content to train AI

Apple

Authors Take Page from Anthropic in Alleging Apple Infringed Works by Training AI on Pirated Books

Artificial intelligence

OpenAI in Danger After Authors Suing It Gain Access to Its Internal Slack Messages

#internet-archive

Media industry

Publishers are blocking the Internet Archive for fear AI scrapers can use it as a workaround

from Nieman Lab

Media industry

News publishers limit Internet Archive access due to AI scraping concerns

from BuzzFeed

If You Use Gmail, You're Going To Want To Turn Off This 1 Automatic Setting ASAP

For Gmail users, there is an automatic opt-in that may allow Google access to your emailed data (think: your personal and work messages, your attachments) "to train AI models," cybersecurity experts allege. If you don't want this information shared, you need to adjust your settings. In the race for companies to get an ROI on AI, we're already seeing large language models running out of new, human-generated data to train on.

from IPWatchdog.com | Patents & Intellectual Property Law

from Global IP & Technology Law Blog

Other Barks & Bites for Friday, January 23: USAA Petition on Section 101 Distributed for Conference; Fifth Circuit Says Trade Secret Claimants Must Apportion Damages; TRAIN Act Introduced in House

New U.S. IP developments: TRAIN Act proposes subpoena power for AI training data; courts and agencies advance major trademark, patent, antitrust, and trade-secret rulings.

A Year On from UK Government Consultation on Copyright and Artificial Intelligence

Those options range from "option 0", simply doing nothing and leaving UK copyright legislation in its currently uncertain state when it comes to the use of copyright materials to train AI models, through to options which would either require specific consent from rights holders in all cases ("option 1") or allow consent to be assumed by AI developers unless a rights holder objects, subject to developers being transparent about what materials have been used in training ("option 3").

UK politics

After Being Pillaged By AI Companies, Wikipedia Signs Deal to Get Paid By Them

Wikipedia is licensing its collection of over 65 million articles to major AI companies through a paid Enterprise program to recoup costs and fund operations.

from Axios

The rise of "web rot"

Older websites persist and degrade search quality and training data, while overall web traffic steadiness masks decline among sites older than five years.

World's largest shadow library made a 300TB copy of Spotify's most streamed songs

Anna's Archive is offering high-speed, enterprise-level access to scraped LLM training data including unreleased collections, raising concerns about facilitating AI labs and legal exposure.

Music

Activist group says it has scraped 86m music files from Spotify

Anna's Archive claims to have scraped 86 million Spotify tracks and metadata, planning to release them online and potentially accelerate AI training on pirated music.

Adobe hit with proposed class-action, accused of misusing authors' work in AI training | TechCrunch

A proposed class-action lawsuit filed on behalf of Elizabeth Lyon, an author from Oregon, claims that Adobe used pirated versions of numerous books, including her own, to train the company's SlimLM program. Adobe describes SlimLM as a small language model series that can be "optimized for document assistance tasks on mobile devices." It states that SlimLM was pre-trained on SlimPajama-627B, a "deduplicated, multi-corpora, open-source dataset" released by Cerebras in June of 2023.

Artificial intelligence

Miscellaneous

from euronews

EU vs. Big Tech: What actions have regulators taken so far?

European regulators are enforcing new AI, digital services, and markets laws to curb Big Tech dominance and protect consumers and creators.

Startup companies

Who's making the most money in AI? It's not who you think

Emerging vendors like Mercor and Handshake profit massively by supplying specialized data, engineers, and labeling services to frontier AI labs pursuing AGI.

India's government wants AI companies to pay for content

India proposes blanket training licenses for AI with royalties paid only upon commercialization, set by a government committee and collected via a centralized nonprofit collective.

#rsl

Intellectual property law

Really Simple Licensing spec makes AI orgs pay to scrape

Media industry

Publishers are fighting back against AI with a new web protocol - is it too late?

Miscellaneous

Google faces a new EU antitrust probe over content used for AI Overviews, YouTube

from Computerworld

EU data protection

European Commission investigates Google's AI training processes

Miscellaneous

Google Zero is under investigation by the EU

EU probes Google for allegedly using publisher and YouTube content to boost its AI features without offering compensation or opt-out options, risking anti-competitive harm.

Europe politics

EU opens investigation into Google's use of online content for AI models

The EU is investigating whether Google used web publishers' and YouTube content to train AI unfairly, disadvantaging rival AI developers and content creators.

Miscellaneous

EU launches Google antitrust probe over AI training

Miscellaneous

EU opens antitrust investigation into Google's AI practices

Publishers say no to AI scrapers, block bots at server level

Millions of websites are blocking AI crawler bots via robots.txt to prevent training-data scraping and reduce non-human server traffic.
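
As a rough illustration of how a robots.txt block of this kind is interpreted, the sketch below uses Python's standard-library `urllib.robotparser`. The robots.txt content, the `allowed` helper, and the example.com URL are hypothetical and written only for this example; ClaudeBot and GPTBot are used as sample crawler names. Note that robots.txt is advisory, so this shows only what a compliant crawler would conclude, not an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow two AI crawlers, allow everything else.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical page URL, used only for illustration.
    page = "https://example.com/article"
    for agent in ("ClaudeBot", "GPTBot", "Mozilla/5.0"):
        print(agent, allowed(agent, page))
```

A site owner could run a check like this against their own robots.txt to confirm which user agents the rules cover before relying on them.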

Startup companies

Micro1, a Scale AI competitor, touts crossing $100M ARR | TechCrunch

Micro1 grew ARR from roughly $7M to over $100M this year by rapidly recruiting and vetting domain experts to supply human training data for AI labs and enterprises.

Google denies analyzing your emails for AI training - here's what happened

I contacted Google for comment, and a spokesperson sent me the following statement: "These reports are misleading - we have not changed anyone's settings. Gmail Smart Features have existed for many years, and we do not use your Gmail content for training our Gemini AI model. Lastly, we are always transparent and clear if we make changes to our terms of service and policies."

Privacy professionals

EU data protection

from www.dw.com

EU plans to ease GDPR laws and AI constraints in major shift | DW | 11/18/2025

EU proposals would narrow GDPR protections, enable broader data harvesting for AI, remove cookie consent pop-ups, and shift burden onto users to request data removal.

from Fortune

Cloudflare CEO says Google is abusing its monopoly in search to feed its AI | Fortune

"The great patron of the internet for the last 27 years was Google. The great villain of the internet today is also Google," Prince said. He claimed that in the past, for every two pages that Google crawled to inform its search engine, it would, on average, send one visitor to those sites, traffic that publishers can monetise with advertising.

Artificial intelligence

#copyright-law

Germany news

Court rules that OpenAI violated German copyright law; ordered it to pay damages | TechCrunch

from IPWatchdog.com | Patents & Intellectual Property Law

Intellectual property law

Labor rules out giving tech giants free rein to mine copyright content to train AI

from Electronic Frontier Foundation

Intellectual property law

Anthropic Settlement Signals AI Innovation Can Thrive Within Existing Copyright Framework

Intellectual property law

Protecting Access to the Law - and Beneficial Uses of AI

Copyright claims against AI training on proprietary legal headnotes threaten beneficial legal-research tools and public access to case law.

UK politics

Adviser to UK minister claimed AI firms would never have to compensate creatives

AI companies will never legally have to compensate creatives for using their content to train their systems.

from Techzine Global

Wikimedia calls on AI companies to use paid API

Wikimedia has called on AI companies to take responsibility for using Wikipedia content in their language models. This can be achieved by stopping scraping and using the paid API instead. In a blog post, the organization states that artificial intelligence cannot exist without the human knowledge collected and maintained on platforms such as Wikipedia. To maintain that balance, Wikimedia asks developers of generative AI to clearly cite their sources and contribute to the continued existence of the open knowledge project via the paid Wikimedia Enterprise platform.

Artificial intelligence

Elon Musk's Grokipedia launches with AI-cloned pages from Wikipedia

Since 2001, Wikipedia has been the backbone of knowledge on the internet. Hosted by the Wikimedia Foundation, it remains the only top website in the world run by a nonprofit. Unlike newer projects, Wikipedia's strengths are clear: it has transparent policies, rigorous volunteer oversight, and a strong culture of continuous improvement. Wikipedia is an encyclopedia, written to inform billions of readers without promoting a particular point of view.

Non-profit organizations

from ABC7 Los Angeles

Elon Musk launches Grokipedia to compete with online encyclopedia Wikipedia

Elon Musk launched Grokipedia, a crowdsourced encyclopedia powered by xAI, presenting itself as a minimalist Wikipedia rival claiming to provide the complete truth.

How AI labs use Mercor to get the data companies won't share | TechCrunch

AI labs hire former senior employees through marketplaces like Mercor to obtain industry workflows and train automation models without corporate data contracts.

from Computerworld

from IPWatchdog.com | Patents & Intellectual Property Law

Canva debuts foundational 'design' model, extends AI tools across its app

Canva has built its own foundational AI model that generates layered designs users can edit more easily. It's one of several generative AI-related features Canva announced Thursday, alongside expanded access to its AI assistant and content generation capabilities across its app. To date, Canva has partnered with a variety of AI model providers for content generation - Black Forest Labs, Google, and OpenAI among them - and it acquired Leonardo AI last year.

Artificial intelligence

#data-scraping

Artificial intelligence

Reddit Dubs Perplexity AI and Data Scraping Companies 'Would-Be Bank Robbers'

from The Mercury News

Artificial intelligence

Reddit sues AI company Perplexity and others for 'industrial-scale' scraping of user comments

from AdExchanger

Tech industry

Sour Scrapes; (Anti)-trust The Process | AdExchanger

from IPWatchdog.com | Patents & Intellectual Property Law

Artificial intelligence

Reddit drags Perplexity in a new lawsuit, accusing it of building up a $20 billion company off stolen data

Facebook's new button lets its AI look at photos you haven't uploaded yet

Meta's opt-in camera-roll feature uploads unpublished photos to its cloud, suggests edits, and can use edited or shared images to train its AI.

Silicon Valley

Scale AI agreed to settle multiple lawsuits from its California contractors

Scale AI agreed to settle four California lawsuits alleging worker misclassification, underpayment, and denied benefits and has stopped hiring California gig workers.

Your Uber driver has a new side hustle: Training AI for cash

According to Uber, beginning later this year, drivers and couriers who opt into the program can complete "digital tasks" within Uber's Driver app. These tasks can include submitting a video of themselves speaking in their native language, uploading pictures of specific everyday items, or presenting documents written in a different language. After tasks are completed, the earnings will be in the users' balance within 24 hours. Compensation depends on the time commitment to complete tasks and their complexity.

Artificial intelligence

Privacy technologies

from Exchangewire

Verve Study Shows That 75% of Consumers are More Open to Watching Ads for Free Content

Consumers increasingly accept ad-supported content while expressing rising concern about data use, especially for AI training.

Inside the web infrastructure revolt over Google's AI Overviews

The new change, which Cloudflare calls its Content Signals Policy, happened after publishers and other companies that depend on web traffic have cried foul over Google's AI Overviews and similar AI answer engines, saying they are sharply cutting those companies' path to revenue because they don't send traffic back to the source of the information. There have been lawsuits, efforts to kick-start new marketplaces to ensure compensation, and more-

Tech industry

Science

from Nature

How stereotypes shape AI - and what that means for the future of hiring

Internet images encode gendered stereotypes: women shown younger and linked to caregiving jobs, men linked to leadership roles, embedding bias in AI training and hiring.

Privacy technologies

Anker offered Eufy camera owners $2 per video for AI training | TechCrunch

Anker paid Eufy users $2 per theft video and encouraged staged recordings to build AI training data, creating privacy and security risks.

AI has already run out of training data - but there's more waiting to be unlocked, Goldman's data chief says

"We've already run out of data," Neema Raphael, Goldman Sachs' chief data officer and head of data engineering, said on the bank's "Exchanges" podcast published on Tuesday.

Artificial intelligence

Privacy professionals

Anker offered to pay Eufy camera owners to share videos for training its AI | TechCrunch

Companies pay users for camera and call recordings to train AI models, creating value for users but introducing significant privacy and security risks.

#call-recording

Privacy professionals

Neon, a buzzy app that pays to record your calls for AI training data, goes offline to address a security scandal

Privacy technologies

Neon, the No. 2 social app on the Apple App Store, pays users to record their phone calls and sells data to AI firms | TechCrunch

from stupidDOPE | Est. 2008

The Future of Content Licensing: How RSL Bridges Publishers and AI | stupidDOPE | Est. 2008

For decades, publishers large and small have created the news, culture, entertainment, and educational resources that shape how society consumes information. Yet in recent years, the rise of artificial intelligence has added a new twist to the ongoing struggle for sustainable publishing. AI companies are building tools capable of generating responses, summaries, and insights trained on vast amounts of web content. The problem? Many publishers see little to no compensation for their role in shaping the data that fuels these systems.

Artificial intelligence

Information security

Exclusive: Neon takes down app after exposing users' phone numbers, call recordings, and transcripts

Neon, an app paying users for call recordings to sell to AI firms, exposed users' phone numbers, recordings, and transcripts through a security flaw.

from Tech.co

How to Stop LinkedIn From Using Your Data to Train Its AI Models

When Is LinkedIn Going to Start Using My Data to Train Its AI Models? LinkedIn has announced that it will start using some of its users' data to train its AI models starting on November 3rd, 2025. Users from the EU, EEA, Switzerland, Canada, and Hong Kong will all be affected. At this stage, US users will not be affected, but this could soon change.

EU data protection

Publishers are finally going after Google. What happens now?

Media companies have filed so many lawsuits against AI companies over the past two years that the act has become routine. When I report on these in The Media Copilot newsletter, they're often digest items, adding to the pile of publishers who want fair compensation for the content AI labs have ingested to create large language models (LLMs). There are so many that elaborate infographics are required to keep track of them all.

Media industry

Record labels claim AI generator Suno illegally ripped their songs from YouTube

Major record labels accuse Suno of pirating songs from YouTube to train AI music models, alleging circumvention of YouTube protections and violations of the DMCA.

Business

from 24/7 Wall St.

3 Growth Stocks to Buy If You Only Have $10,000

Deploy $10,000 into selective growth stocks, prioritizing firms increasing revenue and margins—like Reddit—expecting AI-driven demand to amplify long-term returns.

Google's idea for fixing the AI data drought? Cleaning up risky data.

Generative Data Refinement (GDR) rewrites unsafe, toxic, or PII-containing text using pretrained generative models to purify it for AI training.

Micro1, a competitor to Scale AI, raises funds at $500M valuation | TechCrunch

Micro1 raised $35 million Series A at a $500M valuation while rapidly growing ARR to $50M, positioning to supply human-labeled data for AI labs.

from Business Matters

TikTok tops list of most scraped websites as AI training reshapes data priorities

TikTok became the world’s most scraped website in 2025 after a 321% surge, reflecting AI-driven demand for multimodal training data.

from Fortune

Google's AI is the 'worst' for stealing content, says People CEO | Fortune

When Google became the dominant search engine around 2004, not everyone was happy. Everyone from book publishers to music studios blasted the company for helping itself to copyrighted content without paying. The search giant eventually smoothed things over but now, twenty years later, Google has become the media industry's villain all over again, this time for gobbling that same content to train its AI tools.

Artificial intelligence