Reddit Sues Perplexity Over Alleged Data Scraping for AI Training

Reddit has escalated the debate over data ownership in artificial intelligence by suing Perplexity AI and three other firms — Oxylabs, AWMProxy, and SerpApi — in New York federal court. The lawsuit accuses them of illegally scraping Reddit’s user-generated content and using it to train AI models without permission.

Reddit’s legal action, filed on October 22, 2025, underscores the growing conflict between platforms that host user content and AI developers eager to feed data-hungry models. The complaint alleges that the companies used automated systems to mimic human browsing behavior, bypassing security protocols and violating copyright protections.

“AI companies are locked in an arms race for quality human content — and that pressure has fueled an industrial-scale ‘data laundering’ economy,” said Ben Lee, Reddit’s Chief Legal Officer.

Key Features and Allegations in the Lawsuit

According to the filing, Reddit claims that Perplexity:

  • Accessed Reddit discussions through proxy servers to avoid detection.
  • Used third-party scraping companies to disguise its activities.
  • Increased its use of Reddit data, allegedly by a factor of 40, after receiving a cease-and-desist notice in mid-2025.

The lawsuit further alleges that Perplexity’s platform reproduces Reddit content in AI-generated responses, effectively “repurposing” user posts as training material.

Aspect: Details from Reddit’s Complaint
Court: U.S. District Court, Southern District of New York
Defendants: Perplexity AI, Oxylabs, SerpApi, AWMProxy
Main Allegation: Unauthorized scraping and use of Reddit data for AI model training
Key Claim: Violation of copyright and circumvention of Reddit’s terms of service
Related Prior Case: Reddit v. Anthropic (June 2025)
Requested Remedy: Injunction and monetary damages

Perplexity’s Response

Perplexity strongly denied the accusations, describing the lawsuit as a “show of force” meant to strengthen Reddit’s licensing leverage with OpenAI and Google.

In a statement posted on Reddit, Perplexity said:

“We do not train AI models on Reddit’s content. Our service only summarizes publicly available information and provides citations. It is impossible to license what we don’t store or use.”

The company argued that it accesses data lawfully, through open web protocols, and that Reddit’s data is public — therefore, not subject to exclusive licensing restrictions.

Related Companies and Their Responses

Oxylabs

Oxylabs, a Lithuania-based data infrastructure company, said it was “shocked and disappointed” by Reddit’s legal action, noting that Reddit did not reach out before filing suit.

“Oxylabs has always been a pioneer in ethical public data collection,” said Denas Grybauskas, Chief Governance and Strategy Officer. “No company should claim ownership of public data that does not belong to them.”

SerpApi

SerpApi, which provides real-time search result APIs, said it “strongly disagrees” with Reddit’s allegations and plans to defend itself vigorously in court.

AWMProxy

AWMProxy, a Russia-based proxy service provider, has not issued a statement or been reachable for comment.

Reddit’s Broader Strategy on AI Licensing

This isn’t Reddit’s first legal confrontation over AI training. The platform sued Anthropic in June 2025, alleging similar scraping behavior.

The company has since sought to monetize access to its vast database, which spans over 100,000 communities and billions of discussions. In 2024 and 2025, Reddit struck multi-million-dollar licensing deals with OpenAI and Google, giving them permission to train models on Reddit’s content.

These AI partnerships now account for roughly 10% of Reddit’s total revenue, according to Chief Operating Officer Jen Wong.

Reddit’s AI Licensing Snapshot (2025): Details
Partners: OpenAI, Google
Licensing Revenue: ~10% of total company revenue
Content Included: Public discussions and moderated subreddits
Goal: To ensure fair compensation for human-generated data

Data Scraping: A Growing Industry Challenge

Data scraping, the automated extraction of publicly available information, has become a flashpoint in AI development. Platforms like Reddit, X (formerly Twitter), and LinkedIn have begun restricting API access or charging fees to protect their data from being used without compensation.

Experts say that while scraping itself isn’t always illegal, using scraped data for AI training may cross regulatory boundaries if it violates copyright, licensing, or privacy laws.

“The tension lies in whether ‘public’ means ‘free for commercial use,’” said Laura Hendricks, a technology law professor at Georgetown University. “Courts are now being asked to draw that line — something that will define AI ethics for years to come.”

Industry Implications and Expert Insights

The Reddit-Perplexity case is emblematic of a broader battle in the AI data economy, where platforms, developers, and regulators are struggling to establish fair-use boundaries.

  1. Content Ownership vs. AI Innovation – Social platforms are asserting control over user-generated data to protect both user privacy and potential revenue.
  2. Licensing as a Revenue Model – Platforms like Reddit and Stack Overflow are now monetizing data access as AI companies compete for high-quality conversational data.
  3. Legal Precedents Emerging – Courts are expected to set landmark rulings that clarify whether web data can be freely used for AI model training.
  4. Transparency in AI – AI developers may face mounting pressure to disclose the sources of their training datasets.

“Reddit’s lawsuit could shape the next decade of AI data regulation,” said Michael Carter, policy director at the Center for AI Governance. “If courts side with Reddit, AI companies will be forced to rethink how they acquire and license training data.”

Why It Matters

Reddit’s complaint highlights the economic and ethical stakes of the AI boom: who controls the data that fuels large language models?

If Reddit succeeds, the decision could strengthen data licensing markets and incentivize AI companies to negotiate directly with content providers. But if Perplexity prevails, it could reinforce the idea that public web data remains open territory for machine learning — a precedent with vast implications for future innovation and intellectual property law.

“This is about drawing the boundary between fair use and exploitation,” said Ben Lee of Reddit. “Platforms built on human creativity deserve fair value for their contributions to AI.”

FAQs

What is Reddit accusing Perplexity of doing?

Reddit claims Perplexity and its partners scraped Reddit content without authorization and used it to train AI models or generate AI responses.

Who else is named in the lawsuit?

The case also names Oxylabs, AWMProxy, and SerpApi for allegedly aiding in data collection efforts.

Has Perplexity responded to the lawsuit?

Yes. Perplexity denies the allegations, asserting it does not train models on Reddit data and only summarizes public content.

What makes Reddit’s data valuable?

Reddit hosts one of the world’s largest repositories of user discussions, making it a prized dataset for training conversational AI models.

Why is this case important for the AI industry?

It could set a legal precedent on whether publicly accessible data can be freely used for AI training or must be licensed from content owners.

What’s next for the case?

The U.S. District Court in Manhattan will review Reddit’s claims. Proceedings may take months, with implications for other pending AI-related lawsuits.
