OpenAI Faces Scrutiny Over Deleted Datasets in Legal Dispute with Authors

OpenAI may soon have to explain why it deleted two contentious datasets made up of pirated books, and the stakes in the ongoing class-action lawsuit are high.

OpenAI’s decision to delete the datasets is central to a lawsuit brought by authors who claim that ChatGPT was trained unlawfully on their works, and the deletion could tilt the case in the authors’ favor.

It is undisputed that the datasets, known as “Books 1” and “Books 2,” were deleted before ChatGPT's 2022 release. Former OpenAI employees created the datasets in 2021, primarily by scraping data from a shadow library known as Library Genesis (LibGen).

OpenAI maintains that the datasets fell out of use within that same year, prompting an internal decision to delete them.

However, the authors suspect there is more to the story. They pointed to what they see as a contradiction: the company first retracted its assertion that “non-use” justified the deletion, then claimed that all reasons for the deletion, including “non-use,” should be protected by attorney-client privilege.

This perceived inconsistency, which followed a court-mandated discovery request, has heightened the authors’ interest in uncovering how OpenAI described “non-use” internally.

Recently, US District Judge Ona Wang directed OpenAI to disclose all communications with its in-house legal team regarding the datasets' deletion, along with any internal references to LibGen that OpenAI has withheld under attorney-client privilege.

Judge Wang said OpenAI erred by denying that “non-use” was a reason for the deletion while at the same time asserting that this reason was privileged.
