In a letter filed Friday, the parties told the court they "remain at impasse" following the plaintiffs' request that Google hand over a number of datasets derived from "web crawled sources" in their effort to obtain discovery related to the entire training process for Google's AI products.
"If Google acquired and placed these corpora into its training pipeline, i.e., used them for pre-training, mid-training, post-training (fine-tuning), and ablation studies (testing how training data affects model performance), then these corpora were used to build and train the models," the artists and writers said. "This discovery is thus tethered directly to plaintiffs' allegations."
The artists and writers argued the court has ordered Google to make available the datasets used in training its models, noting that they were not aware of the datasets, nor that they contained copyrighted material, until a deposition last month.
Google, however, has refused to make these newly requested datasets available because it argues the case is only about pretraining, and the requested information was never used to train any of the models at issue, according to the letter.
"This is yet another sideshow," the tech giant said. "Plaintiffs reach beyond the four corners of the operative complaint (which [U.S. District Judge Eumi K. Lee] has foreclosed from further amendment) by making an untimely demand for additional, massive datasets, all but one of which were never even used to train models at issue in the case."
"Their demand would impose substantial undue burden and flout the court-endorsed, negotiated framework for training-data discovery," it added.
Google argued the artists' discovery "demand is grossly disproportionate," rejecting their assertion that production of the datasets is not burdensome. The tech company noted in the letter that in order for datasets to be produced for discovery, they must be located and converted into reviewable formats, among other steps.
It said such a process "has consumed many hundreds of engineering hours and resulted in substantial, continuing storage and compute costs to facilitate plaintiffs' access and process resource-intensive queries."
Representatives for the parties did not immediately respond to requests for comment Monday.
The discovery dispute comes nearly a month after Judge Lee trimmed the consolidated proposed class action and dismissed Google's parent company, Alphabet, altogether, finding the artists and writers plausibly alleged copyright infringement as to six out of 16 of Google's AI products: PaLM, GLaM, LaMDA, Bard, Gemini and Imagen.
However, she said plaintiffs failed to plausibly allege copyright infringement as to 10 other Google AI products cited in the complaint: Codey, Chirp, Veo, MedLM, LearnLM, SecLM, Gemma, CodeGemma, RecurrentGemma and PaliGemma.
"Plaintiffs do not allege any facts regarding these models at all," Judge Lee said in September. "These models are referenced only one time in the complaint, in a long list of what appears to be every generative AI model that Google has ever developed. Because plaintiffs do not allege that any of their works were included in training datasets used to develop these models, plaintiffs do not plausibly allege copyright infringement."
Judge Lee also concluded that the plaintiffs did not show Alphabet had the "practical ability" to control Google's alleged infringement, as required for it to be vicariously liable, and that Alphabet lacks controlling authority over the allegedly infringing conduct as Google's parent company.
The high-stakes litigation was launched in July 2023, alleging Google was "secretly stealing everything ever created and shared on the internet by hundreds of millions of Americans," including millions of registered copyrighted works, to build its generative AI large language models.
The plaintiffs are represented by Joseph R. Saveri, Cadio Zirpoli, Christopher K.L. Young, Evan A. Creutz, Elissa A. Buchanan, Aaron J. Cera, Louis A. Kessler and Alex Y. Zeng of Joseph Saveri Law Firm LLP, Lesley E. Weaver, Anne K. Davis, Joshua D. Samra and Gregory S. Mullens of Bleichmar Fonti & Auld LLP, Ryan J. Clarkson, Yana Hart, Mark I. Richards and Tracey B. Cowan of Clarkson Law Firm PC, and Brian D. Clark, Laura M. Matson, Arielle S. Wagner, Consuela Abotsi-Kowu and Stephen J. Teti of Lockridge Grindal Nauen PLLP.
Google is represented by Paul J. Sampson, David H. Kramer, Maura L. Rees, Qifan Huan, Kelly M. Knoll, Eric P. Tuttle, Madison Welsh and Jeremy P. Auster of Wilson Sonsini Goodrich & Rosati PC.
The case is In re: Google Generative AI Copyright Litigation, case number 5:23-cv-03440, in the U.S. District Court for the Northern District of California.
--Additional reporting by Dorothy Atkins. Editing by Lakshna Mehta.