Dr. Alok Aggarwal

Most GPTs Have Been Trained on Copyrighted or Personal Data

As discussed in previous blog posts and in the upcoming book titled “The Fourth Industrial Revolution and 100 Years of AI (1950-2050)”, Generative Pretrained Transformers (GPTs) are riddled with Machine Hallucinations and Machine Endearment, which together are already creating havoc. Below, we discuss another debilitating limitation that is likely to cause additional mayhem.

All Deep Learning Networks (DLNs), and GPTs in particular, require massive amounts of data, which must be noise-free and hence curated. However, curated data often belongs to the entity that curated it or is protected by copyright. Hence, in Britain, Getty Images and individual artists have sued Stability AI and other AI companies for using their copyrighted data to train AI algorithms without compensating them. Similarly, in January 2023, three artists (Sarah Andersen, Kelly McKernan, and Karla Ortiz) filed a class action lawsuit against Stability AI, Midjourney, and DeviantArt for copyright infringement. In June 2023, two authors, Paul Tremblay and Mona Awad, sued OpenAI for mining data copied from thousands of books without permission, thereby infringing the authors’ copyrights. Interestingly, the Canadian singer Grimes saw a money-making opportunity and recently invited AI companies to use her voice to generate new songs, provided they give her half of their earnings in royalties.

Some of the data that a DLN or GPT is originally trained on is confidential, and this data becomes non-confidential as soon as the DLN or GPT ingests it for training. The same is true of any data ingested during the continuous training of these DLNs and GPTs. This implies that all confidential data and Personally Identifiable Information (PII) within the training data will become non-confidential. Indeed, Samsung realized such a loss of confidentiality after it had used ChatGPT for several weeks and provided it with confidential data. Although Samsung stopped this practice immediately after this realization, some of its data had already become non-confidential. Finally, the U.S. Federal Trade Commission recently launched an investigation into OpenAI, the creator of ChatGPT, to determine whether the company violated consumer protection laws by scraping public data and publishing false information through its chatbot.

In view of such privacy concerns, Italy temporarily banned ChatGPT during March-April 2023. Also, many companies have expressly forbidden third parties from using their data to train AI models. Similarly, the European Union has proposed laws that require any company using DLNs to disclose the copyrighted material used to train them and to perform a fundamental rights impact assessment. Undoubtedly, the emergence of such new regulations and new business models worldwide may impede the current exponential growth of DLNs, and the current hype about DLNs and GPTs may go bust.

The book titled “The Fourth Industrial Revolution and 100 Years of AI (1950-2050)” will be published in October 2023. For details, see www.scryai.com/book

Blog Written by

Dr. Alok Aggarwal

CEO, Chief Data Scientist at Scry AI
Author of the book “The Fourth Industrial Revolution and 100 Years of AI (1950-2050)”