The alleged theft at the heart of ChatGPT

Episode Summary

The podcast discusses how OpenAI's chatbot ChatGPT may have used copyrighted books without permission to train its artificial intelligence. Author Douglas Preston realized ChatGPT could provide detailed information about characters and plot points from his novels, indicating it had likely "ingested" his books. Preston joined with other authors like George R.R. Martin to file a lawsuit against OpenAI for copyright infringement. The authors argue OpenAI scraped their books without permission to create its AI, which will compete with and undermine the market for their works. OpenAI contends its use of copyrighted books qualifies as "fair use," though the authors dispute this. There is precedent for these tech copyright disputes, like when Google scanned millions of library books and Spotify streamed songs without full licensing. The Google case was ultimately decided in Google's favor as fair use, while the Spotify case settled, with the company paying for past infringements. The Spotify case especially shows how a lawsuit can benefit tech firms by consolidating plaintiffs to negotiate with. Though statutory damages could total billions, the authors' lawsuit will likely end in a settlement. Still, it raises concerns about AI companies using copyrighted content, perhaps "asking permission later" as the tech industry tends to do.

Episode Show Notes

When best-selling thriller writer Douglas Preston began playing around with OpenAI's new chatbot, ChatGPT, he was, at first, impressed. But then he realized how much in-depth knowledge GPT had of the books he had written. When prompted, it supplied detailed plot summaries and descriptions of even minor characters. He was convinced it could only pull that off if it had read his books.

Large language models, the kind of artificial intelligence underlying programs like ChatGPT, do not come into the world fully formed. They first have to be trained on incredibly large amounts of text. Douglas Preston and 16 other authors, including George R.R. Martin, Jodi Picoult, and Jonathan Franzen, were convinced that their novels had been used to train GPT without their permission. So, in September, they sued OpenAI for copyright infringement.

This sort of thing seems to be happening a lot lately: one giant tech company or another "moves fast and breaks things," exploring the edges of what might or might not be allowed without first asking permission. On today's show, we try to make sense of what OpenAI allegedly did by training its AI on massive amounts of copyrighted material. Was that good? Was it bad? Was it legal?

Help support Planet Money and get bonus episodes by subscribing to Planet Money+ in
Apple Podcasts or at plus.npr.org/planetmoney.

Episode Transcript

SPEAKER_04: This message comes from NPR sponsor Mizzen & Main. You deserve a comfortable, breathable, packable, and machine washable dress shirt to close out the year in style. Use promo code MONEY to get 25% off orders of $130 or more at MizzenandMain.com. SPEAKER_07: Before we start, this episode discusses Google and Spotify, which are both corporate sponsors of NPR. We also discuss OpenAI. One of OpenAI's major investors is Microsoft, which is also a corporate sponsor of NPR. Here's the show. SPEAKER_01: This is Planet Money from NPR. SPEAKER_07: Douglas Preston got his big break as a writer when he and his co-author published their first novel, Relic, in 1995. SPEAKER_02: Relic is about a brain-eating monster loose in a museum, hunting down and killing people and eating part of their brains. So it's, uh, you know, you will not see my name on the list of Nobel laureates. That's for sure. SPEAKER_05: No Nobel, maybe, but the book was a bestseller. The first of many. SPEAKER_07: And how many books have you written altogether? SPEAKER_02: I'm not sure. I think about 40. SPEAKER_07: Douglas also somehow finds time to write all these articles and books about paleontology and archeology. He's got a lot of interests. He's a curious guy. SPEAKER_05: And one day that curiosity led him to start playing around with the tech world's shiny new thing, artificial intelligence, specifically OpenAI's chatbot, ChatGPT. SPEAKER_07: Douglas got himself an account and started seeing what this fancy new AI chatbot could do. While we talked, he scrolled back through his history and read me some of his earliest queries. SPEAKER_02: I had it write a paragraph about the execution of Socrates. Please discuss Chopin's Piano Concerto No. 1. Discuss the transcendental number e. Okay. SPEAKER_05: So it appeared to know some math and some history and some music, and it didn't take long for Douglas to wonder, does it know me? 
Specifically, did ChatGPT know anything about the books he had written? So he starts testing it. SPEAKER_02: Are you familiar with a character called Whittlesey in the novel Relic? Yes. Dr. Whittlesey is one of the characters in the prologue of the book. He's part of the expedition team that travels to the Amazon rainforest and makes a significant discovery which sets the stage for the events that unfold in the story. SPEAKER_07: Is that answer correct? SPEAKER_02: Yes. SPEAKER_05: And Douglas was like, how does it know all that stuff? SPEAKER_02: The Wikipedia entry on Relic doesn't have this kind of detail. And Relic was reviewed, but the reviews were never fine-grained like that. The only way it would know that is if it had ingested the book. SPEAKER_07: Douglas kept going. He asked about other books he had written. ChatGPT knew that his character, Agent Aloysius Pendergast, had platinum hair and that Corey Swanson was a headstrong forensics expert. SPEAKER_02: It was regurgitating everything. It knew my characters. It knew their names. It knew the settings. It knew everything. SPEAKER_05: So yeah, it certainly seemed like ChatGPT had access to his full books, maybe legitimate digital copies, maybe pirated PDFs floating around the internet. Who knows? SPEAKER_07: But either way, Douglas owned the copyright to all of his books, and no one from OpenAI had asked him whether they could use them, which raised the question, can they do that? Hello and welcome to Planet Money. I'm Keith Romer. SPEAKER_05: And I'm Erika Beras. What happened to Douglas Preston feels a little like a thing that keeps happening to all of us. One giant tech company or another swoops in and just does a bunch of stuff without our permission, like keeping track of the websites we visit. SPEAKER_07: Google, I see you. Or showing up to a city and setting up a new unregulated kind of taxi service, even though the city says you can't do that. Hi, Uber. 
SPEAKER_05: It's like the famous Mark Zuckerberg line, move fast and break things. Tech companies have been doing a lot of that. SPEAKER_07: And the latest example is OpenAI and all those other new AI companies, hoovering up every last piece of human creativity to build their incredibly powerful computer programs. SPEAKER_05: Today on the show, we try to get our heads around what OpenAI is up to. Is it good? Is it bad? Is it legal? SPEAKER_07: And we'll look back at these two formative legal cases that are super fascinating on their own, but also offer us a glimpse of how things with OpenAI might turn out. SPEAKER_03: This message comes from NPR sponsor American Express Business. The enhanced American Express Business Gold Card is designed to take your business further. It's packed with features and benefits like flexible spending capacity that adapts to your business, 24-7 support from a business card specialist trained to help with your business needs, and so much more. The AmEx Business Gold Card, now smarter and more flexible. That's the powerful backing of American Express. Terms apply. Learn more at AmericanExpress.com slash business gold card. This message comes from NPR sponsor Citi. They're not an airline, but their network connects global businesses in nearly 160 local markets. With over two centuries of experience, they're not just any bank. They are Citi. More at Citi.com slash we are Citi. SPEAKER_05: Okay, so we should maybe start by talking a little bit about how ChatGPT works. It's an interface built on a kind of artificial intelligence called a large language model. And what that AI does is essentially predict what the next word in a sentence will be, like autocomplete, but on the grandest scale you can imagine. SPEAKER_07: And to train the AI to do that, computer programmers have to feed it just massive, massive amounts of coherent writing. The technology is only possible because of all that text that it gobbles up. 
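[Editor's note: To make the "autocomplete on the grandest scale" idea concrete, here is a toy sketch of next-word prediction. This is emphatically not how OpenAI trains GPT, which uses a neural network over vast datasets; it is the simplest possible stand-in, a bigram model that counts which word tends to follow which in a made-up scrap of text. The corpus and function names here are invented for illustration.]

```python
from collections import Counter, defaultdict

# A tiny made-up "training set." A real model ingests billions of words.
corpus = (
    "the monster hunts in the museum and the monster eats "
    "part of the brain in the museum"
).split()

# Count, for each word, which words follow it in the training text.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the word most often seen after `word` during training."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))      # "monster" -- it follows "the" most often here
print(predict_next("monster"))  # "hunts"
```

A large language model does the same kind of next-word guessing, but with a neural network instead of a lookup table, which is why it needs that "massive, massive" amount of text to train on.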
SPEAKER_05: What the author Douglas Preston suspected was that a lot of that text came from copyrighted material, his books and other people's books. SPEAKER_02: I'll never forget a conversation I had with my friend George R.R. Martin, and he was really upset. Somebody used ChatGPT to write the final book in my Game of Thrones series. It's my characters, my settings, even my voice as an author that they somehow were able to duplicate using that program. SPEAKER_07: Douglas and George R.R., they got together with 15 other authors and decided to sue OpenAI. SPEAKER_05: Their lawsuit is a class action. They're suing on behalf of themselves and any other professional fiction writers whose work may have been eaten up to create ChatGPT. SPEAKER_07: What evidence do we have that OpenAI was using copyrighted books in its training sets? SPEAKER_06: That is a really good question. SPEAKER_05: That is Mary Rasenberger. She is a copyright lawyer and the CEO of the Authors Guild. Douglas and George R.R. and the other authors wound up partnering with the Authors Guild for their lawsuit against OpenAI. They alleged copyright infringement on an industrial scale. SPEAKER_06: So we do not know because OpenAI, even though they say they're open, they're quite the contrary. They are about as closed as can be in terms of what their training data sets are. SPEAKER_07: Is Jonathan Franzen's book The Corrections in the training data? What about My Sister's Keeper by Jodi Picoult or Lincoln in the Bardo by George Saunders? Those authors, by the way, are all plaintiffs on this lawsuit. SPEAKER_05: To start building their case, the authors and their lawyers went looking for concrete evidence. And if the humans at OpenAI wouldn't disclose their training data, maybe there was a way to trick OpenAI's computer program into giving up its sources. SPEAKER_07: Some of the lawyers working with the Authors Guild got to work trying to coax ChatGPT into revealing what it knows. 
They asked it questions to see how much specific information it can offer up about any particular book. SPEAKER_06: And of course, when you could get it to give you back exact text, clearly it had memorized the book. SPEAKER_07: That seems like a strong sign if it can give you the actual chapter of the book. Because of the court case, Mary was a little cagey about giving exact details here. But other researchers have managed to get ChatGPT to spit up an entire Dr. Seuss book, full chapters of Harry Potter. SPEAKER_05: Still, to really make the case that OpenAI had, in fact, used all these thousands and thousands of books to train its AI, what the authors really needed was access to the company's records. Which, Mary says, was another reason to sue. SPEAKER_06: In a lawsuit, you get discovery, and presumably we'll find out what the training data set is and what was ingested. SPEAKER_07: So, okay. This lawsuit was just filed in September. And so this is kind of where the authors' story pauses for now, because it could take literally years for this to work out. SPEAKER_05: But like we said before, there are precedents for what happens when a giant tech company snatches up heaps of copyrighted stuff. Two cases really stand out here. SPEAKER_07: Okay, case number one, the time Google decided to scan, like, all of the books and put them on the internet. And case number two, the time Spotify decided to go ahead and put, you know, all of the songs on the internet. SPEAKER_05: Okay, let's start with the first case. The one about Google and the books. In some ways, this is kind of the law's first big brush with the problem of how much copyrighted material a tech company can scoop up. SPEAKER_07: It is a case that Mary from the Authors Guild remembers well. SPEAKER_06: So the Google Books case was filed in 2005. SPEAKER_05: Google wanted to create what some people refer to as a digital Library of Alexandria. 
SPEAKER_07: Yeah, they made all of these deals with big university libraries around the country that let them come in and add all these books to their giant searchable Google databases. SPEAKER_06: They had ingested, copied millions of books. They literally were just taking truckloads of books out of libraries and scanning them. SPEAKER_05: Google had permission from the libraries to scan the books, but they did not ask permission from the authors. And around 80% of those books were still protected by copyright. So authors and publishers sued. SPEAKER_07: Now, everyone agreed Google had copied lots of copyrighted material without permission from the authors. But copyright, it's not absolute. There are some exceptions. SPEAKER_05: Yeah, copyright law is trying to balance these two interests. On the one hand, a desire for authors to be allowed to make money from what they've created. But on the other hand, a desire for the rest of society to sometimes be allowed to borrow and remix and play around with the work of those authors. SPEAKER_07: The fancy legal name for the kind of copying that the law says is okay is fair use. SPEAKER_06: So the traditional fair uses are things like quoting, quoting from another book in your book or from a speech, commentary. So when you do a critique of a play or a book, you're going to include perhaps some of the text from it. SPEAKER_05: Rewriting a song to make a parody, Weird Al Yankovic style, is also usually fine. Same with photocopying a couple pages of a novel to teach in an English class. SPEAKER_07: But what about what Google was doing? Scanning millions and millions of books to create a searchable database. No one had ever seen anything like that before. SPEAKER_05: Now, there is no hard and fast rule for what counts as fair use and what doesn't. There are these four different factors that a judge is supposed to look at to decide whether a certain act of copying is permissible. Yeah. 
SPEAKER_07: Is someone going to make money off of it or are they just doing it for the sake of doing it? Will it hurt the market for the original work? Is it a big important chunk that is copied or a small one? And is the thing that was copied transformed somehow into something new? SPEAKER_06: I will say that the test can be somewhat subjective, and you know, the great minds can come out differently sometimes on fair use. SPEAKER_05: The great mind in the Google case, a judge named Pierre Leval, he weighed all those fair use factors and decided that all that copying Google had done was fair use. SPEAKER_07: It all came down to the end product Google had created. A giant database of books that people could search directly that would give them back relevant chunks of these books. It was a way for people all around the world to access books that otherwise might have just gathered dust in the basement of a big library somewhere. Judge Leval thought that was valuable enough to society that it made all that copying legally okay. SPEAKER_05: And it's worth pointing out this kind of weird thing about copyright here. Fair use is not this cut and dried thing. So when a company like Google wants to play around at the edges of copyright, it has to just dive in without knowing for sure whether or not the thing it's doing will turn out to be legal. SPEAKER_06: You can't always predict the outcome, let me say it that way. SPEAKER_07: It is wild that these companies are in some ways incentivized to take a risk of some amount and see if it works out, because the courts will decide one way or another eventually. SPEAKER_06: Well, that's what the tech companies like to do. You know, they like to ask permission later. Just do, don't tell anyone what you're doing, and then just see what happens. SPEAKER_07: And just to bring this back to where we started this episode, that is certainly what it appears OpenAI has done with ChatGPT. 
SPEAKER_05: By the way, we reached out to OpenAI; they declined to comment. But in court filings, they've made it pretty clear that they think what they did to train their AI, that was fair use. SPEAKER_07: Right. Getting a Google Books-type ruling would be a great outcome for OpenAI. Mary, who is part of the suit against OpenAI, does not see it that way. SPEAKER_06: This case is very different than that case, because here the harm is so visible. It's so clear that the marketplace for creators' works will be harmed by generative AI. SPEAKER_05: Which, remember, that is one of the four factors a judge is supposed to look at in a fair use case. How much will the owners of the copyright be financially hurt by the copying of their books? SPEAKER_06: It's the commercial use of the works to develop these machines that will spit out very quickly massive quantities of text that will compete with what they were trained on. That's the issue here. SPEAKER_07: All the Dan Brown novels I could ever want. Yeah. So, okay. If some judge decides that it is fair use for OpenAI to train ChatGPT on copyrighted material, then, like Google Books, that's it. Sorry, authors. SPEAKER_05: But what about the other end of the spectrum? What if a judge says all that copying was against the law? Thousands of authors with dozens and dozens of books, and each one is a copyright violation. SPEAKER_07: After the break, we do the math on how much that could cost OpenAI. And look at the most likely scenario for how all this plays out. SPEAKER_04: This message comes from NPR sponsor Autograph Collection, a collection of almost 300 independent hotels thoughtfully crafted to leave a lasting impression. Each hotel reflects a unique vision which comes to life in every aspect of the experience, from interior design details to unique moments throughout your stay. Visit autographcollection.com and find something unforgettable. 
Autograph Collection is part of Marriott Bonvoy, an extraordinary portfolio of hotel brands and an award-winning travel program. Support for NPR and the following message come from Progresso. When you're having a busy day, stepping away for a break can make all the difference. Progresso invites you to reclaim lunchtime and sink into your favorite cozy armchair. For their traditional chicken noodle soup, savor each bite of warm, flavorful broth and tender chunks of chicken with carrots and celery and tasty noodles. For more delicious ways to reclaim lunchtime, visit Progresso.com. SPEAKER_07: So in the last few years, tech companies have been basically vacuuming up all of human knowledge and culture to train their AIs. And lately, some of the creators of all of that human knowledge and culture have started pushing back. SPEAKER_05: Yeah, in addition to the authors' lawsuit against OpenAI, there are at least eight other lawsuits, brought by songwriters and visual artists and other authors against a bunch of AI companies, all alleging copyright infringement. SPEAKER_07: And like we talked about before, it's possible that legally, all of this is fine, that some court may decide this is fair use. But it's also possible that they won't. SPEAKER_05: So in that world, a judge tells OpenAI, your AI is illegal. Shut it down. Well, the thing is, it's not like OpenAI can simply remove the selected works of Douglas Preston and George R.R. Martin from their AI's brain. The company would have to basically start from scratch and completely retrain their AI. And then there's the money. SPEAKER_07: So let's run a little back-of-the-envelope math here. The statutory damages for a single act of copyright infringement can reach as high as $150,000 per infringement. Figure 10,000 authors, 10 books per author. You know what that multiplies out to? $15 billion. SPEAKER_05: However, it is very unlikely that that will happen, which we will show you through case number two. 
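[Editor's note: That back-of-the-envelope math checks out in a couple of lines. The $150,000 figure is the statutory maximum for willful infringement under US copyright law; the 10,000-author and 10-books-per-author figures are the episode's rough guesses, not numbers from the actual complaint.]

```python
# Back-of-the-envelope statutory damages, using the episode's own figures.
max_damages_per_work = 150_000  # top-end statutory damages per infringed work, in dollars
authors = 10_000                # the episode's rough guess at the class size
books_per_author = 10           # the episode's rough average

total = max_damages_per_work * authors * books_per_author
print(f"${total:,}")  # $15,000,000,000 -- the $15 billion quoted in the episode
```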
SPEAKER_07: Yeah, the time Spotify decided to stream all the songs. This one shows how sometimes a gigantic lawsuit can actually be a good thing for the tech company getting sued. SPEAKER_05: To help explain this one, we reached out to UCLA law professor Xiyin Tang. SPEAKER_01: I guess I would say that I wanted to be a copyright lawyer from the time I was 16, which sounds really weird. Let's say unusual. SPEAKER_07: We don't have to say weird. SPEAKER_01: Yes, it's very unusual. SPEAKER_05: Before she was a professor, Xiyin worked for a few big law firms. SPEAKER_01: I worked on a Red Bull class action where the claim was like, you know, Red Bull gives you wings, but it actually doesn't give you wings. There's no more caffeine in it than a cup of coffee. Or like, you know, I bought this anti-aging product because I thought it would turn back time, and I, you know, I'm 40, but I thought I would look 18 and I don't. And now I'm suing for it on behalf of myself in a class. SPEAKER_05: And Xiyin was one of the lawyers on Spotify's defense team during the big case we're going to talk about. SPEAKER_07: Right. Spotify had been streaming millions and millions of songs, but they hadn't gotten licenses for all of those songs. SPEAKER_01: The two main plaintiffs in the lawsuits that then eventually got consolidated into one lawsuit. One was filed by a songwriter named Melissa Ferrick and another was filed by a songwriter named David Lowery. He was in a band, a couple of bands that, you know, I think a lot of people are familiar with. One was Camper Van Beethoven. One was Cracker. SPEAKER_07: Erika, are you more of a Low fan or more of a What the World Needs Now fan? SPEAKER_05: I actually don't know either of these. SPEAKER_07: No, no Cracker songs? All right. I'll stay over here on Gen X Island by myself. SPEAKER_05: I'm Gen Y. I'm the secret, secret generation that lasted one year after Gen X. Okay. In any event, the lawsuit basically came down to this. 
SPEAKER_07: 90% of the songs that Spotify wanted to stream in the US were managed by a handful of big companies, and Spotify had signed licensing deals with those companies. SPEAKER_05: But that left this last 10% of songs that Spotify also wanted to stream. Spotify hired an outside company to get deals with the copyright holders for those songs, but someone somewhere along the line dropped the ball. And even though they didn't end up getting licenses for all those songs, Spotify went ahead and streamed them anyway. SPEAKER_01: And so Spotify tried and wanted to do everything right, by the books. But the reality is that it's the music publishers themselves that have really bad data that makes it, like, near impossible for someone to figure out who to pay. SPEAKER_07: That feels like an argument that I would be sympathetic hearing from my nine-year-old daughter, in terms of like, I tried to do the right thing, but I couldn't. But legally, would that hold any water? In terms of, it's not our fault. We couldn't do it. We tried. SPEAKER_01: So, you know, I think there's a couple parts to your question. One is, legally would it hold water? No. I mean, legally, it wouldn't hold water. Do they have a point? I think they did. SPEAKER_05: And this is where the Spotify case gets really interesting, because Xiyin says getting sued by those two songwriters was kind of fantastic news for Spotify. SPEAKER_01: I'm definitely not speaking for Spotify here when I say it's almost a blessing, but it does almost feel like a relief to be able to say, oh, now we have this class that's established with all these people in it. Let's pay some amount of money that's not going to bankrupt the business and allow us to say, hey, we're actually paying all these people now, whereas the allegation was that we weren't before, and we can keep operating. 
SPEAKER_07: So I mean, it sounds a little like Spotify's essential problem was not having an opposite side to negotiate with, and the class action essentially gave them somebody to negotiate with. SPEAKER_01: Yes, exactly. We didn't know who to even go out to and talk to about this. And now these people are popping up out of the woodwork and saying, hey, it's me. And you know, thinking about Taylor Swift, I'm the problem. It's me. SPEAKER_07: I have to say, my daughter is into Taylor Swift 24 hours a day, and you said those words and I was like, yep, that song's in my head now. SPEAKER_01: Right? Yeah, I'm the problem. Actually, I'm the legal problem. Negotiate with me. SPEAKER_05: I mean, put yourself in Spotify's shoes. There's this 10% of songs that they wanted to license, but tracking down every indie artist and every indie-indie artist and unspooling the knot of publishing rights, it wasn't gonna happen. SPEAKER_07: And then one day, these two musicians show up and say, we represent that entire 10%. Like, that is kind of great for Spotify. In the end, the class action didn't go to trial. SPEAKER_05: The company and the folks who had songs in that tricky 10% ended up reaching a deal. Spotify agreed to pay them for all its past copyright infringements and set up a system to pay for streaming royalties going forward. SPEAKER_07: And you know, if we were looking for examples of how the class action by the authors against OpenAI might play out, there's a really good chance this is it. No giant dramatic trial, just two sides working out a deal. SPEAKER_05: Xiyin has looked pretty deeply into the history of these kinds of cases. SPEAKER_01: I did a study where I looked at every single class action that was filed between basically the advent of the class action mechanism, you know, a century ago, to recent date, up to the point where the article came out, which was, I think, last year. 
SPEAKER_05: In over 100 copyright class actions, only one ever went all the way to a full trial. SPEAKER_07: And yet they keep being filed. SPEAKER_01: And they keep being filed. And you know, that's why I say it's almost like it's an invitation to settlement, I think. SPEAKER_07: So essentially, we have this whole legal theater, which is just the beginning of a negotiation. SPEAKER_01: Yes, correct. SPEAKER_07: So if you think about the authors' lawsuit from OpenAI's perspective, maybe the lawsuit isn't the worst thing. The company has used all of this copyrighted material, allegedly, hundreds of thousands of books. There is no good way to unfeed all of those books to their AI. But also, it would be a huge pain to track down every single author and work out a licensing deal for those books. So maybe this lawsuit will let them do it all in one fell swoop, by negotiating with this handy group of thousands of authors who have collectively sued them. This episode was produced by Willow Rubin and Sam Yellowhorse Kessler. It was edited by Kenny Malone and fact-checked by Sierra Juarez. Engineering by Robert Rodriguez. Alex Goldmark is our executive producer. Coming up next week on Planet Money: China's economy is on the brink of a crisis. And we're going to figure out how they got there. Quick hint, it's real estate. SPEAKER_00: You know, I was in that game. So if you know you're not taking a maximum risk to expand your business empire, next year you look at your peers and say, like, damn, you know, I only built 10,000 apartments. They're already there selling 15. I'm behind. SPEAKER_07: That's next week on Planet Money from NPR. Special thanks today to Danielle Gervais, Dawa Keila, and Douglas Preston's co-author, Lincoln Child. I'm Keith Romer. SPEAKER_05: And I'm Erika Beras. This is NPR. Thanks for listening. 
SPEAKER_03: This message comes from NPR sponsor Neste MY Renewable Diesel, the only TOP TIER certified fuel made from 100 percent renewable raw materials, backed by two-plus decades of renewable innovation. Make the switch by visiting NesteMY.com.