Has Microsoft finally agreed to pay for intellectual property to train its genAI tools?
To train the large language models (LLMs) that power generative AI (genAI) technology, Microsoft and other AI companies need to use massive amounts of data. The more data, and the higher its quality, the more effective LLMs will be.
So it’s not surprising that Microsoft, OpenAI and other AI companies have become embroiled in lawsuits claiming they steal intellectual property (IP) from newspapers, magazines, writers, publishers and others to train their tools. It could take years to resolve the suits, but if the courts rule against AI companies, they could be liable for billions of dollars and forced to retrain their models without the use of that property.
Now, though, there are signs Microsoft, OpenAI and other tech firms might be willing to pay for the property. They’re only initial steps, but they could set in motion the resolution of one of genAI’s thorniest legal issues.
Will that happen, or will the fight over AI and intellectual property drag on for years? Let’s look at the legal issues involved, then delve into the agreements themselves to find out how this fight might unfold.
Intellectual property theft or fair use?
Microsoft’s Copilot and OpenAI’s ChatGPT (on which Copilot is based) are trained on text, much of which is freely available on the Internet. OpenAI hoovers up whatever it finds online and uses that for training. And it doesn’t pay for it. As far as Microsoft and OpenAI are concerned, it’s open season on intellectual property.
A great deal of what they find is free for the taking, and not covered by intellectual property laws. However, they also take a lot of material that is copyright-protected, including articles in newspapers and magazines, as well as entire books.
OpenAI and Microsoft claim that despite copyright protection, they can use those articles and books for training. Their lawyers argue the material is covered by the fair use doctrine, a complicated and confusing legal concept. For years there’s been an endless stream of lawsuits over what’s fair use and what isn’t. It’s wide open to interpretation.
The New York Times claims its articles aren’t covered by fair use and has sued Microsoft and OpenAI for intellectual property theft. The suit claims Copilot and ChatGPT have been trained on millions of articles without asking The Times’ permission or paying a penny for them. Beyond that, it claims that ChatGPT and Copilot “now compete with the news outlet as a source of reliable information.” It’s seeking “billions of dollars in statutory and actual damages” because of the “unlawful copying and use of The Times’ uniquely valuable works.”
The Times isn’t alone. Many other copyright holders are suing Microsoft, OpenAI and other AI firms as well.
You might think that billions of dollars overstates the articles’ value. It doesn’t. Several years ago, Meta held internal discussions about whether to buy one of the world’s largest publishers, Simon & Schuster, for the sole purpose of using the publisher’s books to train its genAI. The publisher wouldn’t have come cheap: Simon & Schuster was sold in 2023 for $1.62 billion. Meta eventually decided not to try to buy the company.
Paying to play
With that background, it’s noteworthy that 2024 has seen several agreements between Microsoft, OpenAI and publishers that could be the beginning of the end of the fight over intellectual property. The first, struck in May, was between OpenAI and News Corp, allowing OpenAI to use News Corp’s many publications, including the Wall Street Journal, New York Post, Barron’s and others, to train OpenAI applications and answer people’s questions.
It’s a multi-year deal whose precise length hasn’t been publicly disclosed, although most observers believe it will last five years. News Corp gets $250 million, a combination of cash and credits for the use of OpenAI technology.
Other media companies have signed similar agreements with OpenAI, including The Associated Press, People owner Dotdash Meredith, and others.
In November, the other shoe dropped. Microsoft cut a deal with the publisher HarperCollins (owned by News Corp) that lets Microsoft use non-fiction books to train a new genAI product that hasn’t yet been publicly disclosed. It appears that the new tool will be one that Microsoft creates itself, not something based on OpenAI’s ChatGPT.
It’s not yet clear how much money is involved in total. Individual authors have to agree to let their books be used for training. If they do, they and HarperCollins each get $2,500 per book for the three-year term of the deal. The deal is non-exclusive, so the rights can also be sold to others. If authors don’t agree, the books can’t be used for AI training.
The deal takes into account many thorny issues unique to book publishing. Only so-called “back-list” books are involved; that is, newly published books won’t be used for a certain amount of time. The books can only be used for LLM training, so Microsoft and its new genAI can’t create new books from them. The new tool also can’t output more than 200 consecutive words of any book, as a way to guard against intellectual property theft.
Do these deals point towards the future?
The big question is whether agreements like these will ultimately resolve the intellectual property issues involved in training genAI models. I think that’s unlikely, and that’s the way Microsoft and other AI companies want it. At the moment, they’re playing divide and conquer, buying off opponents one by one. That gives Microsoft and other tech companies the upper hand. Intellectual property owners might feel that unless they settle now with big tech firms, those firms will simply take what they want, and the owners will lose out on big money.
The issues involved are too important to be handled that way. The courts should rule on this and rule quickly — and they should side with those who own the intellectual property, not those who want to steal it.