Instagram co-founder Mike Krieger described GitHub’s Copilot as “the single most mind-blowing application of machine learning I’ve ever seen.” Copilot is a game-changing code autocompletion tool that GitHub first released as a technical preview in June 2021. Since then, over a million developers have signed up for the $10/month service, with Copilot writing up to 40% of the code in files where it is enabled. Now GitHub is getting sued for at least $9 billion.
Copilot is mind-blowing. I know because I’ve used it, and my other CS friends who’ve tried it agree. It’s that good because it was trained on code written by thousands, maybe millions, of developers who uploaded their work to GitHub without knowing it would be used to train Copilot. If it’s so ‘mind-blowing,’ you might ask whether some of that code was copyrighted. That’s where the suit comes in.
Programmer and lawyer Matthew Butterick is not impressed. He is suing Microsoft’s GitHub for violating the licenses on some of the code Copilot was trained on. This lawsuit could shape how our data rights are enforced on the internet, specifically in the training of machine learning (ML) models.
Licenses and Copyleft
Some of the code Copilot was trained on was released under the MIT and GPL class of licenses. The MIT License allows commercial use but requires the copyright and permission notice to be included in all copies or substantial portions of the software. The GPL class of licenses is more restrictive: it allows the use, distribution, and even sale of the licensed work, but only if you make your derivative work available under the GPL as well. This type of license is known as “copyleft.”
I think “copyleft” is so cool in a hacker way because it goes directly against the idea of copyright. Copyright is the right to restrict what other people do with your work. You can prevent people from distributing it, performing it in public, and so on.
Copyleft restricts you from using that right. You get free access to “copyleft” works, but anything you build with them must also be freely accessible. It’s like a virus that spreads to any work it touches, making that work open source.
The copyleft philosophy originated with Richard Stallman and the GNU Project, who wrote the GPL, and it was adopted by the most popular open-source operating system today, Linux. These developers wanted to ensure their work would be available to help people and not just appropriated by greedy software companies. So if you distribute copies of the work without abiding by the terms of the GPL, for instance, by keeping the source code secret, like GitHub is doing, then you can be sued by the original author under copyright law.
This is why there’s a $9 billion estimate of statutory damages. The Digital Millennium Copyright Act (DMCA) sets minimum statutory damages at $2,500 per violation, and the suit treats infringements of the MIT and GPL licenses as such violations.
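To get a feel for the scale, here is a rough back-of-the-envelope sketch using only the two figures in this post. It assumes the $9 billion estimate is simply the $2,500 minimum multiplied by a number of violations; the function name `implied_violations` is made up for illustration, and the lawsuit’s own math may count things differently.

```python
# Both figures come from the post; how they combine is an assumption.
ESTIMATED_DAMAGES_USD = 9_000_000_000   # "at least $9 billion"
MIN_DAMAGES_PER_VIOLATION_USD = 2_500   # minimum statutory damages per violation


def implied_violations(total_usd: int, per_violation_usd: int) -> int:
    """Return how many violations the total implies at the per-violation minimum."""
    count, remainder = divmod(total_usd, per_violation_usd)
    if remainder:
        raise ValueError("total is not a whole multiple of the per-violation amount")
    return count


violations = implied_violations(ESTIMATED_DAMAGES_USD, MIN_DAMAGES_PER_VIOLATION_USD)
print(f"{violations:,} violations")  # 3,600,000 violations
```

Under that assumption the estimate works out to 3.6 million violations, which is only a handful per user for a service with over a million sign-ups.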
OpenAI
Let’s not forget the elephant in the room, OpenAI. I’ve written about some of the company’s products before, namely GPT-3 and Dall-E. OpenAI is involved in this suit because Codex, its model for converting text to code, is what powers Copilot.
However, I think there are similar intellectual property concerns with Dall-E, which is basically Copilot for pictures. Dall-E was also trained on the work of artists who did not consent to that use. We love Copilot, and it’s a transformative product, but that is because it was trained on the work of many great programmers. These creators are the real heroes. We shouldn’t violate their licenses, and more importantly, maybe we shouldn’t allow corporations to profit off their work.
For now, students can still get Copilot for free. Instructions here.
Thanks to Saaketh Narayan, Quinn Robinson, and Pia Singh for reading drafts of this.
Thank you for reading my newsletter. Share this post with a friend and subscribe for free because you lose a large percentage of your taste buds while on an airplane. This might explain a lot about those less-than-stellar in-flight meals or why you find yourself craving the saltiest foods while in the sky.