2022-12-03

Open Source developers and a law firm have filed a class action lawsuit against GitHub, its parent company Microsoft, and OpenAI, which developed

Overview


  • GitHub Copilot is a source code generation AI trained on publicly available source code.
  • Much of that code comes with conditions of use set out in a license agreement.
    • For example, “display the author’s name” is one such condition.
  • The code generated by Copilot does not satisfy conditions such as displaying the author’s name.
    • The plaintiffs claim that this violates their licenses.
  • The defendants’ arguments so far
    • Nat Friedman, GitHub’s CEO at the time (he stepped down in November 2021), took the view that the legal doctrine of “fair use” applies to Copilot’s use of data…

Filter


  • GitHub claims that it has a filter in place.
    • “We have a filter (on/off switchable) that detects and alerts us when code is generated that is publicly available on GitHub,” he explains. It is up to the user to decide whether or not to use the proposed code.

  • There are several reports that contradict this.
    • An open source developer claims that Copilot proposed the famous source code of the game “Quake III”.

    • Tim Davis, Professor of Computer Science and Engineering at Texas A&M University, tweeted that Copilot was outputting large chunks of his copyrighted code without attribution or the LGPL license.

  • As a basic premise, both sides first have to establish the facts by presenting evidence that shows whose claim is wrong.
    • My guess is that the filter exists, but there is a gap between what the filter judges to be “the same” and what humans judge to be “the same.”
    • To give a concrete example for non-engineers: if the filter is implemented as “issue an alert if the source code strings match exactly,” it will not alert when only the number of whitespace characters differs. A human, however, would say, “The whitespace may be different, but this is obviously the same code!” and get angry.
      • I (nishio) have a feeling that it was not implemented that badly, even back then.
        • That said, I can’t rule out that a careless engineer doing a rush job would end up with that kind of code.
      • I think what happened is that the parameters were set to be “roughly this safe,” and there turned out to be a surprising number of people who repeatedly do things likely to produce problematic code in order to generate criticism.
      • After the criticism that flared up over the quality of this filter, I have no doubt that an engineer with adequate technical skill reviewed it and the parameters were reset to be stricter.
      • So I’m sure the defendants will argue that the phenomenon was a problem in earlier versions of the software, that it has now been fixed, and that it won’t recur.
      • They will submit a large number of generated results, each paired with the most similar existing source code, as evidence and say, “See, this is not an infringement of the right of reproduction,” and the answer will be, “Yes, that’s right.”
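The gap between what a filter and a human consider “the same” can be sketched in a few lines of Python. This is only an illustration of the whitespace example and of the “roughly this safe” parameter above; it is not how GitHub’s filter actually works, and every function name and the `threshold` parameter are hypothetical.

```python
import difflib
import re


def exact_match(generated: str, known: str) -> bool:
    """Naive filter: alert only when the two strings are identical."""
    return generated == known


def normalize(code: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", code).strip()


def normalized_match(generated: str, known: str) -> bool:
    """Alert when the code is identical after ignoring whitespace differences."""
    return normalize(generated) == normalize(known)


def similarity(generated: str, known: str) -> float:
    """Character-level similarity between 0.0 and 1.0, ignoring whitespace differences."""
    matcher = difflib.SequenceMatcher(None, normalize(generated), normalize(known))
    return matcher.ratio()


def similar_enough(generated: str, known: str, threshold: float) -> bool:
    """Alert when the similarity reaches a tunable threshold (hypothetical parameter)."""
    return similarity(generated, known) >= threshold


known = "def add(a, b):\n    return a + b\n"
reindented = "def add(a, b):\n        return a + b\n"  # only the indentation differs
renamed = "def add(x, y):\n    return x + y\n"         # only the variable names differ

print(exact_match(reindented, known))       # False: the naive filter stays silent
print(normalized_match(reindented, known))  # True: whitespace is ignored
print(normalized_match(renamed, known))     # False: renaming slips through
print(similar_enough(renamed, known, threshold=0.9))
print(similar_enough(renamed, known, threshold=0.8))
```

For the renamed variant, whether an alert fires depends entirely on where the threshold is set, which is the kind of parameter that could later be reset to be stricter.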

Does it fall under fair use?

  • The case is being contested under U.S. copyright law, so the debate is about whether this is fair use or not.
    • If it were litigated in Japan, the question would fall under Article 30-4 of the Copyright Act.
    • Either way, the issue is “whether it unreasonably harms the interests of the copyright holder.”
  • Mr. Butterick, for the plaintiffs
    • “If we let people use the code without a license notice, we kill the open source movement itself.”

    • I don’t know whether this is intentional, but he uses the word “use” ambiguously.
      • For example, taking GPLv3 as a specific license, what it grants is `legal permission to copy, distribute and/or modify it`.
      • The license says nothing in the first place about using this source code as training data for machine learning.
      • If Copilot outputs the same thing as its training data, that could count as copying or modifying, but I’m sure that has been fixed.
  • Past Examples
    • In 2005, a class action lawsuit was filed against Google by the Authors Guild, a writers’ group, and three authors, claiming that the “Google Books” project, which scans, digitizes, and registers library books, constitutes copyright infringement.

    • The Google Books lawsuit was fought for a decade, and in October 2015 a federal appeals court ruled in Google’s favor, finding fair use. The authors sought further review, but it was denied.


This page is auto-translated from [/nishio/GitHub Copilot集団訴訟](https://scrapbox.io/nishio/GitHub Copilot集団訴訟) using DeepL. If something looks interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.