US Court rules out modification in copyright infringement law
It was deliberating on the Authors Guild, Inc. v. HathiTrust and the Authors Guild, Inc. v. Google, Inc. cases.
The United States Court of Appeals for the Second Circuit has ruled that scanning books was fair use because the creation of a full-text searchable database was a “quintessentially transformative use”, meeting the criteria of the first fair use factor. It added that, similarly, Generative Artificial Intelligence (GenAI) transformed semantic relationships from copyrighted works into an internal model and thus met the criteria of the first factor.
The court was deliberating on the Authors Guild, Inc. v. HathiTrust and the Authors Guild, Inc. v. Google, Inc. cases.
Both HathiTrust Digital Library (HDL) and Google had scanned books to allow for searching and were sued by the Authors Guild for copyright infringement.
The court determined that the second factor was not dispositive and stated that the scanning process “provides valuable information about the original, rather than replicating protected expression in a manner that provides a meaningful substitute for the original” and thus met the criteria of the second factor. Similarly, GenAI analyzed the relationships between words, sentences, paragraphs, and concepts in the copyrighted works. Thus, it provided valuable information about the original.
The court stated, “There is no benefit solely from reading or observing the content. Thus, training input cannot be copyright infringement.”
It added that although much of the focus on GenAI has been on training data ingestion, the moment the AI ‘stole’ from creators, that is not, legally, where the real fight should be. Decades of legal precedent, from search engines to image scanning to streaming media, provided a roadmap. No new formulation of copyright law by Congress (as suggested by some academics) was necessary. By considering seven unique features of GenAI systems, copyright analysis becomes quite simple.
1) Training Input
The most common way to obtain training data was to use publicly available sources. In many cases, private data was accessed without permission, such as data behind a paywall that was not purchased, or pirated data was used. But such cases raise problems beyond copyright infringement.
The intent of the Copyright Law was to promote the advancement of knowledge and the arts. The consumption of copyrighted materials by individuals or automated systems is aligned with this purpose. Thus, accessing such materials did not violate any of the exclusive rights of copyright holders: reproduction, creation of derivative works, distribution, public performance, or public display.
2) Storage
Some GenAI systems stored training data for long periods, while others stored it only briefly, just long enough to map relationships between types of elements, such as how words were generally assembled into sentences or how musical notes followed patterns. Whether storage constituted copyright infringement depended on whether the system used short-term or long-term storage.
3) Short-Term Storage
The Copyright Act defines copies as “material objects in which a work is ‘fixed’ in a tangible medium of expression when its embodiment in a copy is sufficiently permanent or stable to permit it to be perceived, reproduced, or otherwise communicated for a more than transitory duration.”
The Second Circuit court established a precedent in the Cartoon Network, LP v. CSC Holdings, Inc. (Cablevision) case, that a copy remaining in memory for 1.2 seconds before being overwritten by subsequent data was transitory.
Although the court noted that its decision was specific to the case, the 1.2 seconds became a de facto minimum threshold. Modern high-speed GenAI systems that do not store data long-term hold input training data for less than 1.2 seconds, and thus such storage does not constitute copyright infringement.
4) Long-Term Storage
Long-term storage of copyrighted data seems to meet the criteria of fixed copies, as these can potentially be “perceived, reproduced, or otherwise communicated.” Thus, the question is whether the long-term storage constitutes fair use.
The four factors of fair use:
1) Purpose and character of the use
2) Nature of the copyrighted work
3) Amount and substantiality of the portion copied
4) Effect upon the potential market for or value of the copyrighted work.
While stating that HDL’s service was fair use by consideration of the third factor, the court held, “Because it was reasonably necessary for the HDL to make use of the entirety of the works to enable the full-text search function, we do not believe the copying was excessive.”
The same argument holds for GenAI, which must copy entire works to understand and learn from the appropriate semantic relationships.
While determining that HDL’s service was fair use by consideration of the fourth factor, the court ruled that because the book scanners did not “allow users to view any portion of the books they are searching in providing the service, the HDL does not add into circulation any new, human-readable copies of any books.”
While holding that Google Books was fair use by consideration of the fourth factor, the court admitted that by providing readers snippets of copyrighted books, there could be some loss of sales. However, it added, “Some loss of sales does not suffice to make the copy an effectively competing substitute that would tilt the weighty fourth factor in favor of the rights holder in the original. There must be a meaningful or significant effect ‘upon the potential market for or value of the copyrighted work’ to not be fair use.”
The US court added that because GenAI storage did not, by itself, give access to the original materials, it met the criteria for the fourth factor, which was relevant to the output.
5) Output
Whether the output of a GenAI system constituted copyright infringement depended on the type of output. The bench identified two types of GenAI output: repurposing and non-repurposing.
6) Repurposing Output
Repurposing GenAI systems learn from one type of data and produce a different type of output. For example, a security system trained on facial images might generate alerts about specific individuals passing a camera. Another might process road sign images to help autonomous vehicles navigate. Copyright infringement requires that an output closely resemble copyrighted training data, either literally or non-literally (e.g., structure, sequence, and organization). Since repurposing GenAI systems produce outputs different from their inputs, they do not infringe copyrights.
7) Non-Repurposing Output
Non-repurposing GenAI creates outputs matching the type of data it was trained on. For example, a non-repurposing system trained on English texts will produce English documents like research papers, legal briefs, or short stories. A system trained on artwork generates artwork, and one trained on music creates songs. These systems can involve copyright infringement.
For copyright infringement, the output must be substantially similar to the protected training material. Infringement can be literal (exact copies) or nonliteral, where the structure or unique elements are copied without the exact words. Literal content from training data that appears in GenAI outputs is not substantial and is unlikely to qualify as literal infringement. Nonliteral infringement, such as imitating another creator’s style, is somewhat subjective and requires further analysis, including whether fair use might apply.
The output of a non-repurposing GenAI system has the same use as the training input. Therefore, the output would not meet the first factor for fair use, which requires a purpose and character different from those of the original.
Whether the second factor, the nature of the copyrighted work, applies to a non-repurposing GenAI system depends on the use of the system. Creating a list of information, such as the list of highest-grossing films or the names of the capitals of each state in the United States, would most likely meet the criteria of this fair use factor. By contrast, creating an artistic work would be less likely to pass this fair use factor. Creating a novel in the style of a specific author would be even less likely to pass it.
Whether the third factor, amount and substantiality of the portion copied, applies to the system depends on specific outputs of the specific GenAI system and how much of the copyrighted training inputs appear in the outputs.
The court held that the fourth factor, the effect on the market, was difficult to gauge.
It questioned how much a novel in the manner of J.K. Rowling or a copy of a Peter Max painting reduces the market for original works. The US court held that in art collecting, widely available prints often boosted the value of originals by increasing public exposure and demand. It asked whether GenAI-generated works could similarly enhance the market for originals. A high-quality Picasso print costs much less than an identical original because the original was made by the artist.
No Modification Necessary
The US Court of Appeals for the Second Circuit thus stated that a review of the existing case law showed that training input was not infringing.
It added that the storage may be short-term (non-infringing) or long-term (fair use). The output may be repurposing (not infringing) or non-repurposing (potentially infringing, assessed case-by-case). The existing court rulings adequately addressed these concerns. Thus, there was no need to change the current Intellectual Property law, as it effectively balanced GenAI innovation with copyright protection.
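For readers who find the framework easier to follow as a lookup, the summary above can be sketched in a few lines of Python. This is only an illustration of the article's outline, not a legal test: the names `Stage`, `ASSESSMENT`, `classify_storage` and `assess` are hypothetical, and treating the 1.2-second Cablevision figure as a hard cutoff is an assumption made for the sketch.

```python
from enum import Enum


class Stage(Enum):
    """Stages of a GenAI system named in the summary above."""
    TRAINING_INPUT = "training input"
    SHORT_TERM_STORAGE = "short-term storage"
    LONG_TERM_STORAGE = "long-term storage"
    REPURPOSING_OUTPUT = "repurposing output"
    NON_REPURPOSING_OUTPUT = "non-repurposing output"


# Outcome the article assigns to each stage (illustrative, not legal advice).
ASSESSMENT = {
    Stage.TRAINING_INPUT: "not infringing",
    Stage.SHORT_TERM_STORAGE: "not infringing",
    Stage.LONG_TERM_STORAGE: "fair use",
    Stage.REPURPOSING_OUTPUT: "not infringing",
    Stage.NON_REPURPOSING_OUTPUT: "potentially infringing, assessed case-by-case",
}

# Assumption: the 1.2 seconds from Cablevision is used as a hard cutoff.
TRANSITORY_SECONDS = 1.2


def classify_storage(seconds_retained: float) -> Stage:
    """Label storage as short-term or long-term by retention time."""
    if seconds_retained <= TRANSITORY_SECONDS:
        return Stage.SHORT_TERM_STORAGE
    return Stage.LONG_TERM_STORAGE


def assess(stage: Stage) -> str:
    """Return the outcome the article associates with a stage."""
    return ASSESSMENT[stage]
```

Under these assumptions, `assess(classify_storage(0.5))` gives "not infringing", while storage retained for an hour falls under "fair use". Only non-repurposing output is left for case-by-case review.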