Barbarians at the Gate: AI and Copyright on a Collision Course

Apr 12, 2019

While the steady advance of applied artificial intelligence (AI) technology promises to dramatically increase the value of computers in daily business life (for example, by helping business professionals monitor and analyze what their competitors are doing), AI and machine learning in competitive intelligence and customer insights applications are on a collision course with copyright law and the content publishing industry.

Traditionally, a search engine delivers up a list of documents with a brief summary in response to a user query. For example, a user wants to know what the sales forecast is for personal cloud services in 2020. The search engine directs the user to one or more documents that contain the forecast; the user is enticed to read these documents by a snippet or summary within the search result. Then the user clicks through or downloads one or more of the documents to find answers to their questions. This click-through or download supports the publishers’ business models in one way or another. 

Enter AI and machine learning. What if the machine can “read” the documents, summarize the material, and extract answers to users’ questions without requiring users to read the documents? The user types in a search query for a forecast of sales of personal cloud services in 2020 and gets back a single response from the search engine, “$4.9 billion,” with no document download required. The market research report or reports the data was taken from are not “consumed” by the user. Perhaps only one copy of the report containing the data was purchased and read by the machine, but every user in the organization can now get answers to questions from the material. Or what if multiple reports from different publishers contributed to the answer?

Providers of search technology are excited by this looming jump in the utility of search and are falling all over themselves to establish their machine learning solutions in the marketplace for application developers to use. For their part, organizations that consume content, in their quest for automated insights, want their AI-enabled search engine to have unfettered access to the largest, most robust, broadest, and most authoritative content sets—but they don’t care if their individual users have direct access to the underlying documents. 

What does this do to a premium content provider’s subscriber base, content licensing revenue, and advertising revenue stream? How does it affect a firm’s intellectual property rights? What are the copyright implications of getting “answers” based on automated analyses of documents instead of providing links to documents? What is the meaning of a content license when the content is analyzed by a machine, the intelligence is consolidated with informational bits from other publishers, and answers are given directly without the user ever needing to download a document? What changes are required to licensing and subscription models of the future to deal with these issues?

In this brave new world of machine learning-based cognitive search, content aggregators may become content publishers’ best friends. When the cross-publisher AI application sits at the aggregator rather than the user organization, aggregators can report to publishers how frequently, by whom, and for what purpose their content is accessed. On the other hand, if a publisher sends full-text content directly to a user organization, the usage details are hidden behind the customer’s firewall and content consumption is a complete mystery. 
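As a rough illustration of the reporting described above, the following Python sketch shows how an aggregator-side application might tally which publishers' content fed each answer, by requesting organization and purpose. Everything in it is hypothetical: the `AnswerEvent` record, its fields, the `usage_report` function, and the sample organizations and publishers are assumptions made for the example, not a description of any real aggregator's system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerEvent:
    """One answer served by a (hypothetical) aggregator-side AI application."""
    user_org: str             # who asked
    purpose: str              # why they asked, e.g. "market-sizing"
    source_publishers: tuple  # publishers whose documents fed the answer


def usage_report(events):
    """Count, per publisher, how often its content contributed to an answer,
    broken down by requesting organization and stated purpose."""
    report = {}
    for event in events:
        # Credit a publisher once per answer, even if several of its
        # documents were used to produce that answer.
        for publisher in set(event.source_publishers):
            entry = report.setdefault(
                publisher,
                {"answers": 0, "by_org": Counter(), "by_purpose": Counter()},
            )
            entry["answers"] += 1
            entry["by_org"][event.user_org] += 1
            entry["by_purpose"][event.purpose] += 1
    return report


# Made-up sample data, for illustration only.
events = [
    AnswerEvent("Acme Corp", "market-sizing", ("Publisher A", "Publisher B")),
    AnswerEvent("Acme Corp", "competitor-tracking", ("Publisher A",)),
    AnswerEvent("Globex", "market-sizing", ("Publisher B", "Publisher B")),
]

report = usage_report(events)
for publisher in sorted(report):
    entry = report[publisher]
    print(publisher, entry["answers"], dict(entry["by_org"]))
```

The point of the sketch is only that this kind of tally is possible when the answering application runs at the aggregator. If the same application ran behind a customer's firewall, the publisher would never see the events it is built from.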

From a content pricing perspective, having machines as “users” fundamentally alters (or obviates) the traditional notion of a “seat” as commonly thought of in content licensing arrangements. One possible solution to this business dilemma may be an “AI license” priced along the lines of an enterprise license. Such a model could help ensure fair value when the “user” is a machine that can digest and synthesize content to produce answers to specific questions on a scale far beyond what an individual human researcher could do. The assumption would be that when an AI machine learning application consumes a document, everyone in the organization consumes it, too.

The AI dilemma also touches web-based content collections from a copyright compliance perspective. Courts in a number of cases have ruled in favor of web search engines sued by website content owners for copyright infringement. In those cases, the courts found that fair use permits web search engines to provide indexes, citation metadata, tags, and excerpts of copyrighted content.

Supporting this finding, courts have ruled that the provision of a web-content index serves a “transformative” purpose: it lets users sift through a large amount of information, a task that would be practically impossible without the search engine indexes. This sifting process is different from any purpose intended for the original content, and hence transformative.

An additional factor supporting the fair use defense has been that web search engines link to the original material on the web. Those links support the argument that the indexing, excerpting, and summarization of web content, as practiced by web search engines, does not shrink the audience for the copyrighted material and actually enlarges it. The question is whether the copied work, in this case the search result excerpts reproducing text from web pages, substitutes for the original content on those pages. Google News reports a 56% click-through rate, which has been cited as evidence that the excerpts of copyrighted web pages in Google’s search results are not in fact substitutes.

However, if an AI application extracts data and/or synthesizes and distills the insights from one or many web-based articles without attribution or links, the content sources would be invisible to the consumer, and there would be no opportunity for the user to click through to the original piece. And so the unresolved question is: Are we in copyright violation territory? If a human analyst read the articles and used the knowledge gleaned to write a report, nobody would blink an eye. However, if a machine reads the articles and summarizes them, it feels different somehow. But is it really?

We are reaching a point of discontinuity in terms of the perceived value of intelligent search summaries (created automatically by machine learning) versus the value of the rich source documents they draw upon. Indeed, many content consumers will ascribe greater value to the summary that offers readily digestible “answers” derived automatically from documents from many publishers than to the underlying source documents.

As a result, the AI industry is on a collision course with the content publishing industry. When the collision occurs, it will be a conflagration of major proportions. Like the Roman sentries who warned about “barbarians at the gate” as Attila the Hun approached, the publishing community would be wise to pay close attention to the role their content will play in AI/machine learning-based enterprise search applications going forward. Both consumers and publishers of digital content should consider whether it is time for the “AI license” to join the ranks of “enterprise licenses” and “seat licenses” as a business model.
