AI is setting off a great scramble for data

August 14, 2023

Not so long ago analysts were openly wondering whether artificial intelligence (AI) would be the death of Adobe, a maker of software for creative types. New tools like DALL-E 2 and Midjourney, which conjure up pictures from text, seemed set to render Adobe’s image-editing offerings redundant. As recently as April, Seeking Alpha, a financial news site, published an article headlined “Is AI the Adobe killer?”

Far from it. Adobe has used its database of hundreds of millions of stock photos to build its own suite of AI tools, dubbed Firefly. Since its release in March the software has been used to create over 1bn images, says Dana Rao, an executive at the company. By avoiding mining the internet for images, as rivals did, Adobe has skirted the deepening dispute over copyright that now dogs the industry. The firm’s share price has risen by 36% since Firefly was launched.

Adobe’s triumph over the doomsters illustrates a wider point about the contest for dominance in the fast-developing market for AI tools. The supersized models powering the latest wave of so-called “generative” AI rely on gargantuan amounts of data. Having already helped themselves to much of the internet—often without permission—model builders are now seeking out new data sources to sustain the feeding frenzy. Meanwhile, companies with vast troves of the stuff are weighing up how best to profit from it. A data land grab is under way.

The two essential ingredients for an AI model are datasets, on which the system is trained, and processing power, through which the model detects relationships within and among those datasets. Those two ingredients are, to an extent, substitutes: a model can be improved either by ingesting more data or by adding more processing power. The latter, however, is becoming difficult amid a shortage of specialist AI chips, leading model builders to be doubly focused on seeking out data.

Demand for data is growing so fast that the stock of high-quality text available for training may be exhausted by 2026, reckons Epoch AI, a research outfit. The latest AI models from Google and Meta, two tech giants, are believed to have been trained on over 1trn words. By comparison, the sum total of English words on Wikipedia, an online encyclopedia, is about 4bn.

It is not only the size of datasets that counts. The better the data, the better the model. Text-based models are ideally trained on long-form, well-written, factually accurate writing, notes Russell Kaplan of Scale AI, a data startup. Models that are fed this information are more likely to produce similarly high-quality output. Likewise, AI chatbots give better answers when asked to explain their working step-by-step, increasing demand for sources like textbooks that do that, too. Specialised information sets are also prized, as they allow models to be “fine-tuned” for more niche applications. Microsoft’s purchase of GitHub, a repository for software code, for $7.5bn in 2018 helped it develop a code-writing AI tool.

As demand for data grows, accessing it is getting trickier, with content creators now demanding compensation for material that has been ingested into AI models. A number of copyright-infringement cases have already been brought against model builders in America. A group of authors, including Sarah Silverman, a comedian, are suing OpenAI, maker of ChatGPT, an AI chatbot, and Meta. A group of artists are similarly suing Stability AI, which builds text-to-image tools, and Midjourney.

The upshot of all this has been a flurry of dealmaking as AI companies race to secure data sources. In July OpenAI inked a deal with Associated Press, a news agency, to access its archive of stories. It has also recently expanded an agreement with Shutterstock, a provider of stock photography, with whom Meta has a deal, too. On August 8th it was reported that Google was in discussions with Universal Music, a record label, to license artists’ voices to feed a songwriting AI tool. Fidelity, an asset manager, has said that it has been approached by tech firms asking for access to its financial data. Rumours swirl about AI labs approaching the BBC, Britain’s public broadcaster, for access to its archive of images and films. Another supposed target is JSTOR, a digital library of academic journals.

Holders of information are taking advantage of their greater bargaining power. Reddit, a discussion forum, and Stack Overflow, a question-and-answer site popular with coders, have increased the cost of access to their data. Both websites are particularly valuable because users “upvote” preferred answers, helping models know which are most relevant. Twitter (now known as X), a social-media site, has put in place measures to limit the ability of bots to scrape the site and now charges anyone who wishes to access its data. Elon Musk, its mercurial owner, is planning to build his own AI business using the data.

As a consequence, model builders are working hard to improve the quality of the inputs they already have. Many AI labs employ armies of data annotators to perform tasks such as labelling images and rating answers. Some of that work is complex; an advert for one such job seeks applicants with a master’s degree or doctorate in life sciences. But much of it is mundane, and is being outsourced to places such as Kenya where labour is cheap.

AI firms are also gathering data via users’ interactions with their tools. Many of these have some form of feedback mechanism, where users indicate which outputs are useful. Firefly’s text-to-image generator allows users to pick one of four options. Bard, Google’s chatbot, similarly proposes three answers. Users can give ChatGPT a thumbs up or thumbs down when it replies to queries. That information can be fed back as an input into the underlying model, forming what Douwe Kiela, co-founder of Contextual AI, a startup, calls the “data flywheel”. A stronger signal still of the quality of a chatbot’s answers is whether users copy the text and paste it elsewhere, he adds. Analysing such information helped Google rapidly improve its translation tool.

Expanding the frontier

There is, however, one source of data that remains largely untapped: the information that exists within the walls of the tech firms’ corporate customers. Many businesses possess, often unwittingly, vast amounts of useful data, from call-centre transcripts to customer spending records. Such information is especially valuable because it can be used to fine-tune models for specific business purposes, like helping call-centre workers answer customers’ queries or business analysts spot ways to boost sales.

Yet making use of that rich resource is not always straightforward. Roy Singh of Bain, a consultancy, notes that most firms have historically paid little attention to the types of vast but unstructured datasets that would prove most useful for training AI tools. Often these are spread across multiple systems, buried in company servers rather than in the cloud.

Unlocking that information would help companies customise AI tools to better serve their specific needs. Amazon and Microsoft, two tech giants, now offer tools to help companies better manage their unstructured datasets, as does Google. Christian Kleinerman of Snowflake, a database firm, says that business is booming as clients look to “tear down data silos”. Startups are piling in. In April Weaviate, an AI-focused database business, raised $50m at a valuation of $200m. Barely a week later Pinecone, a rival, raised $100m at a valuation of $750m. Earlier this month Neon, another database startup, raised an additional $46m in funding. The scramble for data is only just getting started. ■

https://www.economist.com/business/2023/08/13/ai-is-setting-off-a-great-scramble-for-data