In the spring of 2017, in a room on the second floor of Google’s Building 1965, a college intern named Aidan Gomez stretched out, exhausted. It was three in the morning, and Gomez and Ashish Vaswani, a scientist focussed on natural language processing, were working on their team’s contribution to the Neural Information Processing Systems conference, the biggest annual meeting in the field of artificial intelligence. Along with the rest of their eight-person group at Google, they had been pushing flat out for twelve weeks, sometimes sleeping in the office, on couches by a curtain that had a neuron-like pattern. They were nearing the finish line, but Gomez didn’t have the energy to go out to a bar and celebrate. He couldn’t have even if he’d wanted to: he was only twenty, too young to drink in the United States.
“This is going to be a huge deal,” Vaswani said.
“It’s just machine translation,” Gomez said, referring to the subfield of A.I.-driven translation software, at which their paper was aimed. “Isn’t this just what research is?”
“No, this is bigger,” Vaswani replied.
Gomez found Vaswani’s view puzzling. The work the team was pursuing involved a novel kind of neural-network architecture that they called the transformer. Their paper showed how the technology could be used to advance the state of the art in automated translation. But Vaswani seemed to have something else in mind.
Two weeks later, Gomez had returned to school, at the University of Toronto, when he received an e-mail from Łukasz Kaiser, his supervisor on the team, with the subject “Generated Wikipedia Articles.” Kaiser explained that the team had used their transformer-based A.I. model to read Wikipedia, giving the system two days to analyze a little less than half of its entries. They’d then asked it to create five Wikipedia articles for “The Transformer.” The system had responded with fictitious text that was shockingly credible. It described “The Transformer,” a Japanese hardcore-punk band formed in 1968; “The Transformer,” a science-fiction novel by a (fictional) writer named Herman Muirhead; “The Transformer,” a video game developed by the (real) game company Konami; “The Transformer,” a 2013 Australian sitcom; and “The Transformer,” the second studio album by an alternative metal group called Acoustic. None of these Transformers existed, yet the A.I. had written about them authoritatively.
Gomez’s first thought was, How the fuck? The generated Wikipedia entries were filled with inconsistencies, but they were also strikingly detailed. The entry for the punk band offered a lengthy history: “In 2006 the band split up and the remaining members reformed under the name Starmirror.” Where had these details come from? How did the system decide what to write? And why was a neural network designed for translating text capable of writing imaginative prose from scratch? “I was shocked, blown away,” Gomez recalled. “I thought we would get to something like this in twenty years, twenty-five years, and then it just showed up.” The entries were a kind of magic, and it was unclear how that magic was performed.
Today, Gomez, now in his late twenties, is the C.E.O. of Cohere, an artificial-intelligence company valued at five and a half billion dollars. The transformer—the “T” in ChatGPT—sits at the core of what may be the most revolutionary technology of the twenty-first century. PricewaterhouseCoopers has estimated that A.I. could add $15.7 trillion to global G.D.P. by 2030—a substantial share of it contributed by transformer-based applications. That figure only gestures toward some huge but unknown impact. Other consequences seem even more murkily vast: some tech prophets propose apocalyptic scenarios that could almost be taken right from the movies. What is certain, right now, is that linguistic A.I. is changing the relationship between human beings and language. In an age of machine-generated text, terms like “writing,” “understanding,” “meaning,” and “thinking” need to be reconsidered.
A.I. that can create and comprehend language carries the shock of a category violation; it allows machines to do what we thought only people could. The researchers at Google experienced that shock as much as anybody else. The period leading up to the creation of the transformer was like an accidental Manhattan Project. Conversations with its inventors suggest that, seven years later, no one is entirely sure why it has proved as effective as it has.
A few years earlier, the tech world had started taking A.I. seriously largely because of innovations in image recognition. But the Google team—Gomez, Vaswani, Kaiser, Llion Jones, Niki Parmar, Illia Polosukhin, Noam Shazeer, and Jakob Uszkoreit—shared an obsession with language, and a common conviction that it was the path toward a broadly capable artificial intelligence. Shazeer told me that, in terms of the insights it contains, a passage of text is “a thousand times as dense” as a picture. The team approached language mainly through translation because, in addition to being valuable in itself, it made a good A.I. research target. A metric called BLEU allows computer scientists to assess the similarity between machine-translated text and high-quality reference translations done by humans. In the early twenty-tens, before machine learning had matured, many researchers, including Kaiser, worked on translation systems built around parsing—the automated creation of sentence trees, the sprawling diagrams of grammatical dependencies that schoolchildren once learned to make. These syntax-based systems usually achieved adequate BLEU scores—twenty-one, say, for English-to-German translation, with the best rising as high as twenty-three. At that time, a one-point improvement was generally enough for a successful dissertation.
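BLEU itself is simple enough to compute in a few lines. The sketch below uses the NLTK library's sentence-level implementation on a made-up pair of sentences; the scores quoted in this piece are corpus-level figures reported on a scale of zero to a hundred, so the toy number is not directly comparable.

```python
# A toy illustration of BLEU, the metric the researchers were chasing.
# Uses NLTK's sentence-level implementation; the made-up sentences below
# stand in for a human reference translation and a machine's output.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat sat quietly on the old mat".split()]   # human reference translation
candidate = "the cat quietly sat on the mat".split()         # machine output

score = sentence_bleu(
    reference,
    candidate,
    smoothing_function=SmoothingFunction().method1,  # avoids zero scores on very short texts
)
print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the human reference
```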
Computerized translation was notoriously inefficient. A.I.-based systems struggled with the sequential aspect of language, which consumed huge quantities of processing power. A typical recurrent neural network proceeded through a sentence from beginning to end. “It would work one word at a time,” Polosukhin told me. “Read one word, process it. Read next word, process it. Read next word, process it. If you have a thousand words, you have to wait for a thousand cycles.” One of the team’s goals, therefore, was to build a system that could process language while avoiding the time-intensiveness of sequentiality.
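That word-at-a-time loop can be sketched in a few lines. In the sketch below, the weights are random placeholders rather than a trained network, but the structure shows why a thousand-word text forces a thousand dependent steps.

```python
# A toy recurrent step: the network carries a hidden state through the
# sentence one word at a time, so each step must finish before the next
# can begin. Weights and embeddings are random placeholders, not a model.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"your": 0, "mother": 1, "burned": 2, "her": 3, "hand": 4, "on": 5, "the": 6, "stove": 7}
embed = rng.normal(size=(len(vocab), 8))   # one 8-dimensional vector per word
W_in = rng.normal(size=(8, 8))
W_hidden = rng.normal(size=(8, 8))

hidden = np.zeros(8)
for word in "your mother burned her hand on the stove".split():
    x = embed[vocab[word]]
    hidden = np.tanh(x @ W_in + hidden @ W_hidden)  # must complete before the next word is read
print(hidden.round(2))  # a summary of the sentence, built strictly in order
```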
On its face, asking language to make sense without word order seems impossible. We speak words one at a time, and write and read that way, too. But our intuitions about how language works may not reflect what really goes on inside our heads. “How do you know you’re purely sequential?” Vaswani asked me. Anyway, he continued, “why should you impose your restrictions on a machine?” Several ideas were already floating around about how to avoid sequentiality, including convolutional neural networks, which respond to data out of order. Polosukhin described an approach called “bag of words.” “Imagine you open a Wikipedia article and you scramble all the words, and you try to answer a question from that,” he told me. If you saw the sentence “Your mother burned her hand on the stove” as “burned hand her the on stove mother your,” you’d still get the general idea. And yet that might not be true for more complex sentences. Nonsequential methods were faster, but they risked losing coherence.
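The bag-of-words idea is easy to make concrete: throw away the order and keep only which words appear, and how often. A minimal sketch:

```python
# The "bag of words" idea: discard order, keep counts. Both versions of
# the sentence below produce exactly the same bag.
from collections import Counter

original = "your mother burned her hand on the stove"
scrambled = "burned hand her the on stove mother your"

print(Counter(original.split()) == Counter(scrambled.split()))  # True
```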
For many years, A.I. researchers had experimented with a mechanism called attention, which they hoped might be capable of bridging the divide between efficiency and coherence. Attention allows a neural network to dodge sequentiality by seeking relevance. Instead of looking at each word in order, attention looks at all the words in a piece of text together, evaluating how they are interrelated and which are most important to each of the other words, as it captures the over-all meaning. This is closer to the way people remember a text than the way they read it. If you try to recall the opening paragraph of this article, you might articulate a vaguely connected constellation: Aidan Gomez, couldn’t drink, intern, Google, the uncertain potential of a new technology. Those terms, in any order, might amount to the sense you have retained.
In the past, researchers had often combined attention mechanisms with other systems that tried to take into account the convoluted nature of language. But the Google team realized that attention had a singular and important technical advantage that earlier researchers hadn’t leveraged: employing it relied on a relatively simple mathematical operation called matrix multiplication—the multiplication of one table of numbers by another. “The chips we use, they do one thing really well, and that is matrix multiplication,” Gomez told me. If an A.I. system could be built with attention only, forsaking other components, it could work with unprecedented speed.
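The core of the mechanism the team settled on, now known as scaled dot-product attention, can be written in a handful of lines. In the sketch below, random numbers stand in for a trained model's weights and embeddings; nearly every step is a matrix multiplication.

```python
# Bare-bones scaled dot-product attention over a sentence of n tokens,
# each represented by a d-dimensional vector. Every word attends to every
# other word at once; the heavy lifting is a few matrix multiplications.
import numpy as np

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how relevant each word is to each other word
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the whole sentence at once
    return weights @ V                               # blend the values by relevance

rng = np.random.default_rng(0)
n, d = 8, 16                                         # 8 tokens, 16-dimensional vectors
X = rng.normal(size=(n, d))                          # placeholder embeddings, not a trained model
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
print(attention(X, W_q, W_k, W_v).shape)             # (8, 16): one updated vector per token
```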
Before submitting their paper to the Neural Information Processing Systems conference, the team decided to title it “Attention Is All You Need.” “The way a transformer works is, take, let’s say, a sentence . . . and then attention is used to find which words are relevant and then pass that on to the next layer,” Polosukhin explained. The process is repeated through several layers and, at the end, what emerges is constantly improving text prediction. The efficiency of this process allows transformer-based models to easily scale from a single chip on a desktop to a data center with many thousands of processors; moreover, Kaiser said, for reasons that are still being studied, “transformers yield very good and predictable results when scaling up their size.” The network, meanwhile, learns on its own by identifying patterns in the data it examines. “You don’t prescribe what relationships it learns; you don’t say, ‘Your job is to learn associations of adjectives to nouns,’ ” Gomez said. “You just give it the ability to learn whatever it wants.”
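Stacked up, that step is, in caricature, the whole machine. The sketch below repeats a condensed version of the attention calculation across several layers, with a small feed-forward step in between; it leaves out positional encodings, multiple attention heads, normalization, and everything else a real transformer needs, and its weights are, again, random placeholders.

```python
# A caricature of stacking layers: attend over the whole sentence, apply a
# small per-token feed-forward step, and repeat. Real transformers add
# positional information, multiple heads, and normalization; this keeps
# only the skeleton, with random weights rather than a trained model.
import numpy as np

def attend(X, W_q, W_k, W_v):
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(X.shape[-1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ (X @ W_v)

rng = np.random.default_rng(1)
n, d, layers = 8, 16, 6
X = rng.normal(size=(n, d))
for _ in range(layers):
    W_q, W_k, W_v, W_ff = (rng.normal(size=(d, d)) for _ in range(4))
    X = X + attend(X, W_q, W_k, W_v)   # every word looks at every other word
    X = X + np.tanh(X @ W_ff)          # then each word is processed on its own
print(X.shape)                         # (8, 16): one refined vector per token, after six layers
```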
Unfortunately—or fortunately, depending on how you look at it—transformers don’t imitate how the brain works. The transformer’s objective is to learn how to continue text, which it does by establishing relationships between “tokens”: collections of letters, punctuation marks, and spaces. It has no built-in grammar or syntax. It uses an algorithm called backpropagation to improve itself, but as a model of how the brain learns, “backpropagation remains implausible despite considerable effort to invent ways in which it could be implemented by real neurons,” Geoffrey Hinton, the “godfather of A.I.,” wrote, in a 2022 paper. The quest at the beginning of artificial intelligence—to understand how the human mind works—remains as unsolved as ever.
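What “learning how to continue text” means can be shown in miniature. In the sketch below, the token boundaries and the scores are invented for illustration; a trained model produces such scores for every token in its vocabulary and is nudged, during training, toward whatever token actually came next.

```python
# Next-token prediction in miniature: given the tokens so far, score every
# candidate continuation and turn the scores into probabilities. The token
# split and the numbers are illustrative; real tokenizers carve text into
# subword pieces, and real scores come from a trained network.
import numpy as np

context_tokens = ["The", " Transformer", " is", " a", " Japanese"]
vocabulary = [" punk", " hardcore", " band", " novel", " sitcom"]

logits = np.array([2.1, 1.4, 0.3, -1.0, -0.5])         # placeholder scores, not a trained model
probs = np.exp(logits) / np.exp(logits).sum()           # softmax: scores become probabilities
for token, p in zip(vocabulary, probs):
    print(f"{token!r}: {p:.2f}")                        # the continuation is sampled from these
```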
Late in the project, only a couple of weeks before the submission deadline, Parmar and Vaswani were sitting in the Google lobby in Mountain View when they learned that their team’s transformer-based model had attained a BLEU score of more than twenty-six points on English-to-German translation. “Facebook had put out this paper before us, and that was the number we were trying to beat, and it had taken them days to train, and for us it had been a matter of hours,” Parmar recalled. Moreover, the Google team had used a small and primitive transformer-based network; this suggested that, with more resources, their results could quickly be improved. (The model’s final score was 28.4.) In their excitement, Vaswani and Parmar called Uszkoreit, who was driving his four-by-four down from paragliding in the mountains. “Jakob had some old champagne in his car,” Parmar said; they toasted their success with warm bubbly beside the dusty vehicle in the company parking lot.
In the final days, Kaiser, who’d been pursuing, with Gomez’s help, a “unified model of everything”—a neural network that could be trained on images, audio, and text, and then generate new content in the same range of modalities—made a small but vital addition to the paper: he tried training a transformer-based model not just to translate but also to do the old-fashioned work of parsing, and found that it could learn that skill, too, training on a relatively small number of examples. This showed that the model could perform multiple linguistic tasks, working with language generally, rather than with just one of its aspects: it wasn’t just a translation machine but a language machine. Even so, no one expected that the underlying transformer technology would soon be used to build models that could plan vacations, draft and grade undergraduate essays, and replace customer-service representatives.
The true power of the transformer became clearer in the next few years, as transformer-based networks were trained on huge quantities of data from the Internet. In the spring of 2018, Shazeer gave a talk titled “Bigger Is Better,” arguing that scaling transformers led to dramatic improvements and that the process did not appear to plateau; the more you trained the models, the better they got, with no end in sight. At Google, Shazeer was instrumental in developing the LaMDA chatbot, which holds the dubious distinction of being perhaps the first large language model that some poor soul believed to be sentient. At OpenAI, the ultimate result of scaling up was ChatGPT.
If transformer-based A.I. were more familiar and complicated—if, say, it involved many components analogous to the systems and subsystems in our own brains—then the richness of its behavior might be less surprising. As it is, however, it generates nonhuman language in a way that challenges our intuitions and vocabularies. If you ask a large language model to write a sentence “silkily and smoothly,” it will produce a silky and smooth piece of writing; it registers what “silkily and smoothly” are, and can define and perform them. A neural network that can write about Japanese punk bands must on some level “understand” that a band can break up and reform under a different name; similarly, it must grasp the nuances of the idea of an Australian sitcom in order to make one up. But this is a different kind of “understanding” from the kind we know.