This is the second article in a series written by Alona Fyshe, Science Communication Fellow-in-Residence at Amii, exploring the ideas and breakthroughs that shaped modern AI—and the questions we’re still asking.
Let’s be honest, ChatGPT is surprisingly good. As someone who has worked with LLMs (and their predecessors, just regular old LMs) for nearly a decade, I was surprised by ChatGPT’s ability to reason and pick up on patterns. I once spent two years with a student trying to teach an LM to write properly rhyming limericks. These days, ChatGPT can write you a perfect limerick about nearly any topic.
It’s clear that today’s LLMs are closer than ever to human-level language skills, and they continue to improve. For example, OpenAI recently reported on a model that can now solve remarkably difficult language-based math problems. And yet, the way LLMs learn language is nothing like how humans learn it. How can LLMs master language so completely if the learning process is so different?
The Disparity in Language Acquisition
If our approach to teaching babies language mirrored how we train LLMs, it would be the equivalent of sitting a baby down in front of a web browser that pulled up random web pages all day long. This is because we train LLMs by showing them millions of randomly selected samples of text. There is no order to the data; models don’t start with the web pages that use the simplest language. An LLM is as likely to start with a technical manual or a CNN article as with a website actually written for young children. And LLMs are trained on more than a trillion words of text, roughly 1,000 times more words than a typical 13-year-old has experienced in their lifetime.
How do infants experience language in the world? Part of the language that infants experience is what we call child-directed speech. In Western cultures, child-directed speech is marked with a different intonation and cadence. It's often referred to as baby talk, and, compared to adult-directed speech, baby-talk sentences are simpler, shorter, and delivered in a sing-song tone. The words are also more enunciated and paired with exaggerated facial expressions that help infants learn them. When talking to babies, we also take into account what an infant can understand: geopolitics rarely comes up, but words related to toys, food, or the current scenery come up often. Child-directed speech is specially curated and marked in a way that adult-directed speech is not.
So LLMs and babies learn with very different inputs. Language models experience jumbles of unrelated text of varying difficulty over millions of iterations. Children receive far less input, but it is curated and marked in ways that support learning. And so, unsurprisingly, the learning trajectories of babies and LLMs look very different.
Measuring Language Learning Trajectories
This raises an interesting conundrum: during the early stages of language learning, babies cannot reliably say all of the words they understand. So if a baby can't say a word, how can we know that they understand it? The MacArthur-Bates Communicative Development Inventories (MB-CDIs) tackle this problem using parents' reports of their children's word knowledge. Though there are drawbacks to using parental reports (parents may over- or underestimate the number of words their child knows), there are also significant benefits. It is much more efficient to ask parents about the words their children understand than to run laboratory tests of word understanding for dozens of words. And though individual parents may not perfectly gauge their children's word knowledge, averaged over many parent reports, a clear picture of word knowledge emerges.
Wordbank is a collection of many MB-CDI reports from parents around the world. It allows us to explore word learning across multiple languages and language environments. For our purposes, let’s focus on monolingual English infants. Based on parent reports, Wordbank produces learning trajectory graphs that show the percentage of babies reported to have learned a word by a certain age. The age of acquisition for a word is defined as the point at which 50% of babies are reported to know it. Learning trajectory graphs show us that babies tend to learn nouns first, specifically nouns that are pertinent to their experiences (e.g., mommy, daddy, bottle, hi). Function words (words that have a grammatical purpose but carry less semantic meaning) come much later. Below is a graph that shows the learning trajectories for some of the first words babies learn (mommy, hi, yum yum) vs. the first words LLMs learn (on, you, his).
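To make that 50% definition concrete, here is a minimal sketch of how an age of acquisition could be computed from Wordbank-style trajectories. The numbers and the helper function below are invented for illustration (real Wordbank analyses fit smooth curves to many reports rather than interpolating a handful of raw points):

```python
import numpy as np

# Hypothetical Wordbank-style data: for each survey age (in months), the
# fraction of parents reporting that their child knows the word.
ages = np.array([12, 14, 16, 18, 20, 22, 24])  # months
prop_knowing = {
    "mommy": np.array([0.55, 0.70, 0.82, 0.90, 0.95, 0.97, 0.99]),
    "his":   np.array([0.01, 0.03, 0.06, 0.12, 0.22, 0.38, 0.55]),
}

def age_of_acquisition(ages, proportions, threshold=0.5):
    """Age at which the learning trajectory first crosses the threshold,
    linearly interpolating between surveyed ages."""
    above = np.where(proportions >= threshold)[0]
    if len(above) == 0:
        return None  # word not yet acquired within this age range
    i = above[0]
    if i == 0:
        return float(ages[0])  # already above threshold at the youngest age
    # Linear interpolation between the two surrounding survey points.
    x0, x1 = ages[i - 1], ages[i]
    y0, y1 = proportions[i - 1], proportions[i]
    return float(x0 + (threshold - y0) * (x1 - x0) / (y1 - y0))

for word, props in prop_knowing.items():
    print(word, age_of_acquisition(ages, props))
# "mommy" crosses 50% by the earliest survey age; "his" only around 23 months.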
The Case for Human-Like LLM Training
How do we measure when an LLM acquires a word? Taking inspiration from studies of infant word knowledge, Chang et al. (2022) created a metric that mirrors the 50% cutoff used in measures of babies' age of acquisition. Based on this metric, the authors found a very different pattern of acquisition: language models learn words in a way that is strongly tied to their frequency in the training text (Chang et al., 2022). That is, LLMs learn the words they see most often before moving on to rarer words. The most frequent words in language tend to be function words, because they are used in many different contexts.
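Roughly speaking, the metric tracks how well the model predicts a word as training progresses, and marks the word "acquired" at the step where prediction quality first crosses the halfway point between chance level and its final level. Here is a minimal sketch of that idea using surprisal; the learning curves, step counts, and vocabulary size below are invented, and the actual paper fits sigmoid curves to the trajectories rather than reading off raw points:

```python
import numpy as np

def lm_age_of_acquisition(steps, surprisals, vocab_size):
    """Training step at which a word's surprisal first drops below the
    midpoint between chance-level surprisal and its final value --
    a rough analogue of the 50% cutoff used for babies."""
    chance = np.log2(vocab_size)      # surprisal of a uniform random guess
    final = surprisals[-1]            # surprisal at the end of training
    midpoint = (chance + final) / 2.0
    below = np.where(surprisals <= midpoint)[0]
    return steps[below[0]] if len(below) else None

# Invented example: a frequent function word vs. a rarer content word.
steps = np.array([1e3, 5e3, 1e4, 5e4, 1e5, 5e5, 1e6])
surprisal_the = np.array([14.0, 9.0, 5.0, 3.0, 2.5, 2.2, 2.0])
surprisal_yum = np.array([15.0, 14.5, 13.0, 11.0, 9.0, 7.0, 6.0])

vocab = 50_000  # chance surprisal is about 15.6 bits
print(lm_age_of_acquisition(steps, surprisal_the, vocab))  # acquired early
print(lm_age_of_acquisition(steps, surprisal_yum, vocab))  # acquired much later
```

On these made-up curves, the frequent function word crosses its midpoint an order of magnitude earlier in training than the content word, which is the frequency-driven pattern the paper reports.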
But does this matter? If LLMs eventually learn to use language fluently, why does it matter how they got there? It matters because LLMs are not perfectly human language users. For example, they require significant guardrails to avoid producing offensive language, in part because LLMs learn language outside of the cultural context in which babies learn it. And careful experiments show that LLMs are likely still relying on memorization rather than reasoning in some cases. Perhaps the roots of these differences lie in the differences between the training regimes.
Training LLMs is also extremely computationally intensive. It takes a huge amount of capital to train these powerful models, which means that large corporations create and own our most powerful LLMs. This creates an imbalance in the marketplace that makes it difficult for small companies to gain a foothold. LLMs are expensive to train in part because of the huge amount of energy required to run the computers that do the training. Training just one LLM can produce as much carbon as two round-trip flights between New York and Los Angeles. If we could train LLMs in a more human-like way, they might require far less data and computation.
Researchers have taken up this challenge. Last year (2024) saw the second iteration of the BabyLM Challenge, in which teams of scientists competed to train the most accurate LLMs using only a fraction of the usual training data. The competition encourages researchers to explore more creative ways of training LLMs, taking inspiration from the most efficient language learners we know: human babies.
Like all creatives, good scientists question the status quo. Making LLM training more efficient and more human-like flies in the face of most leading-edge LLM research. But more human-like training is one way we might discover the next, more human, ChatGPT.
Alona Fyshe is the Science Communications Fellow-in-Residence at Amii, a Canada CIFAR AI Chair, and an Amii Fellow. She also serves as an Associate Professor jointly appointed to Computing Science and Psychology at the University of Alberta.
Alona’s work bridges neuroscience and AI. She applies machine-learning techniques to brain-imaging data gathered while people read text or view images, revealing how the brain encodes meaning. In parallel, she studies how AI models learn comparable representations from language and visual data.
