This is the first article in a series written by Alona Fyshe, Science Communication Fellow-in-Residence at Amii, exploring the ideas and breakthroughs that shaped modern AI—and the questions we’re still asking.
One of the things that makes LLMs so useful is that they can handle multiple tasks through a language interface. That is, to get an LLM to do something, all you have to do is ask! What made this intuitive interaction possible? It was a series of small advances that, one by one, unlocked the magic of LLMs.
The first advance came by reexamining the way we approach a sub-field of machine learning: supervised learning. It was commonly accepted that, for supervised learning, a model had to be trained for one specific task using a specific dataset that describes the task with multiple examples. For example, if we want to predict whether an email is spam or not based on the words in the email, we would collect a set of emails and label each one as “spam” or “not-spam”. This categorizing of text (emails) into groups (classes: spam or not-spam) is called text classification. This classic supervised learning task has been studied for decades with the same workflow: collect data, label each item in the dataset (by hand), and train a new model on that data — and often only that data — to perform the classification task.
The frameworks that train classification models are built very generally so that we can train models for any classification task. That means these frameworks must have the ability to handle classes with any name. To do this, models represent class names with integers (e.g. spam and not-spam might become 0 and 1). Classification models ignore the content of class names and just substitute them with their corresponding numbers. So the computer doesn’t really know it’s sorting spam from not-spam; it just sorts data into two buckets labeled 0 and 1.
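To see how little the model knows about the labels, here is a minimal sketch of that classic workflow, assuming scikit-learn and a tiny made-up dataset. The label encoder turns "spam" and "not-spam" into 1 and 0 before the model ever sees them:

```python
# A minimal sketch of the classic supervised workflow (assumes scikit-learn).
# The class names are replaced with integers; the model only sees 0s and 1s.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

emails = [
    "Win a free cruise today, click now",
    "Meeting moved to 3pm, see agenda attached",
    "You have been selected for a cash prize",
    "Can you review my draft before Friday?",
]
labels = ["spam", "not-spam", "spam", "not-spam"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)             # "not-spam" -> 0, "spam" -> 1
X = CountVectorizer().fit_transform(emails)   # bag-of-words features

model = LogisticRegression().fit(X, y)        # trained only on this data
prediction = model.predict(X[:1])             # an integer: 0 or 1
print(encoder.inverse_transform(prediction))  # map the integer back to a name
```

The only place the words "spam" and "not-spam" appear is in the bookkeeping around the model; the model itself is just separating bucket 0 from bucket 1.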
But around 2019, LLMs drastically changed our approach to supervised learning.
The shift came from a simple observation: since an LLM produces words as output, the output word itself can be the classification. If we are classifying an email as spam or not-spam, we pass the email to the LLM along with the prompt “Is this email spam?”, and train it to produce the word “spam” if the email is spam and “not-spam” otherwise. After reading the email, the LLM assigns a probability to each of the phrases “spam” and “not-spam”, and we can compare those probabilities to find the most likely class for the email.
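To make this concrete, here is a minimal sketch of that comparison, assuming the Hugging Face transformers library and using GPT-2 purely as a stand-in (an off-the-shelf GPT-2 has not been tuned to answer this prompt, and comparing only the first token of each label is a simplification):

```python
# A sketch of classifying by comparing the probabilities of label words.
# Assumes the transformers and torch packages; GPT-2 is a stand-in model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

email = "Congratulations! You have won a free cruise. Click here to claim."
prompt = f"{email}\nIs this email spam? Answer:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (1, sequence_length, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Compare the probability the model assigns to each label word.
# For simplicity we only look at the first token of each label.
spam_id = tokenizer(" spam")["input_ids"][0]
not_spam_id = tokenizer(" not")["input_ids"][0]
label = "spam" if next_token_probs[spam_id] > next_token_probs[not_spam_id] else "not-spam"
print(label)
```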
This works better than the 0 vs 1 classification models we previously used because it allows us to leverage the huge amount of text the LLM was pretrained on (more than just the specific email spam dataset), and also allows the model to see the semantics of the task through the prompt (“Is this email spam?”) and the actual class labels (“spam” and “not-spam” instead of 0 and 1).
Wrapped up in the first advance is the idea behind the second advance: that we can prompt an LLM by posing a question that provides additional information about how the LLM should respond. This idea unleashed a flurry of research on prompting, including prompt tuning and soft prompts. There was also intense interest from industry, including the posting of a “Prompt Engineer” job at Anthropic with a salary of up to $335,000 USD!
Researchers then realized that a prompt could be expanded to include multiple examples showing how the model should perform the task. This is called “in-context learning” and eliminates the need to fine-tune models on specific tasks. Instead, the model is exposed to training examples in the prompt (context), and no model updates are performed.
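As a sketch, the “training data” is simply pasted into the prompt. The example emails below are made up, and this prompt format is just one of many that work:

```python
# A minimal sketch of in-context learning: the examples live in the prompt,
# and the model's weights are never updated.
examples = [
    ("Win a free cruise today, click now", "spam"),
    ("Meeting moved to 3pm, see agenda attached", "not-spam"),
    ("You have been selected for a cash prize", "spam"),
]
new_email = "Can you review my draft before Friday?"

prompt = "Decide whether each email is spam or not-spam.\n\n"
for email, label in examples:
    prompt += f"Email: {email}\nLabel: {label}\n\n"
prompt += f"Email: {new_email}\nLabel:"

print(prompt)
# The assembled prompt is sent to an LLM, which is expected to continue the
# pattern by producing "spam" or "not-spam" for the new email.
```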
In-context learning is remarkable because it doesn’t require any training. Users can instantly direct a model to perform a new task without creating a task-specific copy of the model, which is resource intensive. However, that is also the technique’s downfall: because we aren’t retraining, there are few mechanisms for improving the performance of in-context models aside from making the models bigger or tweaking the prompt. This raised the question: can we make these models better at adapting to new tasks in general, not just better at the particular tasks shown in the context?
This led to the third advance: the development of Instruction Tuning, which I’ve covered in depth previously. Models are trained on datasets that include multiple prompts (instructions) with multiple examples per prompt. This teaches the model to do tasks more generally, instead of focusing the training on one specific task.
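Here is a minimal sketch of what such a dataset might look like. The field names and examples are illustrative only, not taken from any particular instruction-tuning dataset:

```python
# A sketch of instruction-tuning data: many different instructions (tasks),
# each paired with input/output examples. Field names are hypothetical.
instruction_data = [
    {"instruction": "Is this email spam? Answer spam or not-spam.",
     "input": "Win a free cruise today, click now",
     "output": "spam"},
    {"instruction": "Translate the sentence to French.",
     "input": "The meeting is at three o'clock.",
     "output": "La réunion est à trois heures."},
    {"instruction": "Summarize the passage in one sentence.",
     "input": "A series of small advances, one by one, unlocked the magic of LLMs...",
     "output": "Incremental ideas made LLMs easy to direct with plain language."},
]

# Each example is flattened into one text sequence, and the model is
# fine-tuned to generate the output given the instruction and the input.
for example in instruction_data:
    text = f"{example['instruction']}\n{example['input']}\n{example['output']}"
    print(text, end="\n\n")
```

Because the model sees many different instructions during training, it learns to follow new instructions at test time rather than specializing in a single task.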
The journey from task-specific supervised learning to the more dynamic instruction-following models of today highlights the transformative evolution in how we interact with AI. While multiple other innovations have supported the advancement of LLM usefulness (e.g. reinforcement learning from human feedback, or RLHF), the understated progression from numerical class labels to rich language prompts is one throughline I see in LLM history that seems to be less talked about. And it’s the reason why using an LLM is as easy as asking a question.
Alona Fyshe is the Science Communication Fellow-in-Residence at Amii, a Canada CIFAR AI Chair, and an Amii Fellow. She also serves as an Associate Professor jointly appointed to Computing Science and Psychology at the University of Alberta.
Alona’s work bridges neuroscience and AI. She applies machine-learning techniques to brain-imaging data gathered while people read text or view images, revealing how the brain encodes meaning. In parallel, she studies how AI models learn comparable representations from language and visual data.
