Chain of Thought Explained: How small steps help LLMs solve big problems

Published

Mar 9, 2026

Some problems require more thinking. 

This isn’t surprising: 5+5 requires less thinking than 5*(67+99)/700+4/5. But you might be surprised to hear that we used to constrain the amount that transformer-based Large Language Models (LLMs) could think before answering any question.
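To see why the second expression needs more thinking, here's the arithmetic broken into the chain of intermediate results it depends on. This is just plain Python to illustrate the step count, not anything a model literally runs:

```python
# 5 + 5: a single operation.
simple = 5 + 5  # 10

# 5 * (67 + 99) / 700 + 4 / 5: several dependent steps,
# each building on the result of the last.
step1 = 67 + 99         # 166
step2 = 5 * step1       # 830
step3 = step2 / 700     # ~1.186
step4 = 4 / 5           # 0.8
answer = step3 + step4  # ~1.986
```

The first problem is done in one step; the second forces you to hold intermediate values and combine them in order.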

What do I mean by an LLM “thinking”? 

In this case, it’s something relatively simple: the model performs a set number of mathematical operations to predict the next word in a sequence. The number of operations a model can perform is loosely related to its size, and is akin to the amount of thinking a model can do. LLMs are large, but their size is fixed, and so their thinking power is bounded. This works fine until problems get to a certain size or complexity. Then we start to hit the wall of what an LLM can do.

Making the model bigger seems like a simple solution! Unfortunately, training bigger models brings technical challenges and additional computational costs.  And, as we make models bigger, the data required to train them also grows. So, while AI companies are creating larger and larger AI models, it’s not an easy fix.

So what can we do if we want our models to think more without actually increasing the model size? For an LLM, the only way to think more is to add more steps of computation. One way to increase computation (without training a bigger model) is to have the LLM produce intermediate tokens before its final answer. This is Chain of Thought (CoT), developed in part by my colleague at Amii, Dale Schuurmans.

You can see more from Dale on Chain of Thought in this talk he gave recently:

To train a model to do CoT, we give it examples of reasoning at training time. Then, the model learns to produce intermediate words that give it more space to perform computation.  
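Concretely, a CoT training example (or few-shot prompt) pairs a question with the intermediate reasoning, not just the final answer. The problem below is the well-known tennis-ball example from the original Chain of Thought paper; the Python wrapping is just my illustration of the data's shape:

```python
# With CoT, the target text includes the intermediate steps,
# not just "11".
cot_example = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

# The same example without CoT supervises only the final answer:
direct_example = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls. "
    "How many tennis balls does he have now?\n"
    "A: The answer is 11.\n"
)
```

Trained on examples like the first, the model learns to emit the reasoning tokens before the answer, and each emitted token buys it another round of computation.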

Allowing LLMs a little space to think had a huge impact on performance for some types of problems.  Researchers tested it with GSM8K, a collection of grade school math problems, and saw a huge increase in performance.  CoT also produced performance gains on commonsense reasoning and symbolic reasoning datasets.  

CoT provides more than just an increase in thinking power. It also opens up new kinds of computation. In particular, CoT allows models to re-analyze their past steps. That re-analysis is actually incredibly powerful and is required to solve some kinds of problems. For example, with CoT an LLM can detect whether there is a route between two cities in a network of roads, a problem it could not reliably solve without CoT.
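The route-finding example can be made concrete. Deciding whether two cities are connected means exploring the road network one step at a time while carrying forward what you have already seen, which is exactly the kind of sequential, revisitable work CoT gives a model room for. Here is a small breadth-first search sketch of that check, with a made-up road network of my own:

```python
from collections import deque

def route_exists(roads, start, goal):
    """Breadth-first search: is there a path from start to goal?"""
    seen = {start}
    frontier = deque([start])
    while frontier:
        city = frontier.popleft()
        if city == goal:
            return True
        for neighbour in roads.get(city, []):
            if neighbour not in seen:  # compare against past steps
                seen.add(neighbour)
                frontier.append(neighbour)
    return False

# A hypothetical road network (undirected: each road listed both ways).
roads = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B"],
    "D": ["E"],
    "E": ["D"],
}
```

Here `route_exists(roads, "A", "C")` is true while `route_exists(roads, "A", "D")` is false, and getting either answer requires a number of steps that grows with the network, not a single fixed-size computation.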

Chain of Thought is a simple but powerful idea: some problems require only a little bit of thought, and some require more.  More importantly, CoT helped computer scientists think carefully about what transformer models were capable of, which made it clear why LLMs could solve some problems but not others.

Alona Fyshe is the Science Communications Fellow-in-Residence at Amii, a Canada CIFAR AI Chair, and an Amii Fellow. She also serves as an Associate Professor jointly appointed to Computing Science and Psychology at the University of Alberta.

Alona’s work bridges neuroscience and AI. She applies machine-learning techniques to brain-imaging data gathered while people read text or view images, revealing how the brain encodes meaning. In parallel, she studies how AI models learn comparable representations from language and visual data.
