Why Longer Contexts = Stronger Models
Something that seems very obvious in hindsight, pointed out over the course of a few thousand tokens.
A mistake a lot of people make is thinking about a model's capabilities in isolation from one another. Take the size of a model's context window. For most people, that number just acts as a limit on the size of the documents they can paste into a conversation. For a handful of professionals, it might restrict which tasks they can use the model for; maybe your application absolutely requires >100k context. For people running models locally, a larger context window also means the model can produce much longer outputs, say for creative tasks (API providers typically cap output tokens at an arbitrary number, lower than what the model is capable of, to save costs). We think of context as just how much text we can show the LLM at once, rather than as a fundamental quality that directly determines much of what the LLM can do.
The size of the context window determines more than just the upper bound on the documents you can feed into a model. Unless the model is stubborn like GPT-4, the context window largely determines how flexible and controllable the model is, and how well it can be adapted to new tasks. The more context, the more customizable the model becomes. Why?
Because, more important than allowing longer chat sessions or letting you show a model massive documents, context windows determine how many few-shot examples you can show an LLM.
The more examples you can show a model, the better it can learn novel tasks, ones it has never seen anything like before.
The more examples you can show a model, the better it can pick up the difficult, nuanced aspects of a task, even writing style.
With long enough context, you can use few-shot examples to enforce consistency across very long generation tasks.
It’s not about fitting the input for a task into the context window. It’s about having enough spare room that you can show the LLM how to do that task, two or three or four times. It’s about being able to show the task enough times that even your wildest, most complex use cases can be done at a consistent level of quality, for production, without any finetuning.
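To make that concrete, here's a minimal sketch of the packing logic in Python. Everything in it is illustrative: the function names are mine, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer (swap in your model's actual tokenizer if you need exact counts). The point is the shape of the tradeoff: every example you pack in has to fit inside the window alongside the real input and the room you reserve for the output.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English prose.
    Replace with the model's real tokenizer for exact budgeting."""
    return len(text) // 4


def build_few_shot_prompt(
    system: str,
    examples: list[tuple[str, str]],  # (example input, example output) pairs
    task_input: str,
    context_window: int,
    output_budget: int,
) -> str:
    """Pack as many few-shot examples as the window allows, reserving
    room for the real input and for the generation itself."""
    budget = context_window - output_budget
    used = estimate_tokens(system) + estimate_tokens(task_input)
    parts = [system]
    for ex_in, ex_out in examples:
        cost = estimate_tokens(ex_in) + estimate_tokens(ex_out)
        if used + cost > budget:
            break  # out of room; a bigger window would fit more examples
        parts.append(f"Input:\n{ex_in}\n\nOutput:\n{ex_out}")
        used += cost
    parts.append(f"Input:\n{task_input}\n\nOutput:\n")
    return "\n\n".join(parts)
```

With an 8k window and a 4k output budget, that loop barely fits a single example of any real size; with a 32k window it happily packs in several full demonstrations. The whole argument of this post lives in that one `if` condition.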
My largest prompt ever is also one of my most powerful. It's more than 20 thousand tokens long: a behemoth meant to generate outputs around 4 thousand tokens long. The instructions are so complex and nuanced that the system prompt alone is about 3.5k tokens, and that's after a comprehensive edit.
But because I'm using models with big enough context windows, the two few-shot examples pull their weight, and it works. Consistently. Even the writing style barely feels like an AI's, because the model has picked up on the nuances after seeing just so much damn text.
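For a concrete sense of the arithmetic: the ~20k total, the ~3.5k system prompt, and the ~4k output are the real numbers from above, but the per-example split below is my own back-of-the-envelope guess at how a prompt like that divides up.

```python
# Rough budget for the prompt described above. The system prompt, total
# size, and output size match the numbers in the post; the example/input
# split is an illustrative assumption.
system_prompt = 3_500
example_size  = 7_500   # assumed: one full input+output demonstration
num_examples  = 2
task_input    = 1_500   # assumed: the document actually being processed

prompt_tokens = system_prompt + num_examples * example_size + task_input
total_needed  = prompt_tokens + 4_000  # plus the generation itself

print(prompt_tokens)  # 20000: already far past an 8k or 16k window
print(total_needed)   # 24000: why a 32k-class window is the comfortable fit
```

An 8k-context model can't even load that prompt, let alone generate from it. That gap is exactly the flexibility the extra context buys.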
That story is meant to show that context = flexibility and power. More context means more examples means more tasks can be taught to the model, to a higher level of accuracy. This is a fundamental edge Mistral and Command R have over Llama, and it’s why I still use those models for my hardest tasks, even though Llama 3 is extremely smart.
Models with longer contexts can learn more about the task, so they're more capable when the task is novel.
Context goes way beyond capping how large your input can be.
It may seem obvious that more context means more examples, but the quantity of examples is so core to what you can do with a prompt — and therefore so core to what you can do with a model — that I feel the importance of having that room to work with is underappreciated. Hence, I just spent a few thousand tokens here myself talking about why context is key for performance on the hardest tasks.
LLMs are great BECAUSE you can prompt them: because you can show them how to do something new with instructions and examples. More context = more instructions and examples = the LLM is better at doing what makes LLMs great. It’s that simple. At least, that’s how it seems to me.
This post is less a deep dive into a new principle, and more an introduction to a different perspective on a key aspect of LLMs. Perhaps it’s timely, because lots of new LLMs have been coming out lately. I hope this was useful to you and that seeing context in this way inspires you to take on even tougher and more rewarding prompting challenges with your open-source models.
I write a lot about prompting but really 90% of it is few-shot examples, after all.
One brief question for you all before I sign off. I'm curious: how many of you know or follow me outside of Substack? I'm interested to know whether the people who know me on Substack know of me elsewhere (for instance, through Augmentoolkit), or whether my writing has attracted a whole different organic audience.
Thanks for your time. I hope this was useful, have a good one and I’ll see you next week!