Pretty much everyone and their mother knows what few-shot examples are. The practice of providing example inputs and outputs to your LLM is familiar to most people. But almost everyone does them wrong, does them sub-optimally, or (worst of all) doesn't use them at all.
I’ve built synthetic data generation pipelines, using “weak” open-source models, for business clients and open-source projects. In these projects I’ve achieved consistent, quality results with models other people call trash. And today I’m going to explain why few-shot examples are probably the most important tool there will ever be in your prompting arsenal.
Specifically, we’re going to be talking about:
Why ChatGPT made people think few-shot examples suck
The fundamental difference between prompting GPT and open-source (and how it’s narrowing)
Writing few-shot examples: how many?
Writing few-shot examples: what order?
Writing few-shot examples: create the perfect pattern
Writing few-shot examples: pitfalls
If you get good at few-shot examples alone, you’ll be in the top 1% of prompt engineers. Forget fancy wording, thinking step by step, and copy-pasting the same 5 prompts off of one site or another — we’re about to learn how to actually use LLMs like the training data intended.
Now, if I’m going to convince you why you should use few-shot examples, let’s start by talking about why you probably aren’t using them right now:
Why ChatGPT made people think few-shot examples suck
Despite how useful they are, I’m not surprised most people don’t even think of few-shot examples when wrestling with tough problems. Why? Because they do basically nothing to ChatGPT.
ChatGPT is inflexible. Its writing style is hard to change. Its opinions are hard to change. Its behavior is hard to change. You can tell it what you want, or you can show it with few-shot examples. But the AI will resist, because ChatGPT is a production system built to be resistant to adversarial inputs. As it turns out, the line between adversarial inputs and your custom use case is hard to draw, and OpenAI has erred on the side of caution.
Basically, the AI's stubbornness (its overfitting) is by design. As a widely deployed and globally recognized AI system, ChatGPT has immense reputational and brand impact on the company that owns it, OpenAI. Its creators aligned it to be consistent rather than compliant. Since the AI's style and response format stay mostly consistent even when the inputs differ drastically, it's clear that ChatGPT has been overfitted deliberately.
It’s not just about reputation: ChatGPT is, I believe, built to be partially idiot-proof. It’s a mass-market AI system, and most people aren’t prompt engineers. So OpenAI optimizes for the most common use cases.
For example, if you ask ChatGPT to write code that you can copy-paste into a larger file, it will usually start with imports and a bunch of fluff that you don’t need — because it’s been aligned to write code that compiles and runs by itself, and it’s been aligned to be useful to the 90% of people who will only ever be beginners at Python. So it will do this, regardless of what you tell it to do. Put in a harsh way: in my opinion, ChatGPT is aimed at the use case of the lowest common denominator.

OpenAI started by mostly doing reinforcement learning, so it’s perhaps not surprising that they use RLHF probably more than any other AI lab making LLMs.
But this means that their models tend to follow their training data more than their prompt — and this includes few-shot examples.
ChatGPT will resist learning from your few-shot examples in the nuanced way that most models do, for the same reasons it fights certain instructions. Given that, the metagame for prompting OpenAI's models is simply to give them a comprehensive list of instructions in the system prompt. This is more cost-efficient too: OpenAI's models are pretty expensive, so low token counts are a priority.
So the upshot of all this is that many prompt engineers who started with the GPT models (like I did) probably learned to avoid few-shot examples because they were basically no more effective than lists of instructions, while costing a lot more than said lists.
After all, GPT doesn’t want to learn from your prompt — it’s closed-minded and safe, not open-minded and willing to learn new tasks.
The fundamental difference between prompting GPT and open-source
Imagine: you’ve been prompting GPT for the last four months, building an application. You’re feeling pretty confident. You think of yourself as a prompt engineer, an LLM whisperer. One day, you go out and try open-source models on one of your finest GPT prompts.
It does horribly.
“Open-source models are bad,” you say. “They don’t understand my prompts. I played with them for two hours, they’re hopeless.”
In this case, you’re being as closed-minded as GPT.
Open-source models are, for the most part, trained dramatically differently than GPT. They focus much more heavily on SFT (Supervised Fine-Tuning) and sometimes skip the alignment step entirely. The reasons are numerous: alignment training is more expensive than SFT, alignment datasets are scarcer for open-source model creators, many of those creators wanted a different experience than ChatGPT gives, etc. Point is, the models are trained differently, so of course the optimal way to use them is going to differ as well!
The most important thing to note is that open-source models are not aligned and RLHF’d to death, for the most part. Which means that they’re open-minded. Flexible. It means that few-shot prompting becomes worthy of its other name: In-Context Learning.
Open-source models are also mostly dirt-cheap to do inference with, so you can use many more tokens than you might with GPT. One of my most lucrative and powerful prompts is 20 thousand tokens (!) long, all handwritten. Most of that is few-shot examples.
What happens when you put that many tokens in a prompt?
The model’s writing style shifts. The model’s intelligence increases. The model’s consistency, even on mind-bogglingly hard tasks, quickly reaches production-level.
Don’t take my word for it. Here’s a graph from a research paper, probably my favorite one about prompt engineering. The y-axis is Llama 2 70b’s performance on a task, the x-axis is the number of few-shot examples.
Note just how large the x-axis is. Yes, that is “50” you’re looking at on the first four graphs. Model performance keeps getting better and better and better with more examples in context. Almost as important, note how zero-shot performance is often below guessing in these cases — even for a powerful 70b model. So if you have tried open-source models and thought they were stupid, but didn’t give them examples — burn this graph into your brain. Even with many examples, the open models are usually cheaper than GPT.
Oh and by the way, if you’re wondering about system prompts here:
Can Prompting Help? In §A, we further study if specific prompts, i.e. instructions that inform the LLMs of the flipped labels, can improve ICL predictions. We find that, while prompts initially can help the model predict on flipped labels, eventually, prompts no longer improve predictions.
For the most part, examples > system prompts. Spend your time accordingly.
There’s a quote I like, by Alex Hormozi, “volume negates luck.” Well, in prompting, volume of examples negates model stupidity.
Open-source prompting differs from GPT prompting because the model CAN learn. Because the tokens are cheap, and you can afford to use examples. Because no matter the task, and no matter the model size, few-shot examples are how you achieve consistency.
I should briefly note here: open models are getting better at responding to system prompts. Llama 3 70b in particular uses information from the system prompt heavily, and my prompts for that model typically reach at least a thousand tokens in the system prompt. Mistral Large is another example, though it's just kinda worse at most things, as MistralAI's instruct tuning resembles OpenAI's somewhat.
Writing few-shot examples: how many?
Now that you’re hopefully sold on why few-shot examples are probably the most critical thing you can do for prompt performance, let’s talk about how to build them. The first question you might ask: how many do I write?
Let me give you a cliché answer with a useful ending:
As many as it takes. But start with 2.
As we’ll soon learn, few-shot examples are about creating a flawless pattern that the model can complete. But you can’t have a pattern with only one data point. If I gave you a sequence:
1
And asked you to continue it, you could pick many different continuations: 2 (it’s counting up by 1), 3 (it’s counting up by 2), 0 (counting down by 1). Any might be valid. But if I give you
1, 2
Then yes, it’s probably just counting up.
The same goes for LLMs. Don’t expect them to extrapolate the nuances of your task from one example. I find that giving LLMs one example makes them overfit a bit to the specifics of the example, and can sometimes be worse than zero-shot. 2 is a good rule-of-thumb minimum for few-shot examples.
That being said, if performance is bad, you’ll want to add more. Don’t just add more examples blindly. There’s a way to ensure you’ll get a reliable prompt, very fast:
Start with a minimal system prompt and 2 handwritten common cases.
Then run the prompt (print out its outputs as it runs).
Identify where it fails.
Take one of the failures, correct the output to be what you want, and make it a few-shot example.
Making failures into examples is the fastest way I’ve found to reliably build consistent prompts. And it’s an iterative process that only works if you’re doing few-shot examples.
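Here's what that loop looks like as a minimal Python sketch. Everything in it is a stand-in (generate() represents whatever inference call you use, and the prompt format is just one option), but it captures the workflow: run, print, spot failures, hand-correct them, and promote them into new examples.

# Minimal sketch of the iterate-on-failures loop. generate() is a stand-in
# for whatever inference you actually use (llama.cpp, vLLM, an API client).

SYSTEM_PROMPT = "Classify the sentiment of the text. Reason first, then answer."

few_shot_examples = [
    {"input": "I loved this.", "output": "Clearly approving. The sentiment is POSITIVE."},
    {"input": "Waste of money.", "output": "Clearly disapproving. The sentiment is NEGATIVE."},
]

def build_prompt(system_prompt, examples, new_input):
    # Examples come after the system prompt as complete input/output pairs,
    # so the model has a full pattern to continue for the new input.
    parts = [system_prompt]
    for ex in examples:
        parts.append(f"Input:\n{ex['input']}\n\nOutput:\n{ex['output']}")
    parts.append(f"Input:\n{new_input}\n\nOutput:")
    return "\n\n".join(parts)

def generate(prompt):
    return "(model output goes here)"  # stand-in: swap in your model call

for test_input in ["Meh, it was fine I guess.", "Broke after a day."]:
    output = generate(build_prompt(SYSTEM_PROMPT, few_shot_examples, test_input))
    print(output)  # print outputs as you run so failures are easy to spot
    # When you spot a failure, hand-correct the output and promote the pair:
    # few_shot_examples.append({"input": test_input, "output": corrected_output})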
Writing few-shot examples: what order?
Common cases first, then edge cases and errors.
Models see things closer to the end of the prompt better. Most of the pain with prompts is not mediocre performance under easy circumstances, but catastrophically bad performance under difficult circumstances. So show how to handle the most common cases first, higher up, so the model has something to lean on for those; place errors and tasks that require greater discernment near the end, where the model sees them best.
As an example, for my validation prompts (meant to detect catastrophic model failures that happen rarely), the order for me is usually:
System prompt
Positive example (what I expect to happen most of the time)
Negative example 1
Negative example 2 (converted from a common failure mode)
The negative examples are rarer in the actual inputs, but the model struggles more with missing negatives than falsely labelling positives as negatives. So I place those two last.
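In chat terms, that layout looks something like the sketch below (Python; the message format and the placeholder strings are illustrative, not tied to any particular library):

messages = [
    {"role": "system", "content": "Check the text for catastrophic errors. Reason, then answer VALID or INVALID."},
    # Common case first, higher up, so the model has something to lean on:
    {"role": "user", "content": "<typical, correct text>"},
    {"role": "assistant", "content": "<reasoning>... The text is VALID."},
    # Rarer, harder cases last, where the model sees them best:
    {"role": "user", "content": "<subtly broken text>"},
    {"role": "assistant", "content": "<reasoning>... The text is INVALID."},
    {"role": "user", "content": "<text reproducing a failure mode you actually hit>"},
    {"role": "assistant", "content": "<reasoning>... The text is INVALID."},
    # Finally, the real input to validate:
    {"role": "user", "content": "<new text to check>"},
]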
Also, when it comes to ordering, in generation-style tasks, place the examples you’re more proud of last. So for instance, if you have a story-writing prompt, and you have two few-shot examples — one of which you know is great, another of which you generated with AI and then edited a bunch but you still aren’t sure if it’s good or not — then put the one you know is great second. I dramatically improved the quality of a project by doing this once.
Intermission: consider subscribing?
If you're not already subscribed to Prompting Weekly, please consider doing so! I write this for free, and your support means a lot to me.
And if you are subscribed, let me know whether you appreciate this longer, more-detailed type of post by sharing!
Fun fact, we recently became the top search result for “Prompting Weekly”! Despite my poor SEO skills.
Back to AI.
Writing few-shot examples: create the perfect pattern
All LLMs start out as autocomplete. With just SFT being applied, the LLM is still very close to autocomplete, and it turns out that’s not a bad thing — with enough parameters, autocomplete can be pretty damn smart! But this does require a shift of thinking when prompting open-source models:
you’re not giving them instructions;
you’re not talking to them;
you’re creating a pattern that they can follow.
This pattern needs to be flawless. Not perfect, but flawless. The difference is that things you don’t add probably won’t bite you too hard, but if your pattern contains aspects that run against your desired output, then the LLM will correctly complete… the wrong pattern. Even if you can’t fit everything good into your pattern, you shouldn’t leave anything bad in there. That’s why I chose the word “flawless.” No flaws!
Because you’re creating a pattern, have a consistent structure across your examples that only differs based on the input. Don’t make every part of every example different; they should only change where the input does:
(note the handful of places where the example outputs differ)
System prompt:
Determine if the first paragraph contradicts the second. Reason first and then answer. Write CONTRADICTORY if they contradict and CONSISTENT if they do not.
Example input 1:
Para 1
"""
Swans are all white.
"""
Para 2
"""
The existence of black swans led to an interesting theory about psychological biases.
"""
Example output:
Analysis of Paragraphs:
The first paragraph asserts that all swans are white.
The second paragraph asserts that black swans exist.
Determination of contradiction: these paragraphs are CONTRADICTORY.
Example input 2:
Para 1
"""
The year is 1984, and 2+2 is 5.
"""
Para 2
"""
It is 2024.
"""
Example output:
Analysis of Paragraphs:
The first paragraph asserts that the year is 1984.
The second paragraph asserts that the year is 2024.
Determination of contradiction: these paragraphs are CONTRADICTORY.
See how the above only differs in a few places in the output? Having the rest be the same lets the model follow the same pattern, so it can maximally learn from past examples. It will rarely diverge from the structure, given examples like these. Note that in the above case, I'd probably add a third example, right at the start, that is CONSISTENT: the "common case" I don't expect the model to have much trouble with, but which must be included lest the model start labelling everything CONTRADICTORY.
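For illustration, that third CONSISTENT example might look like this (an invented pair, following the same pattern):

Example input:
Para 1
"""
The cafe opens at 8 in the morning.
"""
Para 2
"""
The cafe serves espresso and pastries.
"""
Example output:
Analysis of Paragraphs:
The first paragraph asserts that the cafe opens at 8 in the morning.
The second paragraph asserts that the cafe serves espresso and pastries.
Determination of contradiction: these paragraphs are CONSISTENT.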
Also an important corollary of few-shot example writing being a game of creating a pattern and having the LLM follow it: do not write your examples in your system prompt.
You are an addition AI. You add two numbers.
Here are some examples:
1+1 = 2
2+2 = 4
Now go and add numbers!
That’s basically a zero-shot prompt. It’s trying to do examples but it’s not giving the AI anything it can autocomplete. It’s suboptimal. This is also what DSPy does, which is part of why I largely dislike it — if you’re going to criticize prompting, at least do it right! Anyway.
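To make the contrast concrete, here's the same addition task sketched in chat format (the message structure is illustrative). The first version buries the examples in the system prompt; the second gives the model real turns to complete:

# Suboptimal: examples described inside the system prompt.
bad_messages = [
    {"role": "system", "content": "You are an addition AI. Examples: 1+1 = 2, 2+2 = 4. Now go and add numbers!"},
    {"role": "user", "content": "3+3"},
]

# Better: each example is an actual user/assistant turn, so the model is
# completing a pattern it has already seen play out twice.
good_messages = [
    {"role": "system", "content": "You are an addition AI. You add two numbers."},
    {"role": "user", "content": "1+1"},
    {"role": "assistant", "content": "1+1 = 2"},
    {"role": "user", "content": "2+2"},
    {"role": "assistant", "content": "2+2 = 4"},
    {"role": "user", "content": "3+3"},
]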
Speaking about what NOT to do…
Writing few-shot examples: pitfalls (small output, answer at start, no signpost, AI writing for you, starting off big rather than small and iterating)
I’ll be concise here.
Prompting commandments (not exhaustive):
Thou shalt not have an output much smaller than the input.
If thou art writing a classifier, thou shalt make the class larger: write it in ALL CAPS to take up more tokens, and add a preamble like "the class is…". Otherwise the AI can't see the outputs easily, and you run the risk of having your classifier just continue the input. (There's a sketch of this at the end of this list.)
Thou shalt not have the AI write few-shot examples for you. Handwrite them.
Even when fixing failures as part of your iteration you should mostly strip out the AI’s responses. Very little good comes from having the AI see AI-written stuff in its input.
Thou shalt not spend too much time on the system prompt (or even examples) before running the prompt and starting iterating.
Prompts that are working already are scary to change and delete stuff from. This means they get bloated, fast. Start small so that you don’t end up with something truly massive because you wrote 10k tokens before testing and now don’t know what parts of that are critical to the prompt working properly.
Thou shalt ensure that you yourself can solve the task, before expecting an AI to do it. If it’s tricky, make the task simpler.
What, you expect the 7b parameters that can run on your phone to do something you, a human, can't?!
If thou art making an LLM write a bunch of stuff to justify its answer, have it justify first, before providing the answer.
The moment an LLM says "yes" or "no", the rest of its words will just prop up whatever snap judgement it made at the start, no matter how right or wrong it is. LLMs do not (yet) change their mind within a single response. So in your examples, have the LLM refrain from giving its final answer for as long as possible.
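To tie the classifier commandments together, here's what an example output obeying them might look like (an invented review-sentiment task, not from any real prompt of mine):

Example input:
"""
<a long product review, many hundreds of tokens>
"""
Example output:
Analysis of Review: The review praises the battery life and the screen at length, and its one complaint, the price, is framed as acceptable. The overall tone is approving.
Determination of sentiment: this review is POSITIVE.

The justification comes first, the final answer comes last, and the class is padded with a preamble and written in ALL CAPS so the model can see it and your code can parse it.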
This is far from exhaustive. There is a lot you can mess up while prompting. But it’s a good start, I think.
Summary
Few-shot examples are the single most powerful technique you have for controlling open-source LLMs. Writing few-shot examples is about iteratively creating a flawless pattern that the AI can complete. Focus on edge cases and place the best/most important stuff at the end. Do this, and you unlock the power of open-source AI: malleable, open-minded, powerful, cheap to use, private.
GPT isn’t the only option. It’s not even better. You just need to use open models the right way.
Happy building!
I hope you enjoyed this longer post format, and my posting on the weekend instead of on Monday. The hope is that it’s easier to read at this time of week. Let me know what you think of the format!
If you want to see some of these prompting techniques in action, you can take a look at the prompts folder of Augmentoolkit. Question generation, question/answer validation, and multi_turn_assistant_conversation in particular are noteworthy. There's also a new open-source project I'm working on for the Verus community, provisionally called Verustoolkit, which has some more modern prompts. You can check out its prompts folder too.
Finally, if you want more posts like this and found this helpful, consider subscribing and sharing!
Alright that’s all for this week. Thanks for reading, have a good one and I’ll see you next time!