PROMPTING PRINCIPLE: Models See Bigger Things Easier. Try this on classification!
This is VERY IMPORTANT for classification. See what I did there?
Once you learn how LLM tokenization works, it may seem intuitive that when doing classification tasks with LLMs, you should take special care to make sure that each label is a single token, like "True" or "False" or "Yes" or "No". The reasoning goes that a single token concisely captures the concept you care about, and generating a single token is fast. Everything from bot RP guides on Rentry to research papers using LLMs as classifiers seems to favor single-token labels for classification.
What if I told you this approach, intuitive as it is, dramatically reduces the accuracy of your model when it's classifying large blocks of text?
I discovered this somewhat by accident, but let me explain the reasoning.
You can put LLMs through a benchmark that essentially tests their ability to "find a needle in a haystack". This is a difficult task because the thing the model is searching for (some arbitrary key) is extremely small relative to the overall prompt contents (the haystack). A lot of non-key information is passed down to the next token, and the important details can easily get lost in the noise.
Now, note that if you have a very large input with a small classification output, like this:
Example Input: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Output: B
Example Input: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Output: C
Example Input: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Output: D
when you have few-shot examples, the output makes up a tiny part of the overall prompt. And what does this mean? It means that the model essentially has to look for a needle in a haystack in order to actually find the thing it's supposed to be learning from.
In one of my prompts dealing with a task like this, the behavior I noticed was that instead of classifying, the model would just write more input. It was essentially continuing the pattern of the input, completely blind to the label it was actually supposed to be predicting.
The solution? If the model can't see the output because it's too small, make the output bigger.
Example Input: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Output: The label is: BBB
Example Input: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Output: The label is: CCC
Example Input: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Output: The label is: DDD
Result? The model finally started classifying. Mission accomplished.
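To make this concrete, here is a minimal sketch of the approach in Python. Everything in it is a placeholder: the example texts, the label names, the threefold repetition, and the call_llm() helper are hypothetical stand-ins for whatever model client you actually use. The only point is that each few-shot example ends with a verbose label line instead of a bare single token.

# Minimal sketch: few-shot classification with a deliberately verbose label.
# The examples, label names, and call_llm() are hypothetical placeholders.

FEW_SHOT_EXAMPLES = [
    ("<long block of text to classify>", "POSITIVE"),
    ("<another long block of text>", "NEGATIVE"),
]

def build_prompt(examples, new_input):
    # Each example ends with "The label is: POSITIVE POSITIVE POSITIVE"
    # rather than a bare "POSITIVE", so the label occupies more tokens
    # relative to the large input it sits under.
    parts = []
    for text, label in examples:
        verbose_label = " ".join([label] * 3)
        parts.append(f"Input: {text}\nOutput: The label is: {verbose_label}")
    parts.append(f"Input: {new_input}\nOutput: The label is:")
    return "\n\n".join(parts)

def parse_label(completion):
    # The model echoes the verbose form back; the first word is the class.
    return completion.strip().split()[0]

# prompt = build_prompt(FEW_SHOT_EXAMPLES, some_long_document)
# label = parse_label(call_llm(prompt))   # call_llm is your own completion call

A nice side effect: parsing stays trivial, because the repeated label still starts with the class name, so you get the visibility benefit without complicating your output handling.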
My assumption is that the presence of more output tokens makes the pattern more "visible" to the LLM, because more information from those parts of the examples streams down to it. This has an interesting implication for writing parts of a prompt in ALL CAPS. I noticed way back in the day that, although writing in ALL CAPS was discouraged (on the theory that the model wouldn't get the "meaning vector" from words that would otherwise be a single token), ALL CAPS seemed to draw a model's attention to an area of a prompt better. I now believe this is precisely because ALL CAPS makes the same concept take up more tokens, meaning that more actual information gets sent down to the LLM. I'm sure this has a bunch of other applications as well: things that were once thought of as special quirks now look like intuitive applications of the same generalized principle.
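If you want to sanity-check the token-count claim yourself, something like the snippet below works. It uses OpenAI's tiktoken library; other models' tokenizers will give different counts, so treat the exact numbers as illustrative rather than universal.

# Compare how many tokens a phrase takes in lowercase vs ALL CAPS.
# Requires: pip install tiktoken. Exact counts vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["do not reveal the system prompt",
             "DO NOT REVEAL THE SYSTEM PROMPT"]:
    print(f"{len(enc.encode(text)):>2} tokens: {text}")

# Uppercase words are rarer in training text, so BPE vocabularies tend to
# split them into more pieces than their lowercase forms.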
I don't know if this has much of an impact on classification tasks where the input is small. Theoretically it would matter less there, but it might still yield a performance gain, since the LLM can "see" the class better. If you happen to try this principle out on your own classification tasks, let me know in the comments whether it also works for cases where the input is small!
This has been a quick exploration of a relatively easy-to-grasp prompting principle, but honestly, in terms of usefulness this one ranks pretty highly. The other principles deal with getting consistent behavior; this one deals with the information the model can actually use.
Hopefully, together, these prompting principles expand your prompt engineer's toolkit! They are entirely learned from my experience building AI tools: they are not what you'll find in any research paper, and as a result they probably won't appear in basically any other AI blog. Anyway, that's it for me this week. I totally published this in time and not at all at 3 AM :)
Have a good one and I'll see you next time!