Compress the Input: Dealing with long contexts when generating data and training models
Brute force is costly. Try to avoid it with this approach.
Few things can make or break an LLM for a task like context length does. Some work just takes a lot of tokens, and forces you to use models or methods you normally wouldn’t. This can be costly in a literal sense: with most API providers charging by the token, extremely long prompts get extremely expensive.
However, when you’re training or finetuning models for tasks that require a ton of tokens…
It’s even worse.
If you’re making a domain expert model for a task that requires tens of thousands of tokens, and are trying not to lose hope as you see multiple H100s OOM again and again and again even when training small models, this is the post for you. Training LLMs at long context lengths gets obscenely memory intensive, so the point of this post is: if possible, make high-context tasks into low-context ones through clever tricks.
Let me explain.
Say you’re using LLMs for a bulk task:
Make some important, nuanced decision, based on the following very large table of numbers:
value a | value b | value c
----------------------------
1 | 2 | 3
13 | 777 | 777
104 | 85 | 96
0 | 337 | 187
1 | 33 | 7
800 | 813 | 5
... 20,000 more rows follow ...
Maybe you truly do want an LLM, which has a semblance of wisdom baked into its weights, handling a decision instead of some inflexible algorithm. That makes some sense. Perhaps you’re also finetuning a domain expert to make decisions like this, based on all the data. However, 20,000 rows with multiple variables is a lot of tokens, and this costs you money at every stage: datagen (it takes larger prompts to generate the data), training (you have to rent larger GPUs, and more of them, to actually fit the context length), and inference (more tokens means more cost). All the while, despite your domain expert finetuning, the LLM is seemingly struggling to remain coherent at such a high context. How do you overcome this?
Well, if the high context is the problem, all you need to do is describe the same thing with fewer tokens. In this case we have a bunch of numbers — tabular data. We want the LLM to use the general information baked into this data to make some decision. If only there were some way we could concisely describe the overall properties of large groups of values…
Oh wait, we have the entire field of statistics.
Make some important, nuanced decision, based on the following information about some very large table of numbers:
Value A mean: 12
Value A std: 3
Value A skewness: 2.76
Value B mean: -2
Value B std: 0.5
Value B skewness: 1.2
Value C mean: 2
Value C std: 15
Value C skewness: 3
... nothing else follows, that's it ...
By keeping our goal in mind — having the LLM make a decision based on the overall properties of a group of numbers — we can avoid tunnel vision on how to get the LLM working with 20,000+ rows of data, and save a huge amount of money on datagen, training, and inference… by turning our high-context task into a low-context one.
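To make that concrete, here’s a minimal sketch of the compression step in Python, assuming the table lives in a pandas DataFrame; the column names and the choice of statistics are hypothetical, and you’d pick whatever summary actually captures what the decision depends on.

import pandas as pd
from scipy.stats import skew

def summarize_table(df: pd.DataFrame) -> str:
    # Replace tens of thousands of raw rows with a few summary statistics per column.
    lines = []
    for col in df.columns:
        series = df[col]
        lines.append(f"{col} mean: {series.mean():.2f}")
        lines.append(f"{col} std: {series.std():.2f}")
        lines.append(f"{col} skewness: {skew(series):.2f}")
    return "\n".join(lines)

# Hypothetical data standing in for the 20,000-row table above.
df = pd.DataFrame({
    "value_a": [1, 13, 104, 0, 1, 800],
    "value_b": [2, 777, 85, 337, 33, 813],
    "value_c": [3, 777, 96, 187, 7, 5],
})
prompt = (
    "Make some important, nuanced decision, based on the following "
    "information about some very large table of numbers:\n"
    + summarize_table(df)
)
# The prompt is now a handful of lines instead of 20,000+ rows.

The same summary string works for datagen, for the training examples themselves, and at inference time, so the savings compound across the whole pipeline.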
This is not limited to numerical tasks. The broader lesson is to use clever hacks to reduce the scope of large tasks. Repeated words can be represented as the word times the number of occurrences, rather than listing every instance. Cutting out the middle of long text can work well for classification. And, of course, statistics works on text as well.
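Here’s a rough sketch of two of those tricks, assuming plain-text input; the helper names and the character budget are hypothetical illustrations, not recommendations.

from collections import Counter

def compress_repetitions(text: str, top_n: int = 20) -> str:
    # Represent repeated words as "word xN" instead of listing every occurrence.
    counts = Counter(text.split())
    return ", ".join(f"{word} x{count}" for word, count in counts.most_common(top_n))

def cut_middle(text: str, keep_chars: int = 2000) -> str:
    # Keep the head and tail of a long document; for many classification
    # tasks the middle carries the least signal.
    if len(text) <= 2 * keep_chars:
        return text
    return text[:keep_chars] + "\n[... middle omitted ...]\n" + text[-keep_chars:]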
It’s the ends that matter, not the means. I say this not in a ruthless way, but rather to emphasize: don’t get so caught up in the first “means” you thought of that you miss an easier way to achieve your ends.
Smaller inputs are great. Models behave nicer with smaller inputs, they’re far faster and cheaper to run, and they’re cheaper and faster to train, too. If you’re making a domain expert, it’d be unfortunate if it were still expensive to run in the end. So find clever and hacky ways to reduce the size of the input you use.
Finally, the statistics class you took in university ends up being useful.
Hm. This subject ended up being shorter to write about than I thought it would be. I guess it’s ironic that the article about compressing the input ended up quite compressed itself! Two wins for concision today.
That’s all for this week. Look, I posted two weeks in a row again, I’m practically on a roll. Hope you found it useful! This one’s more for niche projects but it can seriously save your skin if you’re working with massive inputs and are tearing your hair out.
Have a good one and I’ll see you next time!