Data Principle: Dataset "Tightness" and the Instruction Following vs Creativity Balance
Control the detail of the data, control the fundamental ability of your LLM.
Everyone wants fewer hallucinations from their LLM. Most people also want their LLM to be creative and able to write well. If you’re creating LLMs as a model trainer, you will want to know how to improve both of these qualities — and how to compromise between them when necessary. This comes down to what I refer to as “data tightness,” and that’s what we’re going to learn about in this post.
Data tightness is one of those terms I’ve coined that’s all very intuitive and whatnot if you’re me, but probably sounds utterly alien if you’re not. So we’re going to start by defining it. I define data tightness as “the level of detail with which the inputs in your training data describe the outputs.” For example, take this prompt:
Write a paragraph about pineapple on pizza.
^ With very few details, and a short length relative to what might be a pretty long essay, this is “loose” data. I.e., not tight.
Compare this to (warning, long, feel free to skim):
Write an essay with a unique theme focusing on how pineapple on pizza reflects broader cultural trends and debates. Here's a detailed outline:
Title: "The Great Divide: Pineapple on Pizza as a Microcosm of Cultural Evolution and Conflict"
I. Introduction
A. Hook: Provocative statement about pineapple on pizza controversy
B. Brief history of pizza and pineapple as a topping
C. Thesis: Pineapple on pizza represents a microcosm of cultural evolution, globalization, and the tension between tradition and innovation
II. The Origins of Pineapple on Pizza
A. Invention of "Hawaiian" pizza
1. Sam Panopoulos and the Satellite Restaurant in Canada
2. Cultural context of 1960s culinary experimentation
B. Spread and popularity
1. North American adoption
2. Global expansion
III. The Great Debate: Arguments For and Against
A. Pro-pineapple arguments
1. Flavor profile: sweet and savory combination
2. Nutritional benefits
3. Culinary creativity and freedom
B. Anti-pineapple arguments
1. Violation of traditional Italian cuisine
2. Texture concerns
3. Cultural appropriation claims
IV. Cultural Significance
A. Pineapple pizza as a symbol of globalization
1. Fusion of cuisines and ingredients
2. Breaking down culinary borders
B. Generational divide
1. Younger generations' openness to culinary experimentation
2. Older generations' attachment to traditional recipes
C. Internet meme culture
1. Social media debates and hashtag wars
2. Celebrity opinions and their impact
V. Psychological and Sociological Perspectives
A. Food neophobia vs. neophilia
B. In-group/out-group dynamics in food preferences
C. The role of cultural identity in food choices
VI. Economic Impact
A. Pineapple pizza's effect on the pizza industry
B. Tourism and cultural exchange through cuisine
C. Marketing and branding strategies around the controversy
VII. Case Studies
A. Iceland's pineapple pizza ban controversy
B. Gordon Ramsay's public stance against pineapple on pizza
C. Pizza Hut's global menu variations
VIII. The Future of Pineapple on Pizza
A. Emerging pizza trends and toppings
B. Potential for new fruit toppings
C. The role of plant-based and alternative ingredients
IX. Broader Implications
A. Food as a battlefield for cultural identity
B. The impact of globalization on traditional cuisines
C. The role of controversy in culinary innovation
X. Conclusion
A. Recap of main points
B. Reflection on pineapple pizza as a mirror of societal changes
C. Call to action: Embracing culinary diversity while respecting traditions

Write with a formal and scholarly tone.
^ With a lot of detail, explicit structural and stylistic instructions, and a full outline, this is very “tight” data. Also, I hope you didn’t take too long reading all that.
So anyway, pizza aside, say that you have a dataset with a lot of essays as the outputs, and you’re choosing between having inputs like the “loose” prompt or inputs like the “tight” prompt. What’s the difference? This comes back to our tradeoff between creativity and instruction following.
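To make the difference concrete in data terms, here’s a minimal sketch of the same essay as a loose pair and as a tight pair, using the common {"instruction", "output"} JSONL shape. Your trainer’s format may differ, and the strings here are stand-ins for the full prompts and essay above:

```python
import json

# A minimal sketch of a loose pair vs. a tight pair. The
# {"instruction": ..., "output": ...} shape is just one common
# fine-tuning format; swap in whatever your trainer expects.

essay = "The debate over pineapple on pizza..."  # stands in for the full essay

loose_pair = {
    "instruction": "Write a paragraph about pineapple on pizza.",
    "output": essay,
}

tight_pair = {
    "instruction": (
        "Write an essay with a unique theme focusing on how pineapple "
        "on pizza reflects broader cultural trends and debates. "
        "Here's a detailed outline: ..."  # full outline, tone, structure notes
    ),
    "output": essay,
}

# Note the output is identical in both cases -- tightness lives
# entirely in the input.
with open("dataset.jsonl", "w") as f:
    for pair in (loose_pair, tight_pair):
        f.write(json.dumps(pair) + "\n")
```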
Tight data trains a model to follow instructions better, and makes it more robust. If you’ve created 10,000 detailed plans, and 10,000 examples where an LLM follows them to the letter, your LLM will get pretty good at following plans to the letter.
On the other hand, if you have 10,000 short prompts, followed by 10,000 very long and detailed essays, the LLM will be able to invent detailed essays on the spot with ease. But if you give it a detailed plan, it will also almost certainly add in details that were not present in your plan. Because throughout every single one of your 10,000 short prompts, you were teaching it “please invent stuff related to whatever prompt I give you.” It won’t unlearn that just because you suddenly start giving it extensive bulleted lists of stuff to write come inference time. Sure, it will probably include most of the things you tell it to, but there’ll also be a lot more stuff added in. Not good, if you want a precise expert model.
So we have something of a tradeoff, and an accompanying decision. How tight or loose do we want our data to be? How reliable or creative do we want our model to be?
And we also have a framework for identifying problems with our LLMs.
If we give an AI long and detailed instructions, and it comes back with outputs smaller and blander than those instructions call for, our data was likely too tight. This happened to me a while ago when I was trying to make a product for people to create AIs that would write Tweets in their voice: the instructions in the data were longer and more detailed than the Tweets (the outputs) themselves. As a result, any LLM trained on the data became a machine for taking detailed, grammatically correct text and turning it into short, Twitter-level text. Absolute brainrot. Beyond saving. Until I loosened the data a bit. Instead of “Write a two-sentence Tweet decrying pineapple on pizza, saying that the prime minister of Iceland should have gone ahead and banned it. Write with exaggerated anger and end with a swear.”, the instruction might be “Write a short Tweet about pizza.”
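That loosening can be done mechanically: run your existing instructions back through an LLM and strip the detail. A hypothetical sketch, where `llm` is a stand-in for whatever completion call you use (an API client, a local model), not a real library function:

```python
# Hypothetical sketch: "loosening" over-detailed instructions with an LLM.
# `llm` is a placeholder for your own completion function -- it takes a
# prompt string and returns a completion string.

LOOSEN_PROMPT = (
    "Rewrite the following writing instruction so it names only the "
    "format and broad topic, dropping all structural and stylistic "
    "detail. Reply with the rewritten instruction and nothing else.\n\n"
    "Instruction: {instruction}"
)

def loosen(instruction: str, llm) -> str:
    """Strip an over-detailed instruction down to a loose prompt."""
    return llm(LOOSEN_PROMPT.format(instruction=instruction))

# The over-specified Tweet instruction above would come back as
# something like: "Write a short Tweet about pizza."
```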
Likewise, if your LLM goes off the rails that your instructions lay down, and just. won’t. listen., try tightening the data. Revise your data generation prompts to produce more detailed descriptions of your outputs. The LLM that results may be far smarter than before. This is tricky because data looseness issues can masquerade as hyperparameter issues: you might convince yourself that with the right number of epochs, dropout, learning rate, etc., your model will actually learn the data very well. But it might be that your model is struggling to learn the wrong thing.
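Tightening can run in the opposite direction: derive a detailed instruction from each output you already have, so the input fully describes what follows. Another hypothetical sketch, with `llm` as the same placeholder completion call as above:

```python
# Hypothetical sketch: "tightening" by backing a detailed instruction
# out of each existing output. `llm` is the same placeholder as before.

TIGHTEN_PROMPT = (
    "Read the text below and write the instruction that would produce "
    "it: include a full outline of its sections, plus explicit notes on "
    "tone, length, and structure. Reply with the instruction only.\n\n"
    "Text: {output}"
)

def tighten(pair: dict, llm) -> dict:
    """Replace a loose instruction with one backed out of the output."""
    return {
        "instruction": llm(TIGHTEN_PROMPT.format(output=pair["output"])),
        "output": pair["output"],
    }
```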
The order in which you should work on new projects is: data (generation pipeline) → hyperparameters (training a model) → prompting (inference) → sampling parameters (inference). The order in which you should troubleshoot is the reverse, because the steps get harder and more time-consuming the closer you get to the data (sampling parameters may be an exception). But troubleshooting is tricky: if your model is failing at inference, it could be a prompt issue, a sampler issue, a hyperparameter issue, or a data issue. If the model has a weird loss graph, it could be a hyperparameter issue or a data issue, etc. At any link in the chain, the problem could come from any link before it. But the most fundamental thing is data, so it’s worth the time to get that right.
Right now I’m favoring tighter data. Smart models are more useful. And models trained on tighter data can be steered further with prompting. LLMs are almost always capable of doing something right once or twice, if you regenerate enough; the trick is getting them reliable enough to be useful. Tightness is key there. Even for creative tasks, like RP, the LLM has to “tightly” follow character cards and scenarios, and to play off of what the user says. Plus, I don’t think the drawbacks to creativity really come out in force until you get REALLY tight data.
Fundamentally, so many model trainers look only at their outputs when making their data. But the output is only half of the question: “Produce this output, given this input.” If your model is misbehaving, try checking what its prompts look like in training.
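One crude check: compare instruction length to output length across your dataset. Character counts are a blunt proxy and the interpretation below is a rule of thumb, not a canonical threshold, but a check like this would have flagged my Tweet disaster immediately:

```python
import json

def audit_tightness(path: str) -> float:
    """Mean instruction/output length ratio over a JSONL dataset.

    A rough heuristic: ratios well above 1.0 (inputs dwarfing outputs,
    like the Tweet dataset) hint at data that's too tight for the task;
    tiny ratios against very long outputs hint at data that's too loose.
    """
    ratios = []
    with open(path) as f:
        for line in f:
            pair = json.loads(line)
            ratios.append(len(pair["instruction"]) / max(len(pair["output"]), 1))
    if not ratios:
        raise ValueError("empty dataset")
    mean = sum(ratios) / len(ratios)
    print(f"mean instruction/output length ratio: {mean:.2f}")
    return mean

# audit_tightness("dataset.jsonl")
```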
So basically: data tightness is the detail with which the inputs describe the outputs during training; tighter data leads to more robust, but potentially less creative, models; loosen data if the model is terse and stupid, tighten it if it’s rambling and stupid (really, if your model is stupid, you did something wrong somewhere); make new LLMs in the order of data → training → prompts → sampler settings; personally I think tighter data is better because robustness is a more painful problem than creativity. I hope that this helps you create better models, more reliably!
As I wrap up, let me briefly apologize for suddenly vanishing. Augmentoolkit’s thriving, and that’s led to a lot of awesome client projects for me to juggle, so I’ve been up against the wall for a few weeks. Combine that with a lack of energy and a perceived lack of stuff to write about (NDAs everywhere!) and I rarely found myself inspired on Sundays. I’ve settled into a rhythm with my regular client work, though, and am getting better at managing my time, so hopefully I’ll be able to resume regular-ish posting.
I would change the name “Prompting Weekly” but SEO and reputational stuff means my hands are tied here. Plus, “Prompting Biweekly” or “Prompting OnceInAWhile” both sound worse. The name’s an ideal, I’ll see if I can meet it.
Anyway, that’s all for this week. Thanks for reading, have a good one and I’ll see you next time!