How to Make a Bad Language Model and Use it Poorly
There's plenty you can do wrong when using AI.
AI requires attention to detail, because one stray sentence can easily wreck a ten-thousand-token prompt. In Augmentoolkit, my open-source project, a single joke in a prompt caused a consistent error that wasted probably more than ten hours. And when hunting for mistakes, the esoteric nature of AI sometimes makes it difficult to even know what to look for.
This post, a broad summary of many ideas, exists to help you fix things you might not have known were mistakes. We’re going to cover a large number of prompting and training practices that you should follow, if you like bad results. And which you should avoid if, for some reason, you don’t.
I can’t go in-depth on each point because otherwise this post would be a novella. Everything here is based on my own experience.
Let’s begin!
How to Make a Bad Language Model and Use it Poorly:
To make a bad language model, be sure to do the following:
Every training example has an input (the context) and an output (the part the model is trained on). Don’t explain the task, and the desired output, thoroughly enough in the input.
Or go the other way and explain the task too thoroughly in the input, such that the model becomes an uncreative parrot.
Don’t use code or LLMs to validate or filter your data in some way. Garbage in, garbage out. (A sketch of what such filtering can look like follows this list.)
Use GPT-4 to generate the data. (You’ll go bankrupt before scaling laws work in your favor.)
Use GPT-3.5 to generate data. (Open models are flat-out better.)
Don’t include any generic off-the-shelf assistant data (it’s important for retaining overall performance and smarts).
Don’t pull some off-the-shelf hyperparameters for your first few training runs; just guess them yourself.
Train on a bad base model. No, Llama 2 is not great anymore.
Train on top of a merged model (a model created with weight merging) and then don’t merge the original merge’s weights back into your finetune at 33% weighting. (A sketch of that merge also follows this list.)
Try to add knowledge via pure synthetic data (data generated from prompts alone, with no input text; this can only narrow a model’s style, not teach it new facts).
Don’t iterate. Give up after the first failed finetune.
Finetune a LoRA instead of using GaLore when trying to add more than style (such as factual knowledge) to a model.
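For contrast, here’s roughly what the data validation the list above tells you to skip can look like. This is a minimal sketch: the field names ("input"/"output"), the length threshold, and the file paths are all hypothetical, but even dumb checks like these catch a lot of garbage:

```python
import json

seen_outputs: set[str] = set()

def looks_like_garbage(example: dict) -> bool:
    """Cheap heuristics that catch common synthetic-data failure modes."""
    out = example.get("output", "")
    if len(out) < 50:                             # truncated or empty generation
        return True
    if "as an ai language model" in out.lower():  # refusal / assistant boilerplate
        return True
    if out in seen_outputs:                       # exact duplicate
        return True
    return False

with open("raw_data.jsonl") as f_in, open("clean_data.jsonl", "w") as f_out:
    for line in f_in:
        example = json.loads(line)
        if not looks_like_garbage(example):
            seen_outputs.add(example["output"])
            f_out.write(json.dumps(example) + "\n")
```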
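And since the merge-back trick a few points up is easy to get wrong in words, here’s what it might look like as plain linear interpolation of checkpoints in PyTorch. A sketch under assumptions: both checkpoints are plain state dicts with identical parameter names, and the paths are placeholders. In practice a tool like mergekit handles this for you, but the underlying operation is just this:

```python
import torch

ALPHA = 0.33  # weighting for the original merged base (the "33%" above)

finetune = torch.load("my-finetune/pytorch_model.bin", map_location="cpu")
base = torch.load("original-merge/pytorch_model.bin", map_location="cpu")

# Linearly interpolate every parameter: 67% finetune, 33% original merge.
merged = {
    name: ((1 - ALPHA) * tensor.float() + ALPHA * base[name].float()).to(tensor.dtype)
    for name, tensor in finetune.items()
}
torch.save(merged, "merged-out/pytorch_model.bin")
```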
How to Use an LLM Poorly:
To use an LLM poorly, be sure to do the following:
Don’t add any “trigger words” from your training to your inference-time prompt. Trigger words are useful for activating the latent space associated with your training.
Use a prompt from your training data verbatim at inference time. This can easily result in hallucinations.
Don’t give it few-shot examples. Open models are not like GPT: they’re often not overtrained, so they can learn from their prompts. In fact, they often HAVE TO.
Don’t repeat yourself. LLMs pick up on an instruction repeated multiple times better than one stated only once, so if you want to use a model badly, never repeat key information.
Make every few-shot example an easy success case. Models don’t need to be shown how to do stuff that’s easy; examples exist to guard against failure on the hard stuff, and models can extrapolate the easy cases from hard examples. (The prompt-assembly sketch after this list shows hard-case examples in action.)
Treat every model the same. Each open-source model has its own quirks, faults, and strengths. Mistral Large prompts need modification before they work well with Command R+: their inherently preferred output formats differ, their writing styles differ, and some things (e.g., response length) need no reinforcement with one model but absolutely have to be mentioned many times with the other.
Don’t ground it. Finetuning can help a model understand something broadly (if you’re getting awful performance with RAG, finetuning is a good place to go), but the model still needs the source material in its prompt as a reminder, so that it’s taking an open-book test, so to speak.
Invest your time evenly across all the prompts in your project. Very likely there’s one seemingly simple but critical problem that keeps consistently appearing. The rational move is to spend a lot of your time, perhaps entire days, solving that single simple problem for good. But if you want to do things poorly, feel guilty about “wasting time” and declare it solved too early, compromising your application.
Write the prompt once and never do an edit pass.
Get GPT to write the few-shot examples for your prompt.
Get GPT to write the system prompt.
Include fewer than two few-shot examples.
Don’t include any code that checks for common mistakes in your LLM’s outputs. (A sketch of such a check follows this list.)
Describe what you want in a massive block of unstructured text, rather than in a list.
Succumb to the self-criticism that always peaks right before you finish a prompt, and give up; prompting is something where you put in a ton of effort and don’t see ANY results until the very end.
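To make the few-shot and repetition points above concrete, here’s a rough sketch of a prompt builder that does the opposite of that bad advice: hard-case examples, and the key instruction stated both up front and again right before the real input. The task and examples are invented purely for illustration:

```python
# The examples deliberately show HARD cases (mixed sentiment, near-zero signal);
# the model can extrapolate the easy cases on its own. The task is made up.
FEW_SHOT = [
    ("Text: 'The battery died after two days. Otherwise a gorgeous phone.'",
     "Sentiment: negative (the defect outweighs the compliment)"),
    ("Text: 'It arrived.'",
     "Sentiment: neutral (no opinion is expressed)"),
]

KEY_INSTRUCTION = ("Classify the sentiment as positive, negative, or neutral, "
                   "and give a one-line reason.")

def build_prompt(text: str) -> str:
    parts = [KEY_INSTRUCTION]                 # key instruction up front...
    parts += [f"{inp}\n{out}" for inp, out in FEW_SHOT]
    parts.append(KEY_INSTRUCTION)             # ...and repeated before the real input
    parts.append(f"Text: '{text}'")
    return "\n\n".join(parts)

print(build_prompt("Honestly not sure how I feel about it."))
```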
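Likewise, the output-checking code the list tells you to omit doesn’t need to be clever. A hypothetical sketch, where call_llm stands in for whatever client function you actually use and the expected format matches the made-up task above:

```python
import re

MAX_RETRIES = 3

def is_valid(output: str) -> bool:
    """Reject outputs that break the expected format or run on past one answer."""
    if not re.search(r"Sentiment:\s*(positive|negative|neutral)", output):
        return False                          # missing or malformed answer
    if len(re.findall(r"Sentiment:", output)) > 1:
        return False                          # model kept going and invented extra examples
    return True

def generate_checked(prompt: str, call_llm) -> str:
    # Retry a few times, then fail loudly instead of training on / shipping junk.
    for _ in range(MAX_RETRIES):
        output = call_llm(prompt)
        if is_valid(output):
            return output
    raise RuntimeError("Model kept producing invalid output; fix the prompt, not the parser.")
```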
This list is not comprehensive. There are an infinite number of ways you can fuck up. But I hope this rundown of what NOT to do covered enough to be helpful! I figured I’d break up the talk about specific models with something that applies universally to all prompting AND model training. At least one thing in this list should be useful if you use LLMs.
One last thing: I’m considering rebranding this newsletter to “Leverage Language Models” or something else, because gosh darn it, the SEO for Prompting Weekly is ABYSMAL right now 😅 Anyway, if the name suddenly changes, that’s why. If you have a name idea, drop a comment!
That’s all for this week, thanks for reading, have a good one and I’ll see you next time!