Dataset PSA: Use many different system prompts

A simple trick to un-stupid your models

Jul 22, 2024

Model training is just difficult enough that relatively few people do it. Most AI hype bloggers probably can’t tell you what an OOM is, instead preferring to talk about the latest proprietary model or the 1234th iteration on chain of thought. (OOM means “Out Of Memory” btw). But if it’s true that:

“People who are really serious about software should make their own hardware.”

I’d argue that the ML version is:

“People who are really serious about using AI should make their own AI.”

Unfortunately there’s very little in the way of tutorials for cutting-edge model training techniques for LLMs. So if you’re getting into it, you need to learn a lot of non-obvious things. This involves either asking an expert friend, “Hey expert friend, why isn’t this thing working?!” or struggling and wasting money until you manage to figure it out yourself. Having done quite a lot of iteration at this point (and having pestered my friends a decent amount too — thanks Alignment Lab AI and Ikaridev!) I’m going to hopefully save some of the model/dataset creators here from some pain, and cover a simple pitfall when training models.

Use many different system prompts during training

Conversational training data often looks something like this (this is ShareGPT format):

{"conversations": [
{"from": "system", "value": "You are a helpful assistant, answer questions."},
{"from": "user", "value": "What is 2+2?"},
{"from": "assistant", "value": "That would be 4"}
]}

During training, the AI learns from the assistant responses. You might have many of these in a single .jsonl file (in the real file each record sits on one line; they’re spread out here for readability):

{"conversations": [
{"from": "system", "value": "You are a helpful assistant, answer questions."},
{"from": "user", "value": "What is 2+2?"},
{"from": "assistant", "value": "That would be 4"}
]}
{"conversations": [
{"from": "system", "value": "You are a helpful assistant, answer questions."},
{"from": "user", "value": "What is 4+4?"},
{"from": "assistant", "value": "8"}
]}
{"conversations": [
{"from": "system", "value": "You are a helpful assistant, answer questions."},
{"from": "user", "value": "What is the airspeed velocity of an unladen swallow?"},
{"from": "assistant", "value": "20.1 mph."}
]}

This is what I did for a while. Turns out it’s wrong in a very slight way that is liable to make your model not only incredibly stupid, but also very stubborn about instructions.

You see, if you train a model on only one system prompt across 10,000 samples and millions of tokens, it’s going to “overfit” to the prompt, and also learn to mostly ignore it. If you remove the prompt during inference, the LLM won’t remember anything it learned; if you change the prompt, the LLM will fail in strange ways; and if you leave it as-is, it’s very likely that the model will act in a stilted, overfit, and fragile way.

You’ll look at the data you’re putting in, and the outputs you’re getting out, and start wondering where it all went wrong — how can you be getting such trash results when you know your data quality is good?
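A quick way to check whether a dataset has this problem is to count its distinct system prompts. Here is a minimal sketch, assuming ShareGPT-style records like the ones above stored one per line; the function names are made up for illustration.

```python
import json
from collections import Counter


def system_prompt_counts(records):
    """Count how often each system prompt appears in ShareGPT-style records."""
    counts = Counter()
    for record in records:
        for turn in record["conversations"]:
            if turn["from"] == "system":
                counts[turn["value"]] += 1
    return counts


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Tiny in-memory example mirroring the samples above.
records = [
    {"conversations": [
        {"from": "system", "value": "You are a helpful assistant, answer questions."},
        {"from": "user", "value": "What is 2+2?"},
        {"from": "assistant", "value": "That would be 4"},
    ]},
    {"conversations": [
        {"from": "system", "value": "You are a helpful assistant, answer questions."},
        {"from": "user", "value": "What is 4+4?"},
        {"from": "assistant", "value": "8"},
    ]},
]

counts = system_prompt_counts(records)
print(f"{len(counts)} distinct system prompt(s) across {sum(counts.values())} samples")
```

For a real file, pass `load_jsonl("your_dataset.jsonl")` (a placeholder path) instead of the in-memory list. If the first number is 1 and the second is in the thousands, the dataset is a candidate for the fix described next.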

The solution is to have a set of 5 to 10 system prompts that you use for each dataset. These should each be really different:

Different lengths. Make some prompts sentence length, some in the multiple-thousand-token range. For the multi-thousand-token ones, you’ll typically want to include background information about the task you’re training on. If it’s style, describe the style; if it’s facts, give a summary of the subject you’re training about.
Different formats. Have blobs of text; have lists; have anything else you can think of.
Different order of information. If one of your prompts used during training explains a subject in the order “A B C”, try “C B A” on the next one. Or, even better, “C X Y Z”.
Different levels of polish. I’ve deliberately misspelled things in system prompts used during training before, and it does not seem to hurt.
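As a rough illustration of the "different formats" and "different order" points, the sketch below renders the same background facts as a blob, as a list, and as a single short sentence, shuffling the order each time. The facts, the style names, and `render_prompt` are all hypothetical; hand-written or AI-written variations will usually be more diverse than anything this mechanical.

```python
import random


def render_prompt(facts, style, rng):
    """Render the same facts in a shuffled order and one of a few formats."""
    ordered = list(facts)
    rng.shuffle(ordered)  # vary the order of information
    if style == "blob":
        return " ".join(ordered)
    if style == "list":
        return "Background:\n" + "\n".join(f"- {fact}" for fact in ordered)
    if style == "short":
        return ordered[0]  # a sentence-length prompt
    raise ValueError(f"unknown style: {style}")


# Hypothetical background facts for an arithmetic dataset.
facts = [
    "You answer arithmetic questions.",
    "Keep answers brief.",
    "If you are unsure, say so.",
]

rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [render_prompt(facts, style, rng) for style in ("blob", "list", "short")]
for variant in variants:
    print(variant, end="\n\n")
```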

Also, in general, I’m OK with AI generating these — the AI is not trained on the prompts (input), just the output, so the usual qualms about training on GPT data fall away here. So feel free to save time and use AI to create variations.
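To make "not trained on the prompts, just the output" concrete, here is a toy sketch of label masking. It assumes your trainer masks non-assistant tokens with -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`); whether that actually happens depends on your trainer settings. The whitespace "tokenizer" is a stand-in, not a real one.

```python
IGNORE_INDEX = -100  # tokens with this label contribute nothing to the loss

vocab = {}


def toy_tokenize(text):
    """Stand-in tokenizer: one id per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_labels(turns, tokenize):
    """Return (input_ids, labels) where only assistant tokens keep their labels."""
    input_ids, labels = [], []
    for turn in turns:
        ids = tokenize(turn["value"])
        input_ids.extend(ids)
        if turn["from"] == "assistant":
            labels.extend(ids)  # the model is trained to predict these
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # system/user text is masked
    return input_ids, labels


turns = [
    {"from": "system", "value": "Be helpful."},
    {"from": "user", "value": "What is 2+2?"},
    {"from": "assistant", "value": "That would be 4"},
]

input_ids, labels = build_labels(turns, toy_tokenize)
print(input_ids)
print(labels)
```

The system prompt still appears in `input_ids`, so the model conditions on it, but its label positions are masked, so the model is never trained to reproduce it.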

If your dataset has the same prompt for all its samples, then randomly replace each system prompt with one of your new set using a script. Do this for each dataset you have where all the system prompts are the same.
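A minimal sketch of such a script, assuming ShareGPT-format records stored one JSON object per line with the system prompt as the first turn. The replacement prompts and function names are placeholders; swap in your own set of 5 to 10 prompts.

```python
import json
import random


def replace_system_prompts(records, prompts, seed=0):
    """Return copies of records with each leading system prompt swapped at random."""
    rng = random.Random(seed)
    rewritten = []
    for record in records:
        turns = [dict(turn) for turn in record["conversations"]]  # don't mutate input
        if turns and turns[0]["from"] == "system":
            turns[0]["value"] = rng.choice(prompts)
        rewritten.append({**record, "conversations": turns})
    return rewritten


def rewrite_jsonl(src_path, dst_path, prompts, seed=0):
    """Read a .jsonl dataset, swap system prompts, and write a new .jsonl file."""
    with open(src_path, encoding="utf-8") as src:
        records = [json.loads(line) for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in replace_system_prompts(records, prompts, seed):
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder prompts; real ones should differ far more in length and format.
prompts = [
    "You are a helpful assistant, answer questions.",
    "Answer the user's question. Be brief.",
    "Rules:\n- answer the question asked\n- keep it short\n- say so if unsure",
]

records = [
    {"conversations": [
        {"from": "system", "value": "You are a helpful assistant, answer questions."},
        {"from": "user", "value": f"What is {n}+{n}?"},
        {"from": "assistant", "value": str(2 * n)},
    ]}
    for n in range(1, 7)
]

rewritten = replace_system_prompts(records, prompts, seed=42)
for record in rewritten:
    print(repr(record["conversations"][0]["value"][:40]))
```

Records without a leading system turn are left untouched here; whether to add a prompt to those is a separate choice. Writing to a new file rather than overwriting keeps the original dataset around for comparison.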

This move alone can turn a model from stupid to stable, but it’s not necessarily well-known. Hence, this post about it. Try it on your next training run! Not only should it improve the intelligence and resilience of your model, but it’s quite likely that it will also let you prompt engineer your custom finetuned model, which is where the real power of AI gets unlocked.

Here’s a bonus tip for a real hair-pulling problem:

If your model is repeating the same tokens/can’t make a coherent response (over and over and over and over and over and over and over and over and over and over ← like that) try checking your tokenizer settings.

I’ve noticed that adding new instruct tuning tokens (like <|im_start|>) to the tokenizer itself can sometimes cause real dealbreaking stupidity in models, especially when there isn’t enough new instruct tuning data for the LLM to learn these new tokens. In an axolotl config, you can see what’s being added to the tokenizer by the lines:

chat_template: chatml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

Here, the beginning-of-sequence (BOS), end-of-sequence (EOS), and unknown (UNK) tokens are being overridden, and the <|im_start|> and <|im_end|> tokens are being added. The chat_template ensures that instruct data is formatted using the chatml format; adding the tokens just controls how this format is actually tokenized for the model (each marker becomes a single new token instead of being split into existing pieces).
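For reference, here is roughly what chatml formatting does to a ShareGPT conversation. This is a hand-rolled sketch of the general ChatML layout, not the template code any particular library runs, and real templates may differ in details such as trailing newlines. The `ROLE_MAP` is an assumption that also covers the "human"/"gpt" names some ShareGPT datasets use.

```python
# Map ShareGPT "from" values to ChatML role names (assumed mapping).
ROLE_MAP = {
    "system": "system",
    "user": "user",
    "human": "user",
    "assistant": "assistant",
    "gpt": "assistant",
}


def to_chatml(turns):
    """Wrap each turn as <|im_start|>role\\ncontent<|im_end|>\\n and join them."""
    return "".join(
        f"<|im_start|>{ROLE_MAP[turn['from']]}\n{turn['value']}<|im_end|>\n"
        for turn in turns
    )


turns = [
    {"from": "system", "value": "You are a helpful assistant, answer questions."},
    {"from": "user", "value": "What is 2+2?"},
    {"from": "assistant", "value": "That would be 4"},
]

text = to_chatml(turns)
print(text)
```

The text looks the same whether or not <|im_start|> and <|im_end|> are in the tokenizer's vocabulary. What changes is whether each marker maps to one brand-new token id or to several existing ones.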

Even if your data is perfect, whatever model uses these settings above will probably come out completely unusably broken. I can’t think of when you’d ever want to override the bos and eos tokens; and hoping that the model generalizes to two new special tokens you added, <|im_start|> and <|im_end|>, is risky (and probably won’t happen if you have little data). I’ve had much better results not adding anything to the tokenizer, and leaving it alone as much as possible:

chat_template: chatml

You will probably want a pad token. Set it to your model’s EOS token. For Llama 3:

chat_template: chatml
special_tokens:
  pad_token: "<|end_of_text|>"
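If you want a guard against the risky settings creeping back in, a small pre-flight check can flag them before a run. The sketch below works on a plain dict that mirrors the YAML keys shown above (`special_tokens`, `tokens`). Loading the YAML itself is left out since it needs a third-party parser, and `tokenizer_config_warnings` is a made-up helper, not part of Axolotl.

```python
def tokenizer_config_warnings(config):
    """Flag BOS/EOS overrides and newly added tokens in a config dict."""
    warnings = []
    special = config.get("special_tokens") or {}
    for key in ("bos_token", "eos_token"):
        if key in special:
            warnings.append(f"special_tokens.{key} is overridden with {special[key]!r}")
    for token in config.get("tokens") or []:
        warnings.append(f"new token added to the tokenizer: {token!r}")
    return warnings


# Mirrors the risky config from earlier in the post.
risky = {
    "chat_template": "chatml",
    "special_tokens": {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"},
    "tokens": ["<|im_start|>", "<|im_end|>"],
}

# Mirrors the pad-token-only config for Llama 3.
safer = {
    "chat_template": "chatml",
    "special_tokens": {"pad_token": "<|end_of_text|>"},
}

for name, config in (("risky", risky), ("safer", safer)):
    problems = tokenizer_config_warnings(config)
    print(name, "->", problems if problems else "nothing flagged")
```

A warning here is only a prompt to double-check; there may be setups where adding tokens is intended and backed by enough data.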

If you found the stuff in this post interesting, I would appreciate a share so that new people can find it :)

There’s a lot to cover

There are a lot of “gotchas” with LLM creation. It’s a very finicky process with a lot of pullable levers, and it’s a process which can explode in your face and produce utter shit if just one setting is wrong. I’m still wondering how to tackle the subject of writing about LLM creation — tactics like this are useful only if everything else is working AND if you’re actually creating models.

For instance, knowing that you should vary up the system prompts won’t be useful if your learning rate is way too high (or way too low). Absolutely nothing will help you if you keep OOMing because you picked the wrong optimizer. I try to keep this newsletter about new things I discover that haven’t really been written about by anyone else, and I don’t have time to produce a full LLM training tutorial. But I feel bad rambling on and on about very niche stuff, so: if you want to learn more about model training and know basically nothing, the Axolotl examples have some sensible defaults for settings, and the Augmentoolkit repo has a video tutorial from the Verus fork of Augmentoolkit which shows how to rent a GPU, copy data over to it, and train a model using Axolotl (check out Video 2). That should hopefully be enough to get you started? After that point, it’s just burning money while you iterate 😅

Alright that’s all for this week. Thanks for reading, have a good one and I’ll see you next time!

Prompting Weekly
