With any skill, there usually exist a few simple tricks that can improve it greatly. The kind where you barely lift a finger and get a 20%+ improvement. Unfortunately, as simple as these tricks are, they're usually only obvious to us in hindsight. Fortunately, this also makes them very easy to share.
Here we’re going to cover one such trick, which relates to (post) training LLMs: making your supervised finetuning data resemble your pretraining data.
LLMs are, of course, trained in two main phases: pretraining and post-training. You first make them into a very capable autocomplete by training on a huge amount of text from the internet. Then you teach them to have conversations and follow instructions, with Supervised Fine-Tuning (SFT). Because, as it turns out, you can turn autocomplete into a conversational AI just by giving it a conversation and having it complete the next turn.
So that the code running the LLM knows when the model has finished its turn of the conversation, LLMs are usually trained to follow specific formats.
These formats are usually arbitrary. The common thread between them: they have room for a “high-level instruction” or “system prompt”, and then have a series of messages, each with a role:
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
{response}
^ That is ChatML.
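For concreteness, here's a rough Python sketch of how a list of messages might get rendered into that template (the function name and message schema are illustrative, not taken from any particular library):

# Render [{'role': ..., 'content': ...}, ...] as a ChatML string.
def to_chatml(messages: list[dict]) -> str:
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Leave the final assistant turn open so the model completes it.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]))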
What’s the problem with this?
This doesn't resemble any text you'd readily find on the internet. It's very distinct in its format and structure. And arbitrary. Especially since <|im_start|> is usually introduced to the model as a new special token: an atomic (indivisible), completely new addition to its vocabulary.
If you train an LLM on this format, sure, it'll learn the format. However, it'll also lose a lot of the "spark" it had when it was autocompleting the internet. Some of the original understanding will transfer over, but you're fundamentally teaching the model how to do one thing very well and then suddenly switching the task and format on it. The model will learn your *new* task of conversation decently, but at the expense of its original capabilities, because the arbitrary formatting of this new task makes it clearly different from the previous one.
But what if an LLM could have conversations without an arbitrary format?
Instruction: instruction
Human: blah blah
AI: Blah!
No special tokens, no surprising format, just back-and-forth like you might find in an internet forum or a screenplay or whatever else. Your conversational data now resembles your pretraining data much more closely. And as a result, the LLM's previous spark and capabilities seem to be preserved much better: instead of going from autocomplete to conversationalist, losing much of the depth and nuance of its pretraining and becoming shallow and fragile because it has to generalize from a small, largely synthetic SFT set, it is now just emphasizing a specific part of the autocomplete task it already knows. Because in the model's mind (once the format becomes normal enough), there is no real difference between the SFT data and the pretraining data.
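Here's the same kind of sketch for the plain format, using the Instruction/Human/AI labels from the example above; any natural-looking convention works:

# Render a conversation as plain text that looks like pretraining data.
def to_plain_text(instruction: str, turns: list[dict]) -> str:
    role_names = {"user": "Human", "assistant": "AI"}
    lines = [f"Instruction: {instruction}", ""]
    for t in turns:
        lines.append(f"{role_names[t['role']]}: {t['content']}")
    return "\n".join(lines)

print(to_plain_text(
    "Answer the user's questions.",
    [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4."},
    ],
))

No new tokens, no new structure: just text the base model could plausibly have seen during pretraining.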
And as it turns out, training like this seems to make models work much better, at least in my experience. If you're training them on facts, they remember the facts from pretraining better and can quote the original texts; if you're teaching them creative writing, they retain much more human spark; etc. Whatever you're doing, this seems to work. And from what I hear, Anthropic does something similar.
Abandon the Arbitrary. Embrace Pretraining.
And if you do this, train on the inputs as well (completion_only = False). The human side of the conversation is more data, and it's important to the training dynamics that you use it.
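To make the "train on inputs" point concrete, here's a minimal sketch at the label level, assuming a Hugging Face tokenizer; the model name and sequence length are placeholders, and the exact flag name depends on your training framework:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder base model

text = "Instruction: Answer the user's questions.\n\nHuman: What is 2 + 2?\nAI: 4."
enc = tokenizer(text, truncation=True, max_length=1024)

# Train on everything: the labels are just the input ids, human turns included.
labels_full = list(enc["input_ids"])

# Completion-only training would instead mask the non-assistant tokens, e.g.:
# labels_masked = [-100] * prompt_len + enc["input_ids"][prompt_len:]
# where -100 is the ignore index for the cross-entropy loss.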
Sorry for the late post this week. I think Sunday will be the day I post normally from here on out. So, weekly, on Sunday.
Augmentoolkit got some fixes this week, and some cool stuff is in the pipeline there, coming out soon!
Following popular demand after that last post, we’re keeping with the “short and sweet” theme. I hope this was useful to you :)
That’s all for this week, and I’ll see you next time!