“What model do you train on?”
This is a question I get often.
When I give my answer, I often get the follow-up ‘question’:
“What the hell, isn’t that very old and out of date!!”
Let me explain myself quickly and hopefully help you improve your training runs in the process.
I train a lot of models. Ever since I started building Augmentoolkit, many of these training runs have been about teaching models new facts. And these days I always use Mistral 7b v0.2.
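For the curious, here is roughly what one of these runs looks like. This is a minimal sketch using Hugging Face transformers and peft, not my exact pipeline: the repo id is a community mirror of the raw v0.2 base weights, and the data path, the "text" field, and the hyperparameters are placeholders rather than a recipe.

```python
# Minimal LoRA fine-tune sketch on Mistral 7b v0.2.
# Repo id, data path, "text" field, and hyperparameters are all placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE_MODEL = "mistral-community/Mistral-7B-v0.2"  # community mirror of the base v0.2 weights

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer ships without a pad token

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# One training example per JSON line, with the full prompt + answer in a "text" field.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=2,
                           learning_rate=2e-4, bf16=True, logging_steps=10),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Swap BASE_MODEL for a newer model, keep everything else fixed, and you can see the difference for yourself.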
Truth is, I have tried training on some more recent models, such as Llama 3.1 8b. More parameters, newer/smarter model—common sense suggests that these things would work much better.
But they don’t. They come out dumber, with weaker factual memorization, despite the much greater size.
Why?
Overtrained models are harder to fine-tune.
To push benchmark scores higher, many companies abandoned the Chinchilla scaling laws and pumped obscene amounts of data through their models. Sure, this makes them better generalists, but it also makes it harder to turn them into specialists. And, obviously, the custom AI world is all about making specialists.
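For a sense of scale: the Chinchilla rule of thumb is roughly 20 training tokens per parameter, and Meta reports Llama 3's pretraining corpus at over 15 trillion tokens. The arithmetic below is strictly back-of-the-envelope (Mistral never published 7b's token count, so treat it as a one-sided comparison):

```python
# Back-of-the-envelope: how far past the Chinchilla-optimal token budget
# is an 8b model pretrained on ~15T tokens? (~20 tokens per parameter is
# the usual rule of thumb from Hoffmann et al., 2022.)
params = 8e9                       # Llama 3.1 8b
chinchilla_tokens = 20 * params    # ~160B tokens would be compute-optimal
reported_tokens = 15e12            # Meta's reported ~15T-token pretraining corpus

print(f"Chinchilla-optimal: ~{chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Reported:           ~{reported_tokens / 1e12:.0f}T tokens")
print(f"Ratio:              ~{reported_tokens / chinchilla_tokens:.0f}x the optimal budget")
```

Call it roughly 90x the compute-optimal budget. That's great for a general-purpose chat model you'll never touch again; it's less great when you want the weights to soak up your facts.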
Analogy: if you’re going to paint a painting, you want to start with a blank canvas, not someone else’s painting.
Mistral 7b v0.2 is closer to a blank canvas than Llama 3.1 is. Perhaps other recent models are less overtrained — Meta DID rely on scale a lot as its GenAI org fell further and further behind — but Mistral 7b is a good reliable daily driver. Also, because it is older, its pretraining data comes with less GPT data contamination — good for avoiding GPTisms.
You know the old prompting principle: if you’re fighting the model, stop. The same goes for finetuning. Whether you’re teaching a model new facts, new writing styles, or new tasks, some models are easier to teach than others. Make your job easier and train the right base models.
For now I’ll keep using Mistral 7b. Maybe not all new models are overtrained as hell — I suspect I could get some decent mileage out of one of the new Qwens, for instance. If you’ve had success training a more modern model on serious tasks, let me know in the comments or the Augmentoolkit Discord! But otherwise, remember that these things fundamentally map an input to an output, and that when you’re training, so long as the model has not been trained past the point of no return, your data and hyperparameters probably matter more than your base model choice.
I’m trying a shorter, more spontaneous format that focuses on a core idea, a singular argument. Before, writing articles got tiresome because I had to come up with example after example, or cover multiple cases. If they’re short and sweet I can cover more topics and stay happy writing these, which will make them better anyway. Let me know what you think of this more condensed format!
Anyway, that’s all for now (may be more posts later this week), have a good one and I’ll see you next time!
Tried Nemo 12B yet? That was the natural evolution for all my finetunes once I moved on from Mistral 7B. Sure, it's a bit bigger, but it has this same sponge-like effect when being trained. Extremely versatile. Only downside is the larger size, I suppose! (Though depending on the task, this may be an advantage)
You know what they say: "If it ain't broke, don't fix it."