Domain Expert Creation: Do's and Don'ts
How to give yourself a fighting chance when making domain experts.
As the creator of Augmentoolkit, which is largely (but not entirely) an open source project for creating domain experts, I get a lot of questions about building this kind of model. This article gives people who are getting started with this class of model some pointers. It’s aimed at people who already know how to kick off a training run, but need to know how, specifically, to get a model to reliably memorize facts.
First off — can LLMs learn from finetuning? Yes. Don’t take my word for it.
Still, a lot of the people trying factual finetuning have a bad time. What determines whether one expert model will hallucinate like a 1b while another will overfit just fine? In this post I’m going to walk through my current understanding of domain experts, having built a large number of them personally and for clients, and cover some of the main things to do and the main things to avoid.
Keep in mind: domain experts can be really hard and painful to build even if you have the data. I do not claim to have mastered or solved this process completely. I have, however, figured out a few things, the most important of which are detailed below in case they are useful for you.
Do NOT: use LoRA
In many ways, teaching a model new facts is deliberate overfitting. When it is asked "Who writes Prompting Weekly?" you want it to have memorized the response "Evan Armstrong." When asked a question that appears in the dataset, you want it to give the answer from that dataset, word for word. This memorization is pretty much the definition of overfitting, but it makes sense: unlike finetuning for understanding a task, there’s a lot less nuance to facts, they are right or they are wrong. And you bake the facts into the model by overfitting it, which is easier to do when you change a lot of weights.
LoRAs are a method of parameter-efficient finetuning: they change a lot fewer weights, by definition, and therefore make it harder to overfit. This means they also make it way harder to teach a model facts. Use LoRAs for style (if you’re making RP models, go for it), but when making domain experts, stay far, far away.
Do: Full Parameter Finetuning
Perhaps you saw this coming, but if we don’t use LoRAs because they don’t adjust enough weights, we DO want to use full parameter finetunes because they DO change a lot of weights (all of them, in fact). One thing to note: Augmentoolkit’s README originally recommended GaLore, since it had more or less the same effect as a full finetune but with less memory usage — however, due to lack of support, no distributed training, and the surprising cheapness of full parameter finetuning, I now recommend using full finetuning. It’s not too expensive if you use the paged_adamw_8bit optimizer.
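To make “full parameter” concrete: in Axolotl, a full finetune is mostly a matter of leaving the adapter settings out of the config entirely. Here is a minimal sketch of the relevant lines (the base model and values are placeholders, not recommendations):

base_model: mistralai/Mistral-7B-v0.1   # placeholder; use whatever base you're actually training
# Note: no "adapter: lora" line anywhere. Omitting the adapter is what makes this a
# full-parameter finetune rather than a LoRA.
optimizer: paged_adamw_8bit             # keeps optimizer memory manageable on a full finetune
learning_rate: 0.00002                  # placeholder; see the learning rate advice further down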
Do: Continued Pretraining
Models may not learn only during pretraining, but they sure do learn a lot during pretraining. If you have completely new subject matter that your LLM has never seen before,* you will get FAR better results if you do training in a two-step process: first, take the documents you’re using to make domain-specific instruct data, and do continued pretraining on those for a large number of epochs (think 10 or 13 epochs; you want to really bake it in). Once that is complete, do your domain-expert finetuning on top of the new base model to reinforce the knowledge and teach the LLM how to answer questions and follow instructions about its new facts.
Continued pretraining is simply the process of teaching an LLM to autocomplete a document (with no instructions). It’s how base models are made in the first place, though during continued pretraining we’re working in a much more focused way.
To do continued pretraining, get your documents into a JSONL file (one JSON object per document, each on its own line) like this:
{"text": "...the whole document's text"}
And then point at it in your Axolotl config like this:
datasets:
  - path: json
    data_files: hidden_pretraining-us-army.jsonl
    ds_type: json
    type: completion
If you’re using Augmentoolkit, it will automatically turn the documents you’re making instruct data from into continued pretraining data as well, so you can just use those.
*lots of domain expert training is helped by the fact that the LLM has already learned something about the subject. If the subject is completely new to the AI, you DEFINITELY want to do this.
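To make the two-step process concrete, here is a rough sketch of how the two configs chain together: the finetuning run points its base_model at the output of the pretraining run. Model names, paths, and epoch counts are placeholders:

# Step 1: continued pretraining config (excerpt; would also contain the completion dataset block above)
base_model: mistralai/Mistral-7B-v0.1   # placeholder starting model
num_epochs: 10                          # really bake the facts in
output_dir: ./domain-pretrained         # this becomes your new, domain-aware base model
---
# Step 2: domain-expert finetuning config (excerpt; points at your instruct data instead)
base_model: ./domain-pretrained         # start from the model produced in step 1
num_epochs: 6                           # placeholder; still more than a typical finetune
output_dir: ./domain-expert

The only structural trick is that second base_model line: the “base model” for the instruct stage is the checkpoint the pretraining stage produced.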
Do: These Assorted things
As high a learning rate as you can get away with, without exploding the loss.
Obscene numbers of epochs in both pretraining and finetuning
Relatively high batch size
More training steps, if you can manage it (this often means sample_packing: false is the way to go; see the config sketch after this list)
Varied and diverse data about the domain you want the model to learn.
If you really like kicking off training runs, doing a high-batch-size pretraining run, followed by a low-batch-size pretraining run, followed by high- and low-batch-size finetuning runs, seems to give the model a somewhat better understanding of the facts: the high-batch-size runs help it broadly, and the low-batch-size runs help it with the details.
Experiment with small Mistral models because they’re cheaper to train, thanks to their smaller tokenizer vocabulary. When it comes time to productionize, use Llama 3 8b or another good model.
If you have the budget, more parameters probably means better learning, too. However, full finetuning a 70b is not something I’ve had the stomach or need to attempt yet. In many cases, 7bs can learn just fine.
I believe that you can mix generalist assistant data into the domain-specific finetuning just fine. Up to a point.
Use good-quality pretraining data. After the US Army model, I’ve come to believe that pretraining data quality and structure are absolutely key to final model performance, understanding, and generalization. You won’t get better documentation than in the military, after all.
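Pulling the knobs from this list into one place, a rough Axolotl hyperparameter sketch might look like the following. Every number is a placeholder to sweep for your own domain, not a recommendation:

learning_rate: 0.0001             # as high as you can get away with before the loss explodes
num_epochs: 12                    # obscene by normal finetuning standards
micro_batch_size: 4               # combined with GPU count and grad accum, a relatively high batch size
gradient_accumulation_steps: 8
sample_packing: false             # more training steps per epoch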
Don’t: These other assorted things
Don’t use a temperature other than 0 during inference if you want the model to restate facts reliably.
Mixing generic pretraining data into the domain-specific pretraining data does not seem to help generalist performance, and it does no favors to factual recall either. Probably avoid that.
Don’t forget that the number of GPUs you use impacts the batch size. If you’re having trouble reproducing results, think about the hardware. GPUs, gradient accumulation steps, and micro batch size all multiply together to determine the effective batch size of your training (see the worked example after this list).
(I say this because I forgot about that fact and once had a… miserable time… trying to figure out why my config worked one day and failed the next)
I typically use no noisy embedding alpha or weight decay, since those add noise and regularization, and we’re deliberately trying to overfit. You don’t want the “noise” to confuse the model about facts. Facts are sensitive.
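To spell out the effective-batch-size arithmetic and the regularization point from the list above (numbers are placeholders):

# effective batch size = number of GPUs x gradient_accumulation_steps x micro_batch_size
# e.g. 4 GPUs x 8 x 2 = 64; run the same config on 2 GPUs and it silently becomes 32,
# which is exactly how a config can "work one day and fail the next."
gradient_accumulation_steps: 8
micro_batch_size: 2
weight_decay: 0                   # no weight decay; we are deliberately overfitting
# neftune_noise_alpha is left unset: no noisy embeddings when memorizing facts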
Don’t: Get discouraged by other people’s better results
Models respond to different domains differently. I’ve had tried-and-true methods bounce off of certain domains for no conceivable reason. Lots of domain expert finetuning runs, such as most of the ones I post on Reddit, are helped by the fact that the LLM already knows a bit about the subject before any further training’s been done. It’s (much) harder if the AI knows nothing, but you can still get it to work with enough raw data, epochs, and elbow grease. It may take dozens of iterations before you have agreeable performance, so make sure you ask your client for a decent budget if you’re doing this professionally: not so much because it’s very expensive, but because it’s maddening and infuriating at times, and good money makes that a bit less painful.
Domain experts are hard, and things will not work as well as you want them to at first, but every time I’ve run into trouble, I’ve also been able to figure out a way to get decent performance regardless. You can, too. The fundamental process of training that you are doing when you’re making a domain expert is the same one that the LLM learned all its current knowledge from in the first place. Your goal is attainable; you just need the right settings, the right data, and the patience to withstand things going wrong for the first couple of runs. Hang in there!
And also — you will want to start with something an LLM already knows something about if it’s your first time. Replicate before you iterate if you’re learning. The first LLM I finetuned was the small BLOOMZ model, following along with a tutorial from deeplearning.ai. Starting small is not just fine, it’s preferable.
That’s this week’s post. Yes, I am in fact alive, though I’ve been sorta away. Perhaps I should change the name to “Prompting Monthly”. Really, it’s not so much been a lack of schedule availability as a lack of inspiration and fear that my interests have diverged from what they were when I started this blog — I began by prompting, but now I mostly do finetuning and dataset generation to build custom LLMs, which I believe is a really cool topic but also a much narrower one. That, plus the fact that so many of the things I’ve been learning are hard to write about thanks to NDAs, meant that I was starving for subjects to write about for some time. I still somewhat am? But I’ve at least sorted out my schedule, so if I have ideas I now have the time and discipline to put them out there.
In happier news, Augmentoolkit reached 1k stars! I am incredibly happy about this — huge thanks to all contributors, issue-creators, and people who’ve shared and supported the project.
Narrow subject matter of this post aside, I hope that the tips were useful to those of you training models, and at least intellectually interesting even if you aren’t training models. That’s all for this week, have a good one and I’ll see you next time!
(And trust me, by GOD— There will be a next time!!!)
(Edit: You know how I don’t do editing on my posts? I meant to say “I hope that the tips were useful to those of you training models, and at least intellectually interesting even if you aren’t training models” but ended up saying “I hope that the tips were useful to those of you training models, and at least intellectually interesting even if you aren’t.” Which is very insulting and was not my intention, even if it is accidentally hilarious. Sorry about that!)