Prompting is a great first-resort weapon in an LLM-user’s toolkit. With the right techniques, it’s also a powerful second-resort and third-resort. But there are some things that will be fundamentally tricky to get an LLM to do with in-context learning alone. When these issues crop up, what do you do?
Today we’re going to talk about when and how creating a custom LLM on your own data fixes two key issues. We’ll look at why prompting often falls short in these areas, and where finetuning your own model comes in handy. Because this newsletter isn’t “finetuning weekly,” we’re also going to explore how even finetuning a model boils down to prompting in the end.
Let’s get started talking about some key problems with applying LLMs, and how to solve them!
Issue #1: Recalling Factual Information
AIs were trained in the past, and so only have knowledge of the past. Your new startup, new product line, new organization, etc., likely did not exist in the past. Therefore if you’re deploying an AI as part of your new thing, the AI will have no intrinsic knowledge of the system it’s a part of. This is a very frequent cause of hallucinations.
Consider some likely scenarios:
A user asks about an issue similar to, but not identical to, one cited in a help center article; the AI retrieves the article using Retrieval Augmented Generation (RAG) and quotes it verbatim, failing to solve the problem.
A user asks something basic about the organization maintaining the AI, and the AI hallucinates or cannot answer because the needed information is not in its context or ingested documents.
A user asks an AI to “list the important features” of a platform and the AI only gets one or two of them because no such summary document exists, and the AI (using RAG) can only retrieve things that have already been written — it cannot combine and extrapolate from them.
These issues and others are typical faults of an AI lacking a “big picture” understanding of a problem, task, or service. Often, the LLM will simply parrot what’s in the retrieved context rather than paying proper attention to the user’s query. If you instead tell the LLM to treat the retrieved context as a reference rather than as its sole source of truth, you run a greater risk of hallucinations. And of course, if RAG retrieves the wrong document, your generic LLM is doomed.
With this many failure modes, RAG alone is often insufficient for reliable use. The information an LLM needs to know is frequently too extensive to fit in the context window, and too nuanced for retrieval to surface consistently. The easiest solution is to create a custom LLM: by finetuning an LLM on data related to your project (whether that’s sales copy, help center articles, or backend documentation), it can come to understand the “big picture.”
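As a rough illustration, here’s a minimal sketch of what such a finetune can look like using Hugging Face’s TRL library. The model name, file name, and hyperparameters are placeholder assumptions, not a recipe; it expects a JSONL file of chat-format records built from your own documents.

```python
# A minimal sketch of finetuning on your own data with Hugging Face TRL.
# Model, file name, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Chat-format records built from your project's documents, e.g.
# {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
dataset = load_dataset("json", data_files="domain_qa.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any open-weights instruct model
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="custom-llm",
        num_train_epochs=3,              # small domain datasets need a few passes
        per_device_train_batch_size=2,
        learning_rate=2e-5,
    ),
)
trainer.train()
```

Full-parameter training at this scale needs serious hardware; passing a LoRA config through TRL’s peft_config argument is the usual budget option.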
When you ask GPT about a major event from before its knowledge cutoff, it probably won’t hallucinate. Finetuning on your custom data can give the model you create an even stronger version of that effect, for whatever information you want it to know.
Take this example from some recent client work I did for the open-source blockchain protocol, Verus (the work is being done in the open and is open source, so I’m not breaking any privacy or nondisclosure here). The goal was to teach an LLM to answer questions about Verus from memory:
This is without RAG, and the statements are accurate. If you doubted that finetuning can teach a model factual information, there’s your counterexample.
Finetuning also helps fix a second critical flaw with RAG: the model failing to understand what it sees in context. Put simply, even if you can put the relevant information in front of the LLM, there’s no guarantee it will interpret that information correctly. Finetuning on quality instruct data tailored to your requirements dramatically improves reliability in cases like this.
I’ve also seen finetuned models correctly answer questions even when the retrieved document was wrong; training can act as a “last line of defence”, improving reliability for a production application. If you’re dumping hours of prompting work into getting an LLM to output the right information for an important system, it might pay better dividends to prompt and code a synthetic data generation pipeline and train up an LLM, as sketched below. Something like Augmentoolkit can serve as a good base to modify.
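To make that concrete, here’s a bare-bones sketch of such a pipeline: chunk your documents, have an LLM write a question-and-answer pair grounded in each chunk, and save the results as chat-format training data. The prompt, model name, file names, and chunking scheme are all illustrative assumptions; a real pipeline like Augmentoolkit layers validation steps on top.

```python
# A bare-bones synthetic data generation sketch: document chunks in,
# chat-format QA training examples out. Names and prompt are assumptions.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Write one question a user might plausibly ask, plus a faithful answer, "
    "based only on the following text. Reply as JSON with the keys "
    "'question' and 'answer'.\n\n"
)

def chunk_text(text: str, size: int = 2000) -> list[str]:
    # Naive fixed-size chunking; real pipelines split on semantic boundaries.
    return [text[i : i + size] for i in range(0, len(text), size)]

with open("docs.txt") as src, open("domain_qa.jsonl", "w") as out:
    for chunk in chunk_text(src.read()):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": PROMPT + chunk}],
            response_format={"type": "json_object"},
        )
        pair = json.loads(resp.choices[0].message.content)
        out.write(json.dumps({"messages": [
            {"role": "user", "content": pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ]}) + "\n")
```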
Issue #2: Writing Style
When you’re trying to adjust an LLM’s writing style with prompting, you’ll often have to meet it halfway. By this I mean that even if you devote the time to write 10,000 tokens of few-shot examples, the writing style your AI uses will likely be some hybrid between what your examples show and how the AI has been trained to speak. You can certainly influence writing style with prompting, especially with open-source models, but you rarely have total control over it. If branding is important, or your task benefits from precise control over the tone of voice the AI uses, then a custom LLM is probably the easiest way to get there. Not because creating a custom LLM is easy (it isn’t) but because every other approach struggles so much with this specific problem.
But here we run into an issue. If absolute control over an LLM’s writing style is extremely difficult to achieve with prompting alone, then how do you create the dataset for your custom LLM? Since most dataset creation involves using an LLM, we face a real chicken-and-egg problem here.
In my opinion there are a few good solutions, and they basically revolve around using different types of synthetic data. “Augmented” data (the term I coined for conversational datasets like the ones Augmentoolkit generates) can often use direct quotes from a given source of information; since the wording isn’t changed, the LLM is trained on that exact wording and its style is influenced accordingly. Augmented data therefore allows you to impart additional knowledge to an LLM while also influencing its style. Another kind of synthetic data, one that’s existed for a while and that I call “annotated” data, can also be a good solution for aligning the writing style of a model.
Annotated data is essentially where you take some text that you want an LLM to be able to produce (like, say, a poem about bears) and use an LLM to generate an instruction that could have produced that text (in this example, “Write a poem about bears”). When training your new LLM, you then use the generated instruction as the input and the human-written source text as the output. It’s a cheap way of getting quality partly-human-written instruct data, though the method is sometimes held back by its reliance on well-formatted human-written text.
When it comes to aligning the writing style of AI responses, annotated data trains the model directly on human-written text, so if you have enough text to annotate, this is going to be a very good approach to take. It’s very fast to put together, too: most annotation pipelines I create have just one prompt, as the sketch below shows.
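Here’s what that can look like in practice: a single-prompt annotation sketch that asks an LLM to backtranslate an instruction for each piece of human-written text, then pairs the generated instruction (input) with the untouched human text (output). The model name and file layout are assumptions for illustration.

```python
# A single-prompt "annotation" sketch: generate the instruction, keep the
# human-written text verbatim as the training target. Names are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def annotate(human_text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Write the single instruction that could have produced "
                       "the following text. Reply with only the instruction.\n\n"
                       + human_text,
        }],
    )
    instruction = resp.choices[0].message.content.strip()
    # The human text is the target, so its exact style is what gets trained on.
    return {"messages": [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": human_text},
    ]}

# Assumes one JSON object per line, each with a "text" field of human writing.
with open("human_writing.jsonl") as src, open("annotated.jsonl", "w") as out:
    for line in src:
        out.write(json.dumps(annotate(json.loads(line)["text"])) + "\n")
```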
Why prompting is still important
We just walked through two problems that prompting struggles to solve directly. Am I turning my back on the namesake of this newsletter? Hardly. In an era where the best datasets are synthetic in some way, finetuning could be called indirect prompting at a greater scale. The way you construct your prompts determines what kind of datasets you generate, which in turn shapes the model trained on that data in a fundamental way. Having gotten back into finetuning after spending many months doing just prompting, I will say that it’s very liberating to have such precise control over the kind of model I can produce by combining these two areas of expertise. I disagree with the people who spend time delineating when you should prompt engineer and when you should finetune; they’re both part of machine learning now, and they’re both important.
Summary
Approaches to modifying LLM behavior without changing the weights can struggle to influence the LLM’s knowledge and writing style. These two problems are, however, straightforward to solve with finetuning. Finetuning LLMs often involves heavy prompting, so it can be viewed as a form of indirect prompt engineering. It’s important for someone working with ML to have complementary skills such as finetuning and prompting, which enhance each other and make previously unsolvable problems solvable.
A good place to look for an example of synthetic data generation is my open source project, Augmentoolkit (https://github.com/e-p-armstrong/augmentoolkit/tree/master). MIT licensed. You can also get in touch with me using my contact info there, if you have a professional project you’re working on.
Finally, if you want to read more posts like this and found this helpful, consider subscribing and sharing!
Alright that’s all for this week. Thanks for reading, have a good one and I’ll see you next time!