Human-Sourced, AI-Augmented: a promising solution for open source conversational data
It all started with Roleplay
“What if I could recreate my favorite fictional character as an AI?”
This is probably something anyone who has a favorite story and likes LLMs has thought of.
(This is the lead-up to the thing in the title, so bear with me.)
Currently, the solution to this problem in the open-source AI community usually involves taking a general-purpose roleplay model like MythoMax and prompting it to behave like a given character (these prompts are typically called “character cards”). However, anyone who has tried this will know some of the problems involved:
Details, details, details. When a detail is slightly off, it’s VERY immersion-breaking. A character suddenly forgets what they look like, or makes a comment about their world that makes absolutely no sense? That’s a regenerate. Retrieval Augmented Generation can only go so far, and might not capture nuance in a character’s reply. On the other side of the spectrum, there are plenty of character cards, like these, which have MASSIVE lists of traits and physical characteristics in pseudocode format that are hard for humans to parse, let alone a mere 13b model (the largest type of open source model many people can run on their hardware).
Here’s an example of how things can go gloriously(?) wrong (when I tried to recreate Kurisu from Steins;Gate as an AI).
Personality. Writing style is a subtle, nuanced thing, and it’s very difficult to convey in a character card. This is partly addressable by giving a bot plenty of dialogue examples in the prompt, but if those are long, they can eat up a large chunk of the context window. Some popular cards have over a thousand tokens(!) in the character description alone, which puts a serious dent in a model’s memory (there’s a quick way to check this for yourself, sketched just after this list).
Intelligence. Feeding a model, especially a smaller one (such as might run on consumer GPUs), all the details it needs to represent a character faithfully will often make the model stupid. GPT power users will know that the longer a chat goes on, the more the AI tends to forget and the more reasoning errors it runs into; this seems to go doubly for Llama models. There’s been research about how models Lose the Middle of their prompts. Thus, cramming in all the details a model needs to know what a character wears, what their personality is, etc. stands a good chance of making it incapable of stringing together even a couple of coherent sentences without a few regenerates.
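On that context-window point: here’s a minimal sketch of how you might measure what a character card costs you, assuming the tiktoken library (its cl100k_base tokenizer is GPT-4’s, so it’s only a rough proxy for a Llama tokenizer, and the filename here is hypothetical):

import tiktoken

# Hypothetical file containing a character card's text
card = open("character_card.txt").read()

# cl100k_base is GPT-4's tokenizer; counts differ somewhat from a Llama
# tokenizer, but the order of magnitude carries over
enc = tiktoken.get_encoding("cl100k_base")
n_tokens = len(enc.encode(card))

print(f"Card uses {n_tokens} tokens")
print(f"That's {n_tokens / 4096:.0%} of a 4096-token Llama 2 context window")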
This leads us to the natural next step. If in-context learning (read: very fancy lingo for prompting) doesn’t work, we finetune. But on what? The source material, of course! Surely that will lead to a model that doesn’t need a lot of prompting to behave in-character!
…Except I tried this a while ago and it makes the model braindead. So what gives?
A rule of thumb I’ve slowly picked up while learning AI: if you expect a machine to learn a task, the task should fundamentally be learnable. That is, what you’re asking the AI to do should make sense: given the input, the AI should have all the information it needs to produce something like the target output.
So if you feed a model the raw lines of a video game like Steins;Gate, and the AI is given something like this:
Kurisu: "Huh?"
And is expected to output this as the next line:
Rintaro: "Farewell! Muhahaha!"
You can see why the task doesn’t make sense. “Huh?” has no context behind it; there are no previous lines setting it up; the model isn’t told who it’s supposed to be pretending to be (what its task is), what the personality of each character is, and so on. If you were given the same task as the AI, you would fail just as badly, so why expect the model to pull it off?
Data quality is king. And (in my opinion) it’s king because it helps the task make more sense. In this case, the model needs the context of an instruction; it needs previous lines; and it needs to know what kind of scene it’s in. It quite possibly needs more than just these things, but they’re all I’ve managed to think of and implement so far, so they’re all I’m talking about in this post. The issue with some of these (like the context of the scene) is that this information simply doesn’t exist in the source material the model is trained on. And, as those of you who RP with bots might have noticed, the lines themselves aren’t ideal either, since they don’t conform to the typical RP format, with *physical actions* and whatnot.
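For contrast, here’s a hypothetical sketch of what a learnable version of the Kurisu/Rintaro example above might look like once those missing pieces are filled in. The field names, scenario text, and *actions* here are my own illustration, not the exact format I trained on:

# One augmented training example: given the system prompt, scenario, and
# history as input, predicting the response is now a task that makes sense.
example = {
    "system": (
        "You are roleplaying as characters from Steins;Gate. Rintaro is a "
        "self-styled mad scientist prone to theatrical outbursts; Kurisu is "
        "a sharp-tongued neuroscientist."
    ),
    # AI-generated scene summary (invented here for illustration)
    "scenario": "In the cramped lab, Rintaro interrupts Kurisu's work to announce his escape from imagined pursuers.",
    "history": [
        'Rintaro: *He flings the lab door open, trench coat billowing.* "They\'ve found me at last!"',
        'Kurisu: *She glances up from her laptop, unimpressed.* "Huh?"',
    ],
    "response": 'Rintaro: *He strikes a pose and backs out the doorway.* "Farewell! Muhahaha!"',
}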
The solution? AI augmentation of course, it’s in the title of the damn post.
But what is augmentation? It’s not making something up from scratch. Generating datasets with AI is nothing new: websites like ShareGPT are frequently hit up by model creators who seek to train their models on GPT-outputted data, in the hopes that this will help them catch up to OpenAI. There are some problems with this, and there are some solutions to those problems… that’s a whole thing I’m not going to get into. The appeal of AI-generated data is that it’s much cheaper for people in the open-source community, who can’t afford annotators, to use and iterate on. But it has problems, and in cases where no AI exists that can write the exact kind of text we want, what do we, as hobbyists, do?
One thing we can do is take some human-made text that’s pretty close to what we want, and use AI to get it into the exact format our model needs. This is augmentation. In my case, that involved taking the raw text lines of the video game/visual novel Steins;Gate and feeding them through GPT-4 to write “scenarios” that describe each scene, and *actions* to go along with the lines of dialogue. The results? A model that had been nearly braindead, and that consistently broke after about a dozen exchanges back and forth, became the 13th highest-ranked model on the well-known Ayumi leaderboard in the 13b class. Here’s the repo with the training code, for the curious.
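To make that concrete, here’s a simplified sketch of what the augmentation step can look like, using the openai Python package. This is illustrative, not the actual code from the repo, and the prompt wording is my own:

import openai

def augment_scene(raw_lines: str) -> str:
    """Ask GPT-4 to add a scenario summary and *actions* to raw script lines."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You reformat visual novel script excerpts into roleplay "
                    "training data. First write a one-paragraph 'Scenario:' "
                    "describing the scene, then rewrite each dialogue line, "
                    "adding *physical actions* between asterisks. Keep every "
                    "line's meaning and speaker intact."
                ),
            },
            {"role": "user", "content": raw_lines},
        ],
        temperature=0.7,
    )
    return response["choices"][0]["message"]["content"]

print(augment_scene('Kurisu: "Huh?"\nRintaro: "Farewell! Muhahaha!"'))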
Since we in the open-source community can’t leverage massive human resources or millions of users, we have to find another source of data. But since plaintext from the internet is often good only for pretraining, we turn to AI (and, in the case of many model creators I speak to, our own private conversations with AI) for our data. This is painfully slow to harvest, and though such datasets can be high-quality, they’re very hard to share: even the most shameless model creators are rightfully hesitant to post their NSFL chats publicly as training data. Meanwhile, there’s a massive amount of public domain human creative work out in the world, but precious little of it is in the right format for instruct tuning. My proposal is to use AI to reformat this work so that it is in the right format, and then enhance it with AI too. And I believe I’ve shown that this approach works, since the model trained with it did pretty well.
Let me briefly monologue about the overambitious goals I have for this neat trick before I conclude this post: the lack of datasets is a massive problem for aspiring model creators (it was voted the primary roadblock in a poll hosted by the Chai team for their AI Prize) even though there’s a wealth of high-quality writing out there in the world. I hope that some of the skilled and experienced people in the open-source model finetuning community can take this AI augmentation approach and improve it. I dream that AI augmenting human-made data can free us from painstakingly building entire datasets from just our own manually-edited conversations — and from filtering and combing through PIPPA and a handful of other open datasets to try and find a combination of subsets that kinda works slightly better. Model creators are not supposed to be annotators. Open source is meant to move fast, introduce new things, be easy to contribute to, and be shareable; few of the existing approaches to open data generation achieve these. It’s no wonder many of the top RP models are merges: the data for finetunes is just so hard to come by! So I hope this approach can help.
Oh, and by the way, if you’re still reading this, welcome to my new Substack! I plan to write about things I discover while training models, the models/datasets I make, and occasionally the latest and greatest in prompt engineering. Since 1) data quality is everything, and 2) a lot of open source relies on AI to make its data, I think prompt engineering is probably one of the most important skills a model creator can have: it determines your dataset’s quality, which in turn determines the model. So I want to share what I learn about it with other people as I progress. Whenever I release a major new model or dataset, I’ll also walk through the thought process behind it, how it was created, and so on, so that others can build on it, use it, or be inspired by it more easily. That’s the plan. I know no plan survives contact with the enemy, but I hope you’ll join me all the same as I figure this thing out!
Also, feel free to drop a comment or two on this post and tell me what you think of AI-augmented human data! I’m especially curious if anyone can tell me if this is new or not. The small handful of people on TheBloke’s Discord server who know about me seem to be treating my augmented-data model, MythoMakise, as a novel approach, but I can’t imagine that no one’s thought of this already.
This is such an awesome post. I have been looking for a guide to finetuning. Unfortunately, most guides focus on the technicalities of finetuning, but no one talks about the structure of the data. Your writeup is really valuable!
Hey, thank you for explaining the details! While reading the article, a question arose: does the training method allow starting as if the model already had some of its context loaded? Come to think of it, there’s probably no need at all to make an AI model learn how to properly start a conversation with an empty context. Even more, an RP or chat model may not need to learn how to write a prompt for itself at all: it should work well inside a prompt, but not learn to be the one who writes prompts.
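If I’m reading this question right, it’s describing loss masking, and that is indeed a common practice: during finetuning, the prompt/context tokens are fed to the model but excluded from the loss, so the model is conditioned on the context without ever being trained to write it. Here’s a minimal sketch of the idea, assuming Hugging Face/PyTorch conventions; the helper function is my own illustration:

import torch

def build_example(prompt_ids: list[int], reply_ids: list[int]) -> dict:
    """Concatenate prompt and reply, masking the prompt out of the loss."""
    input_ids = prompt_ids + reply_ids
    # -100 is the ignore_index that PyTorch's cross-entropy loss skips, so
    # the model sees the prompt but is never trained to reproduce it
    labels = [-100] * len(prompt_ids) + reply_ids
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }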