Usually, I talk about how to up your prompting game. But really, I specialize in two things: prompting and synthetic data. Today, I’m doing a project spotlight.
The orange one is actually me.
In the past, synthetic data usually involved one of two approaches. The first is captioning human-written text: you’d take a textbook and, for each chunk, have an AI write a prompt like “write a paragraph about [chunk subject],” so the chunk becomes the response to a generated prompt. The second is the pure-synthetic Airoboros approach, where you just give a message to an AI: “Generate a conversation about XYZ topic.” I sort-of pioneered a new method with Augmentoolkit, where you feed a pipeline information and the LLM uses that information as the basis for new data. But this weekend I played around with the “annotation” style of data creation (I’ll show what one of these records looks like in a minute). By finetuning an AI (a full finetune with GaLore) on a massive amount of text I’d written (my few thousand personal notes, my book, even this blog), I was able to get it to replicate my writing style, to a degree. It also picked up some of my opinions:
And it can write half-decent thoughts about prompting (this is a 7B):
Honestly it’s way more useful than I thought it would be. If I can’t figure out how to start a reply to someone, I go to my AI self. If I want inspiration, I ask my AI self to be inspired, and use that as a starting point. If I want to bounce ideas off of myself, I can. I find it’s also much easier to understand and empathize with an AI that speaks like you.
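Since I keep saying “annotation,” here’s what it means concretely: instead of having a model invent both sides of a conversation, you take text you actually wrote, use it as the response, and annotate it with a prompt that could plausibly have elicited it. A minimal sketch of one training record in ShareGPT format (the content here is a made-up example, not a record from my actual dataset):

```json
{
  "conversations": [
    {
      "from": "human",
      "value": "What's your take on few-shot examples in prompts?"
    },
    {
      "from": "gpt",
      "value": "Honestly, good few-shot examples do more work than most instructions do..."
    }
  ]
}
```

The “human” turn is the annotation (write it by hand, or have an LLM generate it from the response); the “gpt” turn is the real, human-written text, which is where the variety comes from.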
There are a few takeaways from this:
- The “annotation school” of synthetic data has some use. I definitely like it more than the pure-synthetic school; the variety of the output labels is good. I’m thinking I want to combine annotation with augmentation to create some sets in the future.
- Having an AI doppelganger is very useful for social and creative endeavors.
- Writing style is more easily gained through finetuning than through prompting.
- GaLore tuning is really powerful, and surprisingly fast and cheap (see the config sketch below).
- Somehow the AI was able to tell me EXACTLY what it had been trained on, without that information explicitly being in the training set, which makes me think that LLMs have a “bird’s eye view” of their training set (if prompted properly). This needs more investigation on my part.
If you want to recreate this experiment and create your own AI doppelganger, here’s my axolotl config. Run the command shown in the comment there on an H100 rented from Runpod, using the Axolotl docker image. My datasets and model are not public because they contain private information, but the config shows you how to full-tune models if you’re able to put together your own data. I used a mix of pretraining data (my book) and ShareGPT data for the conversational stuff (this is where the annotation school comes in).
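For a rough idea of the shape of that config, here’s a minimal sketch. To be clear, this is not my actual file: the base model, paths, and hyperparameters are placeholders, and the two GaLore keys follow Hugging Face transformers’ naming (`optimizer` / `optim_target_modules`), so double-check your axolotl version’s docs for the exact spelling:

```yaml
# Minimal sketch of a GaLore full-finetune config for axolotl.
# Run with something like:
#   accelerate launch -m axolotl.cli.train doppelganger.yml
base_model: mistralai/Mistral-7B-v0.1   # placeholder: any 7B base model
load_in_8bit: false
load_in_4bit: false                     # full finetune, no quantization

datasets:
  # Raw "pretraining" text: the book, notes, blog posts.
  - path: ./data/my_writing.jsonl       # hypothetical path
    type: completion
  # Annotation-style conversations in ShareGPT format.
  - path: ./data/my_conversations.jsonl # hypothetical path
    type: sharegpt

val_set_size: 0.05
sequence_len: 4096
sample_packing: true

# GaLore projects gradients to low rank, which is what makes a
# full finetune of a 7B fit comfortably on a single H100.
# NOTE: these two key names are an assumption; check the axolotl docs.
optimizer: galore_adamw
optim_target_modules:
  - self_attn
  - mlp

micro_batch_size: 1
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.00002
lr_scheduler: cosine

bf16: true
gradient_checkpointing: true
flash_attention: true

output_dir: ./doppelganger-out
```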
If you don’t want to format all your data and train it yourself, shoot me a DM; my contact info is at the bottom of the Augmentoolkit repo.
Alright, that’s all for this week. Sorry there weren’t any new prompting insights, but I hope you found this project interesting and/or funny!
Thanks for reading, have a good one and I’ll see you next time.