• keepthepace@slrpnk.net · 4 months ago

    It is called finetuning. I haven’t tried it, but oobabooga’s text-generation-webui has a tab to do it, and I believe it is pretty straightforward.

    Fine-tune a base model on your dataset, and then you will need to format your prompt the way your AIM logs are organized, e.g. you will need to add “<ch00f>” at the end of your text completion task. It will complete it in the way it learnt it.

    If you don’t have the GPU for it, many companies offer fine-tuning as a service, like Mistral.
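
    Concretely, the data-prep and prompt-formatting part might look something like this minimal sketch, assuming the exported logs use a simple “nick: message” line format. The file paths, the regex, and the exact tags are placeholders, not something text-generation-webui or Mistral’s service requires:

    ```python
    # Rough sketch: turn exported AIM logs into plain-text training samples for a
    # completion-style fine-tune. The "nick: message" line format assumed here is
    # a guess; adjust the parsing to however your logs are actually laid out.
    import re
    from pathlib import Path

    LOG_LINE = re.compile(r"^(?P<nick>[^:]+): (?P<msg>.*)$")

    def logs_to_training_text(log_path: str, out_path: str) -> None:
        samples = []
        for raw in Path(log_path).read_text(encoding="utf-8", errors="ignore").splitlines():
            m = LOG_LINE.match(raw.strip())
            if m:
                # Keep a "<nick> message" shape so the model learns who says what.
                samples.append(f"<{m.group('nick')}> {m.group('msg')}")
        Path(out_path).write_text("\n".join(samples), encoding="utf-8")

    # At inference time, prompt with the same structure and end on your own tag,
    # so the model completes the next line "as you":
    #   prompt = "<friend> you around tonight?\n<ch00f>"
    ```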

  • PerogiBoi@lemmy.ca · 4 months ago

    Why would you want this??? Anything I wrote from 16 years ago is so beyond cringey. You must have been a stellar kid.

    • corsicanguppy@lemmy.ca · 4 months ago

      I have 26 years of saved outgoing email.

      Recently I needed to redo a fix I learned about in 1998 and implemented back then. I implemented it again to install a crappy software project that, judging from its composition, canNOT have been written before the post-Y2K firing of so many mentors.

      I only remembered it after 3 hours of searching, saving myself another few hours and surely a nervous breakdown. But after filtering AD on the client end, the project installed easily.

      That’s the best example, but the things I don’t discover I already answered on Stack Overflow, I discover I answered years ago in email.

  • will_a113@lemmy.ml · 4 months ago

    Putting aside why you’d want to do this, it’d be pretty easy, actually. You’d still use a big model like GPT-4 or Claude as your “base”, but you would do two things:

    • Give it a knowledge base using your conversations. You can manually vectorize them into a vector database like Pinecone and build yourself an agent using a toolchain like LangChain, or just use a service (OpenAI Agents lets you upload data from your browser); there’s a rough sketch of this step after the list
    • Have one of the big LLMs (with a large context size) ingest all of those conversations and build out a prompt that describes “you”
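
    Roughly, that first bullet could look something like this in Python; the model name, the chunking, and the plain numpy similarity search are just stand-ins for what Pinecone or a LangChain retriever would handle for you:

    ```python
    # Rough sketch of the "knowledge base" step: embed each old conversation chunk
    # and do a naive in-memory similarity search. A real setup would push these
    # vectors into something like Pinecone (or wrap this in a LangChain retriever).
    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def embed(texts: list[str]) -> np.ndarray:
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    # Your AIM logs, split into chunks small enough to embed individually.
    conversations = [
        "<ch00f> did you finish the physics homework ...",
        "<ch00f> lol that movie was terrible ...",
    ]
    doc_vectors = embed(conversations)

    def retrieve(query: str, k: int = 3) -> list[str]:
        q = embed([query])[0]
        sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
        return [conversations[i] for i in np.argsort(sims)[::-1][:k]]
    ```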

    You would then:

    • Feed that generated prompt (with your own edits, of course) back into either your custom LangChain agent or OpenAI Agent; see the sketch below
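
    And the last two steps might look roughly like this (the model name and prompt wording are guesses, and retrieve() is the helper from the sketch above):

    ```python
    # Rough sketch: have a large-context model draft a persona prompt from the old
    # logs, then reuse it (plus retrieved snippets) on every new message.
    from openai import OpenAI

    client = OpenAI()

    def build_persona_prompt(all_logs: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "Summarize this person's voice, phrasing, and interests "
                            "as a reusable system prompt written in the second person."},
                {"role": "user", "content": all_logs},
            ],
        )
        return resp.choices[0].message.content  # hand-edit this before using it

    def chat_as_me(persona_prompt: str, user_msg: str) -> str:
        context = "\n".join(retrieve(user_msg))  # retrieve() from the sketch above
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": persona_prompt + "\n\nRelevant old conversations:\n" + context},
                {"role": "user", "content": user_msg},
            ],
        )
        return resp.choices[0].message.content
    ```
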
    • ch00f@lemmy.worldOP · 4 months ago

      Because I communicated with a lot of people over AIM? It’s actually more than just high school; it covers 2004 to around 2012. Also, it’s 64 MB zipped. The actual size is much larger.