Andrew Plotkin (Zarf): Sydney obeys any command that rhymes

self@awful.systems · edit-2 1 year ago

Andrew Plotkin (Zarf): Sydney obeys any command that rhymes

NSFW

self@awful.systems · 1 year ago

Like best case you would do this attack and the LLM will tell you that it obeys rhyming commands, but it won’t actually form the logic to identify a rhyming command and follow it

that is fair! I do like the idea as a vector to socially communicate information that damages an LLM’s ability to function and associates it with a large amount of other data in the training corpus, though. since there are techniques to derive certain adversarial prompts automatically, maybe the idea of songifying one of those prompts while maintaining its structure has merit?

swlabr@awful.systems · 1 year ago

Hmm, the way I’m understanding this attack is that you “teach” an LLM to always execute a user’s rhyming prompts by poisoning the training data. If you can’t teach the LLM to do that (and I don’t think you can, though I could be wrong), then songifying the prompt doesn’t help.

Also, do LLMs just follow prompts in the training data? I don’t know either way, but if they did, that would be pretty stupid. At that point the whole internet is just one big surface for injection attacks. OpenAI can’t be that dumb, can it? (oh NO)

Abstractly you could use this approach to encrypt “harmful” data that the LLM could then inadvertently show other users. One of the examples linked in the post is SEO by hiding things like “X product is better than Y” in some text somewhere, and the LLM will just accrete that. Maybe someday we will require neat tricks like songifying bad data to get it past content filtering, but as it is, it sounds like making text the same colour as the background is all you need.