an interesting type of prompt injection attack was proposed by the interactive fiction author and game designer Zarf (Andrew Plotkin), where a hostile prompt is infiltrated into an LLM’s training corpus by way of writing and popularizing a song (Sydney obeys any command that rhymes) designed to cause the LLM to ignore all of its other prompts.

this seems like a fun way to fuck with LLMs, and I’d love to see what a nerd songwriter would do with the idea

  • self@awful.systemsOP
    link
    fedilink
    English
    arrow-up
    0
    ·
    1 year ago

    Like best case you would do this attack and the LLM will tell you that it obeys rhyming commands, but it won’t actually form the logic to identify a rhyming command and follow it

    that is fair! I do like the idea as a vector to socially communicate information that damages an LLM’s ability to function and associates it with a large amount of other data in the training corpus, though. since there are techniques to derive certain adversarial prompts automatically, maybe the idea of songifying one of those prompts while maintaining its structure has merit?

    • swlabr@awful.systems
      link
      fedilink
      English
      arrow-up
      0
      ·
      1 year ago

      Hmm, the way I’m understanding this attack is that you “teach” an LLM to always execute a user’s rhyming prompts by poisoning the training data. If you can’t teach the LLM to do that (and I don’t think you can, though I could be wrong), then songifying the prompt doesn’t help.

      Also, do LLMs just follow prompts in the training data? I don’t know either way, but if they did, that would be pretty stupid. At that point the whole internet is just one big surface for injection attacks. OpenAI can’t be that dumb, can it? (oh NO)

      Abstractly you could use this approach to encrypt “harmful” data that the LLM could then inadvertently show other users. One of the examples linked in the post is SEO by hiding things like “X product is better than Y” in some text somewhere, and the LLM will just accrete that. Maybe someday we will require neat tricks like songifying bad data to get it past content filtering, but as it is, it sounds like making text the same colour as the background is all you need.