Evidence for the DDoS attack that bigtech LLM scrapers actually are.

  • Jayjader@jlai.lu
    28 days ago

    How feasible is it to configure my server to essentially perform a reverse Slowloris attack on these LLM bots?

    If they won’t play nice, then we need to reflect their behavior back onto themselves.

    Or perhaps serve a 404, 304, or some other legitimate-looking static response that minimizes load on my server whilst giving them the least amount of data to train on.
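
To make the two ideas in this comment concrete, here is a minimal Python sketch, not a hardened implementation. The names `drip` and `minimal_404` are hypothetical helpers invented for illustration, and the chunk size and delay are arbitrary assumptions. Note that holding a connection open also ties up a socket on your own server, so the slow-drip approach has a cost on both sides.

```python
import time
from typing import Callable, Iterator


def drip(body: bytes, chunk_size: int = 1, delay: float = 5.0,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[bytes]:
    """Yield `body` in tiny chunks, pausing before each one.

    Hypothetical helper: a server would write each chunk to the socket,
    keeping the bot's connection busy for len(body) / chunk_size * delay seconds.
    """
    for start in range(0, len(body), chunk_size):
        sleep(delay)  # the pause is what slows the client down
        yield body[start:start + chunk_size]


def minimal_404() -> bytes:
    """Return a complete, body-less HTTP/1.1 response (the cheap alternative)."""
    return (b"HTTP/1.1 404 Not Found\r\n"
            b"Content-Length: 0\r\n"
            b"Connection: close\r\n\r\n")
```

The `sleep` parameter is injectable only so the pacing can be checked without actually waiting.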

    • raoul@lemmy.sdf.org
      28 days ago

      One way to game this kind of bot is to add a hidden link to a randomly generated page, which itself contains a link to another random page, and so on. The bots will still consume resources, but they will be stuck parsing random garbage indefinitely.

      I know there is a website that does this, but I forget its name.
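
The link-maze idea could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the `maze_page` function, the `/maze/` URL prefix, and the page layout are hypothetical. The page is derived from a hash of the requested path, so nothing has to be stored and the same URL always returns the same garbage.

```python
import hashlib
import random


def maze_page(path: str, n_links: int = 3, n_words: int = 40) -> str:
    """Build a page of random filler whose links lead to more such pages."""
    # Seed the generator from the path so each URL is stable across requests.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    words = " ".join(
        "".join(rng.choices(letters, k=rng.randint(3, 9)))
        for _ in range(n_words)
    )
    # Each link points at another path under the (assumed) /maze/ prefix.
    links = "".join(
        f'<a href="/maze/{rng.getrandbits(64):016x}">more</a>'
        for _ in range(n_links)
    )
    return f"<html><body><p>{words}</p>{links}</body></html>"
```

Only the entry link on a real page would need to be hidden from human visitors; the generated pages themselves are never shown to them.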

    • raoul@lemmy.sdf.org
      28 days ago

      The only simple options are:

      • robots.txt
      • rate limiting by IP
      • blocking by user agent
      • blocking by user agent
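
For reference, the last two items in the list above might look something like this in Python. This is a sketch under stated assumptions: `SlidingWindowLimiter`, `ua_blocked`, and the thresholds are hypothetical, and the user-agent tokens are only illustrative examples of crawler names. In practice this kind of check usually lives in the web server or reverse proxy rather than in application code.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `max_requests` per IP within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


# Illustrative tokens only (assumption); a real list needs regular updates.
BLOCKED_UA_TOKENS = ("gptbot", "ccbot", "bytespider")


def ua_blocked(user_agent: str) -> bool:
    """Case-insensitive substring match against the token list."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)
```

Both checks key on values the client controls or can rotate (source IP, User-Agent header), which is why they are easy to write and easy to evade.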

      According to the article, they bypass all of them:

      They also don’t give a single flying fuck about robots.txt …

      If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

      It then becomes a game of whack-a-mole with big tech 😓

      What infuriates me most is that it’s done by the big names, not some random startup.

      • jherazob@fedia.io
        28 days ago

        I do believe there are blocklists for their IPs out there; that should mitigate things a little.
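
Assuming such a blocklist comes as plain text with one CIDR range per line (the format is an assumption, and the ranges in the usage example are reserved documentation addresses, not real crawler IPs), checking a client against it could be sketched in Python with the standard `ipaddress` module. `load_blocklist` and `is_blocked` are hypothetical names.

```python
import ipaddress
from typing import Iterable, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def load_blocklist(lines: Iterable[str]) -> List[Network]:
    """Parse CIDR entries, skipping blank lines and '#' comments."""
    networks = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            # strict=False tolerates entries with host bits set.
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(ip: str, networks: List[Network]) -> bool:
    """Return True if `ip` falls inside any listed range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

Usage, with placeholder ranges:

```python
nets = load_blocklist(["192.0.2.0/24  # example range", "", "2001:db8::/32"])
is_blocked("192.0.2.77", nets)    # True
is_blocked("198.51.100.1", nets)  # False
```

A linear scan is fine for a few hundred ranges; a much larger list would call for a prefix tree or a firewall-level set instead.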