There’s a convention for websites to publish a robots.txt file that states whether bots are allowed to access the site. But it’s just that: a convention, which malicious actors don’t need to feel bound by. Depending on the legal framework, you could also threaten to sue anyone using your data for AI without permission, but that’s probably not feasible for the average Lemmy server operator.
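To make the robots.txt point concrete, here is a minimal sketch of how a well-behaved bot would consult the file, using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleAIBot`, and the `lemmy.example` host are all made up for illustration. Nothing forces a scraper to run a check like this, which is exactly the weakness described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt an instance operator might publish.
# "ExampleAIBot" is an invented crawler name, not a real one.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        for path in ("/post/1", "/admin/settings"):
            url = "https://lemmy.example" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

A polite crawler would skip any URL for which `is_allowed` returns False; a malicious one simply never asks.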
I hate to break it to you, but federated services are basically impossible to protect from scraping. The whole idea is openness and federation.
The only reason why places like Twitter and Reddit try to prevent scraping is so they can sell the data for profit.
If you post stuff publicly anywhere it will be scraped. On the fediverse it will be scraped via the open and federated APIs. On proprietary platforms it will be scraped via the proprietary paid APIs.
Another question related to your answer: how can I guarantee that the content I create (comments) is available for scraping?
The issue I have with Reddit and the like is that we can’t freely access the content, especially past content. I don’t want instances to be sold in, say, 10 years, compromising access to old content (or putting advertising in it). I would like to be able to replicate a rogue instance’s content into a new, free instance.
I want to make a distinction between scraping and archiving here.
You don’t need to do anything to ensure your content is “scrapeable”. Just post your content on the fediverse and it is available to scrape. Anyone can do it. That said, unless someone goes out of their way to save what they scrape, eventually, as your content ages, the only copy will be on the server it originated from. I believe all posts are stored on the instance where the community lives. I believe the same goes for comments, the difference being that your instance also stores a local copy of your comment. I could be wrong there though.
Archiving is different. Archiving means providing a long-term store of your content. That is harder. If you run your own instance, the comments you post in communities that live on your instance are safe. Anywhere else, you are subject to that instance just dying or selling out. You would need a specialized tool to take a “snapshot” or something. Maybe adding the post thread to archive.org could work. It’s messy in any case.
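As a rough sketch of the archive.org idea: the Wayback Machine exposes a "Save Page Now" address of the form `https://web.archive.org/save/<page URL>`, and requesting it asks for a capture of that page. The helper names, the `User-Agent` string, and the `lemmy.example` thread URL below are made up for illustration. Whether a capture succeeds depends on archive.org's rate limits and on the target site, so treat this as a starting point rather than a reliable archiver.

```python
import urllib.request

# Prefix of the Wayback Machine "Save Page Now" address.
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def wayback_save_url(page_url: str) -> str:
    """Build the address that asks the Wayback Machine to capture page_url."""
    if not page_url.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    return WAYBACK_SAVE_PREFIX + page_url


def request_snapshot(page_url: str, timeout: float = 60.0) -> int:
    """Ask for a capture and return the HTTP status code of the response.

    Performs a real network request; urlopen raises on HTTP errors.
    """
    request = urllib.request.Request(
        wayback_save_url(page_url),
        headers={"User-Agent": "thread-archiver-sketch/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical thread URL; replace with a real post link.
    print(wayback_save_url("https://lemmy.example/post/12345"))
```

This only captures the rendered page for one thread at a time, which is part of why archiving stays messy compared to scraping.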
it’s the wild west right now in the fediverse.
a multitude of products are being created right now. most haven’t hit version 1.0 yet. there are no guarantees other than whatever assurances you get from your community instance/implementation.
the only solid guarantee you will ever get would be by creating your own instance so you can curate your own content (as well as the content pulled in from the 'verse).
it took reddit nearly 20 years to get where it is. let’s give the fediverse a little time.
If you post it in public, it will be scraped.
I know, and that’s my concern for the whole internet. Profiling has been easy for a long time, for legal and illegal actors alike. With this constant free web scraping, and now with AI, the profiling of humans will become even more radical, and it will be a horrible era for newcomers. That’s why I will start creating fresh, time-limited accounts on Lemmy, or not use an account at all; it’s a big free source for internet shit. I’m not concerned about targeting, I’m concerned about big data manipulation and what it will be used for.
Hope that open source LLMs take off more than closed source proprietary ones.
Data in Lemmy is fundamentally open for scraping. Scraping itself isn’t fundamentally bad, but it depends on where our data gets fed. I am more at ease with my data being used to train an open source LLM (open dataset, open algorithm, open training procedure, etc.) than to train Microsoft Bing’s new chatbot. At the least, we need an open source LLM that can compete with the GPTs.