Being kind doesn’t necessarily mean being nice.
One of the first things I needed to do after creating a Lemmy instance was block all the AI bots from scraping everything. It only took a day before ChatGPT, AmazonBot, and Alibaba came crawling through. Alibaba was the most annoying since it didn’t even identify itself properly: as mentioned in the article, its crawler presents itself as an Edge browser.
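For anyone setting up something similar, here’s a rough sketch of what that blocking can look like at the reverse proxy, assuming the instance sits behind nginx. The user-agent patterns are just the crawlers named above that actually identify themselves; the Alibaba one, which spoofs an Edge browser UA, can’t be caught this way and needs IP-range blocking instead.

```nginx
# In the http {} block: flag self-identifying AI crawlers by user agent.
map $http_user_agent $ai_bot {
    default          0;
    ~*GPTBot         1;   # OpenAI's training crawler
    ~*ChatGPT-User   1;   # OpenAI's on-demand fetcher
    ~*Amazonbot      1;   # Amazon's crawler
}

# In the server {} block for the instance:
server {
    # ... listen / server_name / proxy config as before ...
    if ($ai_bot) {
        return 444;   # nginx-specific: close the connection with no response
    }
}
```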
This comment seems ironic given the username.
This is a good start: https://github.com/wchill/revanced-patches
I’m using Boost for Reddit and Boost for Lemmy. Boost for Reddit requires several patches to keep working, though, which puts it out of reach for most people.
I think using LLMs with RAG (i.e., giving the model tools) is more useful and reliable than relying only on training data, which the model can at best approximate.
For example: use a search engine to find results for a query, download the first ten results as text, and have the LLM answer follow-up questions against those sources. Another example is uploading a document and having the LLM answer questions about its contents.
This is also advantageous because much smaller and quicker models can be used while still producing accurate results (often with citations to the source).
This can even be self-hosted with Open WebUI/ollama.
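As a rough illustration of that retrieve-then-answer flow in a self-hosted setup, here’s a minimal sketch against ollama’s HTTP API. It assumes a local ollama server on its default port with a model already pulled; the model name, URL, and prompt wording are placeholders, and a real pipeline would also run the search step and strip HTML before feeding the text in.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default endpoint


def fetch_text(url: str) -> str:
    """Download a page and return its raw text (real use would strip HTML)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def ask_with_context(question: str, context: str, model: str = "llama3.2") -> str:
    """Ask a local model a question, grounded in the retrieved source text."""
    prompt = (
        "Answer the question using only the source text below, "
        "and cite the source when you do.\n\n"
        f"SOURCE:\n{context}\n\nQUESTION: {question}"
    )
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]


# Retrieve first, then answer from the retrieved text (truncated to fit context).
source = fetch_text("https://example.com/article")
print(ask_with_context("What is the article's main claim?", source[:8000]))
```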