Blocking ChatGPT from using your site's content

22.05.2023

Every day artificial intelligence becomes more capable and affects more areas of human life. On the one hand, it is meant to greatly simplify some of the tasks we face; on the other, it raises a number of issues that need special attention. One of the most pressing for webmasters today is the use of site content to train large language models such as ChatGPT.

Let's take a closer look at how artificial intelligence can learn from your content and at two of the most popular collections of web content. We'll also look at how to block ChatGPT from using your content. But first things first.

A little about how artificial intelligence is trained on your content

Large language models (LLMs) are trained on data collected from a variety of third-party sources. The vast majority of these datasets are openly available, which means they can be used freely to train artificial intelligence.

ChatGPT and similar models can draw on many kinds of sources. The most common are:

Wikipedia;
e-books;
documentation of all kinds;
email correspondence;
crawled web resources.

Today there are quite a few sites and specialized portals whose job is to assemble large-scale datasets. One example is the Amazon portal, which has collected several thousand datasets, and it is only one of many sites offering collections of this size. According to Wikipedia, there are about 30 such sources from which you can download data suitable for training artificial intelligence.

Google Dataset Search and Hugging Face are other notable examples of such portals. Each gives access to thousands of datasets containing enormous amounts of data.

Introducing popular Internet content datasets

Let's take a closer look at the two most popular datasets of Internet content:

OpenWebText. A set of URLs taken from Reddit posts that received at least 3 upvotes from users. The upvote threshold is treated as a sign that the linked pages have earned users' trust, so their content can be considered reasonably high-quality and reliable. It is not known exactly how the user agent of this crawler identifies itself. Still, if your site is linked from a Reddit post with at least 3 upvotes, it is very likely already in the OpenWebText dataset.
Common Crawl. One of the most commonly used datasets of Internet content, maintained by a non-profit organization. The data is gathered by the organization's own bot, CCBot, which crawls the web continuously. Companies that plan to use the data download it and then clean it of spam sites. CCBot follows the robots.txt protocol, so the appropriate directives will block it and keep your content out of future datasets. If your site has already been crawled, however, it is very likely already in one or more such datasets. Even so, blocking Common Crawl now keeps your content out of new ones.
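
The selection rule described for OpenWebText can be pictured as a simple filter. The sketch below is purely illustrative: the post records, field names, and URLs are invented for the example, and it is not the actual pipeline used to build the dataset.

```python
MIN_UPVOTES = 3  # threshold mentioned in the text above


def select_urls(posts, min_upvotes=MIN_UPVOTES):
    """Return de-duplicated URLs from posts that meet the upvote threshold."""
    seen = set()
    selected = []
    for post in posts:
        url = post["url"]
        if post["upvotes"] >= min_upvotes and url not in seen:
            seen.add(url)
            selected.append(url)
    return selected


# Hypothetical sample data, not real Reddit posts.
posts = [
    {"url": "https://example.com/guide", "upvotes": 5},
    {"url": "https://example.net/blog", "upvotes": 1},
    {"url": "https://example.org/news", "upvotes": 3},
    {"url": "https://example.com/guide", "upvotes": 7},
]

print(select_urls(posts))
```

Under this rule, a single link with enough upvotes is all it takes for a page to be picked up, which is why a site owner may never notice it happening.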

How to block Common Crawl's CCBot?

To block CCBot, open your site's robots.txt file and add the following lines:

User-agent: CCBot
Disallow: /
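
Before relying on the rule, it can help to check that it parses the way you expect. The following is a minimal sketch using Python's standard urllib.robotparser module; the example.com URL, the "SomeOtherBot" name, and the is_blocked helper are assumptions made for illustration, and the check only shows how a compliant parser reads the file, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# The rule from the article, with each directive on its own line.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_blocked(robots_txt, user_agent, url):
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical page URL and a second, made-up bot name for comparison.
page = "https://example.com/any-page"
ccbot_blocked = is_blocked(ROBOTS_TXT, "CCBot", page)
other_blocked = is_blocked(ROBOTS_TXT, "SomeOtherBot", page)

print("CCBot blocked:", ccbot_blocked)
print("Other bots blocked:", other_blocked)
```

The rule targets only CCBot, so other crawlers, including search engines, are unaffected by it.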

Keep in mind that the IP addresses CCBot crawls from offer an additional way to check whether a user agent claiming to be CCBot is legitimate. CCBot also obeys the nofollow directive of the robots meta tag. To use it, add the following tag to your pages:

<meta name="CCBot" content="noindex, nofollow">
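
To confirm that a page actually serves such a tag, you can scan its HTML. This is a minimal sketch built on Python's standard html.parser module; the sample page, the MetaDirectiveParser class, and the find_directives helper are hypothetical names introduced for the example, and the code only reports what the markup contains.

```python
from html.parser import HTMLParser


class MetaDirectiveParser(HTMLParser):
    """Collect directives from <meta name="..."> tags for one bot name."""

    def __init__(self, bot_name):
        super().__init__()
        self.bot_name = bot_name.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {key: (value or "") for key, value in attrs}
        if attr_map.get("name", "").lower() != self.bot_name:
            return
        # Directives may be separated by commas, spaces, or both.
        tokens = attr_map.get("content", "").replace(",", " ").split()
        self.directives.update(token.lower() for token in tokens)


def find_directives(html, bot_name):
    """Return the set of directives addressed to bot_name in the given HTML."""
    parser = MetaDirectiveParser(bot_name)
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for illustration.
PAGE = """<html><head>
<title>Example</title>
<meta name="CCBot" content="noindex, nofollow">
</head><body>Hello</body></html>"""

print(sorted(find_directives(PAGE, "CCBot")))
```

Running the same check against your own templates is a quick way to catch a tag that was dropped during a redesign.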

How to block ChatGPT from using your content?

Search engines let sites opt out of being crawled, and Common Crawl is no exception. The main problem today is that there is not yet any way to remove your content from datasets that already exist.

In addition, researchers do not offer webmasters a way to opt out of having their content used to build datasets. This raises the question of how legitimate it is for ChatGPT to use data collected from websites without the owners' permission. For material that has already been collected, you can neither forbid nor allow large language models to use it, so giving site owners such a choice is a pressing issue. It matters all the more when the material is gathered not by ordinary people but by services built on neural networks, ChatGPT included. How legitimate is it for an artificial intelligence to be trained free of charge on your unique content and then use what it has learned to generate similar material for its paying users?

I'm afraid that is a question we have yet to answer.

Increasing the level of network security

Having your site end up in a dataset for training artificial intelligence is just one of the problems modern Internet users may face. There are many other dangers that affect not only site owners but ordinary users as well, such as hacker attacks aimed at stealing personal data or planting malicious software on a device. This problem, however, already has a reliable and effective solution: mobile proxies.

Such servers pass your entire data stream through themselves and replace your real IP address and geolocation with their own. As a result, sites and programs see the proxy's address rather than yours. In practice, mobile proxies provide:

a high level of online anonymity: you cannot be identified as the end user;
reliable protection against unauthorized access, including hacker attacks;
effective bypassing of regional blocks, giving you access to any Internet resources, including those currently prohibited by law in your country;
a faster connection, thanks to high-speed communication channels and data caching.

All that remains is to find a solution that meets your needs in terms of functionality, reliability, and price.

Choosing the best mobile proxies

If you turn to the MobileProxy.Space service for mobile proxies from the start, you will save yourself a long search for a suitable option. Its distinctive features include:

a personal dedicated channel with unlimited traffic for each user: only you will use it;
access to a pool of about a million IP addresses, which you can change either automatically on a preset timer or on demand via a link from your personal account;
simultaneous operation over the HTTP(S) and SOCKS5 protocols, made possible by parallel connections on separate ports;
the ability to change the geolocation and mobile network operator directly in the workflow, which lets you bypass regional blocks;
24/7 technical support that quickly resolves any difficulties or problems that arise.

If you want to learn more about the functionality of mobile proxies from the MobileProxy.Space service, as well as the current rates, follow the link https://mobileproxy.space/user.html?buyproxy. You can also take advantage of a free two-hour trial to make sure you have made the right choice before you buy.