LLM services: how safe are they to use?

The appearance of large language models (LLMs) on the market several years ago amounted to a genuine revolution in many technology fields. These advanced AI-based tools can significantly simplify and accelerate many kinds of work, including the creation of text and graphic content, the development of program code, and more. The capabilities of modern neural networks are very broad, which individual specialists and representatives of small, medium, and even large businesses have already come to appreciate.

In the time that LLMs have been on the market, they have evolved from an experimental technology into a tool used almost daily. Text content, including e-mail newsletters, is prepared with the help of ChatGPT; Claude copes well with sorting and systematizing documentation. One could list the capabilities of LLMs in general, and of individual tools in particular, endlessly, especially since they are constantly expanding and improving. But is everything really as rosy as it seems?

Unfortunately, many of those who use large language models in practice do not think about security at all. Have you personally considered where your information goes when you interact with artificial intelligence? Does the correspondence remain confidential, or will it almost instantly end up in someone else's database? If you use LLMs in your company's work, will such models keep your corporate secrets? Finding answers to these questions is quite difficult, because not everything is as clear as it may seem at first glance.

The security of working with LLM services today remains an open question, influenced by many factors. Much depends on the privacy policy a particular company currently applies. Practice shows that some brands give their users full control over their information, while others simply monetize the data. Some systems let you disable model training on your dialogues, while others provide no such option at all.

Today's review is devoted to the nuances of security when working with large language models. First, we will look in more detail at what LLMs are, how they work, and what tasks modern businesses use them for. We will then discuss what happens to user data in today's most popular large language models, offer recommendations for building a corporate policy on the use of artificial intelligence, divide data into separate groups and suggest which of them can be entrusted to an LLM, describe solutions that provide a reasonably good level of protection when working with large language models in a business setting, and raise the issue of protecting intellectual property.

Now, let's talk about all this in order.

What are large language models?

LLMs, that is, large language models, are systems that have undergone deep learning on an impressive amount of data. They are based on neural networks built around an encoder, a decoder, and a self-attention mechanism. With these components, the model extracts meaning from a text sequence and analyzes the words and phrases it contains.

One of the most significant advantages of LLMs is that they are able to learn without additional supervision from a person: they independently pick up elementary grammar, languages, and knowledge. The architecture of large language models processes large-scale data sequences in parallel, and this is one of the key differences from the earlier RNN (recurrent neural network) technology, where input data was processed sequentially. As a result, specialists working in data processing can now use powerful graphics processors to train transformer-based LLMs, which minimizes the time and effort required for training.

Transformer LLMs can employ models with hundreds of billions of parameters, drawing information both directly from the Internet and from specialized sources. At the same time, they stand out for their flexibility, which allows them to perform a wide variety of tasks, including answering questions, translating into other languages, compiling technical and commercial proposals, and more. LLMs directly influence content creation and how audiences interact with search engines and virtual assistants. With a fairly small amount of input data and a few hints, modern neural networks can give fairly accurate answers and forecasts, create content in natural language, and more.

To say that LLMs are large is to say nothing: they are enormous and can take billions of parameters into account simultaneously. This is what allows them to be applied to such a huge range of problems. Here are just a few examples of how large-scale and capable modern language models are:

  1. OpenAI GPT-3 and Anthropic Claude 2. GPT-3 works with 175 billion parameters; ChatGPT, built on this family of models, identifies patterns in data and produces natural-language output that is easy to read. It is hard to say exactly how big Claude 2 is, but it is known to accept up to 100 thousand tokens of input per request and to process hundreds of pages of books and technical documentation (see the token-counting sketch after this list).
  2. Cohere Command and Jurassic-1 from AI21 Labs. These models work with up to roughly 180 billion parameters in more than 100 languages, and are distinguished by broad conversational capabilities and an impressive vocabulary of over 250,000 words.
  3. LLMs from LightOn's Paradigm platform. It offers a number of foundation models whose capabilities are claimed to exceed those of GPT-3. Notably, all these LLMs ship with an API, which greatly simplifies developers' work and lets them build their own solutions on top of the generative networks.
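
Since context limits like Claude 2's 100 thousand tokens are measured in tokens rather than words, it helps to see how text maps onto tokens. Below is a minimal sketch assuming the open-source tiktoken library, which ships tokenizers for OpenAI models; other vendors use different tokenizers, so the counts are only indicative for them.

```python
# Rough illustration: request size is measured in tokens, not words.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by recent OpenAI models

text = "Large language models measure input length in tokens, not words."
tokens = encoding.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
# A 100,000-token window (the Claude 2 figure cited above) therefore fits
# tens of thousands of English words, i.e. hundreds of pages of text.
```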

Other LLMs on today's market have similar capabilities. This gives a sense of the enormous scale of data they can work with and of the help they can provide both to individual specialists and to business as a whole.

Features of LLMs

One of the key aspects of modern large language models is how they represent words. Earlier approaches to machine learning used numerical tables in which each entry corresponded to a certain word, but such a representation could not capture the relationships between individual words, especially words with similar meanings. The solution was to introduce multidimensional vectors, or embeddings, so that words with similar meanings, and phrases related to one another, are placed as close together as possible in the vector space.
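
To make the idea concrete, here is a minimal sketch of measuring closeness in vector space. The three toy vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
# Words with related meanings sit close together in embedding space.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 3-dimensional "embeddings", purely for demonstration.
embeddings = {
    "king":  np.array([0.90, 0.70, 0.10]),
    "queen": np.array([0.85, 0.75, 0.15]),
    "car":   np.array([0.10, 0.20, 0.95]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1.0
print(cosine_similarity(embeddings["king"], embeddings["car"]))    # much lower
```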

This vector representation is what allows modern systems to understand the context of words and the relationships between them through the encoder, and then decode that representation to give the user unique output. This technical solution made it possible to use LLMs in many areas:

  • Classification of texts with similar meaning. Clustering is used here, including clustering based on changes in customer sentiment, document search, and determining the relationships between pieces of text content.
  • Searching for answers in a knowledge base. This technology is called KI-NLP, knowledge-intensive natural language processing. It answers user questions using reference information from digital archives, mostly on general topics.
  • Copywriting. There are many neural networks capable of preparing unique content, as well as networks that can adjust finished material, improving its voice and style.
  • Text generation. This covers the ability to complete an unfinished sentence, prepare documentation for a specific product or service, and even write poetry and stories, all driven by requests in natural language.
  • Generating program code. This, too, relies on processing a natural-language request. The code can be written in different programming languages, such as Python, JavaScript, Ruby, and many others. There are also applications that can create SQL queries, write command-line commands, design web pages, and so on.

And all this involves processing huge amounts of data. But to what extent are security and confidentiality of information ensured?

How do popular LLM services handle data confidentiality?

Confidentiality of information in today's popular LLM services is a question that deserves a deeper and more detailed study. Looking ahead, we note that the picture varies considerably from service to service. So, let's talk about the following solutions:

  1. OpenAI ChatGPT.
  2. Perplexity AI.
  3. Sber GigaChat.
  4. Anthropic Claude.
  5. DeepSeek.

We will consider each LLM service in more detail.

OpenAI ChatGPT

OpenAI's internal policy is quite transparent compared to its analogues, but there are still a number of nuances. In particular, all dialogues that a user conducts with ChatGPT are saved on the company's servers by default. If there is a need to confirm or rule out a violation of the rules later, moderators will be able to view them. And if you use the free version of the neural network, the information you enter can hypothetically be used to train future versions of the model.

At the same time, OpenAI provides its users with genuinely good control tools. For example, the rights to content remain with the person who created it; the company does not claim ownership of this information, and the data will be used only to the extent the platform requires to operate.

In 2023, OpenAI adjusted its default settings. Now, corporate products and data sent via the API are not used to train models without the user's explicit permission. Moreover, businesses can completely deactivate chat-history saving; in that case, correspondence with the neural network is stored for a month without being added to training data, after which it is automatically deleted.

Extended guarantees are also provided for corporate clients: ChatGPT Enterprise can encrypt information and limit its distribution. But you will still need to perform the appropriate configuration first; only then can we speak of sufficiently good ChatGPT security for corporate clients. Ordinary users cannot yet use this extended functionality.

Perplexity AI

Perplexity AI is an AI search assistant that works on top of a number of underlying models, adding its own layer of protection. The company's agreement guarantees that user data will not be passed to the base models for subsequent training.

In other words, your request here is used exclusively to generate a response and not for future training. However, Perplexity AI can use your request history to improve the user experience; if you wish, you can disable this via the AI Data Retention option in the settings. So when working with this neural network, the user receives double protection: no training of external models on your data, plus the option to opt out of query-history use within the service itself.

Sber GigaChat

Sber GigaChat is a Russian neural network whose developers put a lot of effort into compliance with legislative norms and security requirements. According to its creators, the service uses strong encryption and secure channels for data transfer. One of its most significant advantages is that the entire infrastructure is located within the Russian Federation: there is no cross-border transfer of personal data, which is one of the mandatory requirements of Federal Law No. 152. All information remains under Russian law, and the risk of leakage abroad is reduced to a minimum.

Still, despite such loud statements, data exchange with Sber GigaChat cannot be called absolutely confidential. It is a cloud service, so all information, including user requests, is stored by the provider. In addition, there are no official statements on whether this information is used to train models, though there are grounds to believe the platform uses user data to improve its performance.

Anthropic Claude

Anthropic Claude is an LLM service that adheres to fairly strict principles of ethics and confidentiality. The developer states that it does not use user information to train models without explicit permission. Moreover, unlike ChatGPT, data here is not fed into the training dataset by default, especially when the paid APIs are used.

It is also assumed that a minimal amount of users' personal data is saved. All requests and responses are stored by the system for a limited period, up to 2 years, which the company treats as a security requirement; we repeat, though, that this information is not used for training. If desired, users can also refuse data saving entirely, in which case it is deleted as soon as the session ends.

If you are extremely serious about data security and privacy, Anthropic Claude is one of the best options in our selection: compared to its analogues, it really does impose strict requirements on the use of user content at the model-training stage.

DeepSeek

DeepSeek is a Chinese model that is rapidly gaining popularity today, but it raises serious information-security concerns among both ordinary users and specialists. It is known that this LLM service collects a fairly impressive amount of user information and then transfers it to specialized services located in China, which significantly increases the risks for its audience.

Before using DeepSeek in practice, you should understand that all your data automatically falls under Chinese jurisdiction, where personal-data protection currently receives very little attention. In addition, government agencies can easily request access to information stored on local servers, and the storage itself does not observe basic regulations such as the GDPR.

DeepSeek does not even hide the fact that it can collect behavioral biometric data, including typing rhythm, typing characteristics, and typing speed. This opens a loophole for identifying a specific person and using that information for commercial purposes. Third-party trackers are built into the online version of this neural network, allowing technical data to be shared with partner companies such as ByteDance.

In short, by using the free DeepSeek in your work, you pay for it with your own security and personal data. The consequences can be especially damaging for businesses and corporations.

Building a corporate policy for using LLM services

Now that you understand the main differences between the most popular services, you can develop firm rules for using artificial intelligence. Without them, your entire staff will work at their own risk, which ultimately and significantly increases the likelihood of a corporate-security incident. To create your own AI-use policy, you will need to go through several stages:

  1. Determine the list of LLM services approved for use within your company. Analyze the solutions that offer the functionality you need and check each of them against the security requirements relevant to your business. If the security policy of a given service leaves much to be desired, exclude it entirely, keeping only those neural networks whose security guarantees are high enough. For example, you can add to the prohibited list the same DeepSeek, or any unofficial plugins and applications that use AI to access user data without oversight from the information security service.
  2. Specify restrictions on the use of confidential data. It is in your interest to prohibit entering information covered by your company's trade secrets into any public LLM service. This includes personal data of employees and clients, internal correspondence, financial information, and software source code: information that should never reach public neural networks. If such data cannot be avoided entirely, replace employee or user details with impersonal labels.
  3. Write a set of rules for working with code and documents. Documents, whether complete or as fragments of code, must not be uploaded to an LLM service's cloud environment without additional verification, since they may contain classified information. This nuance is especially relevant for software developers, who often use artificial intelligence to analyze and debug a product before launch. All passwords, server addresses, API keys, and other sensitive data must be removed before anything is uploaded to the neural network (a sketch of such a scrubber follows this list). Alternatively, your corporate policy can prohibit the use of AI in programming without a code review, which will minimize the likelihood of classified information leaking.
  4. Train your staff and monitor their actions. Even if you think through technical measures down to the smallest detail, they will not deliver the desired result unless your staff learns to apply them correctly. Launch training, or at least conduct a short briefing on the safe use of LLM services, spelling out the risks of careless handling of AI. Your employees must understand their responsibilities and the potential risks, and concrete examples, of which there are already plenty on the Internet, are the best way to drive the point home.

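As an illustration of point 3 above, here is a minimal pre-upload scrubber sketch. The patterns and placeholder labels are assumptions for demonstration only; a production setup would rely on a dedicated secret scanner with a curated rule set.

```python
# Strip obvious credentials from a snippet before it leaves the perimeter.
import re

SECRET_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "[API_KEY]"),             # OpenAI-style keys
    (re.compile(r"(?i)password\s*[:=]\s*\S+"), "password=[REDACTED]"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "[IP_ADDRESS]"),  # server addresses
]

def scrub(snippet: str) -> str:
    """Replace matches of each pattern with its neutral placeholder."""
    for pattern, replacement in SECRET_PATTERNS:
        snippet = pattern.sub(replacement, snippet)
    return snippet

sample = "HOST = '10.0.0.5'\nPASSWORD = 'hunter2'\nKEY = 'sk-0123456789abcdefghijklmn'"
print(scrub(sample))  # all three secrets replaced with placeholders
```
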
Another important point at the stage of developing a corporate AI-use policy is data classification. You need to divide the information used in your workflow into what can be processed with LLM services and what must be strictly off-limits. Let's dwell on this point in more detail.

Classifying corporate data

All the information inside your corporate network is of varying importance to the business as a whole. If you plan to use large language models in your work, it is important to classify this data: by dividing it into separate groups, you will understand what can be processed by external LLMs and what, on the contrary, must never leave the corporate perimeter.

It is optimal to divide information into 3 separate categories:

  1. Publicly available. This can be used with artificial intelligence without problems: anything already in open sources that carries no value for third parties, such as the text of a marketing campaign, press-release drafts without specific figures, or other general information. Still, check that no personal information sits alongside it, and remember that non-disclosure agreements with partners and clients must always be observed.
  2. Internal information. Provided certain security measures are observed, such information can be processed with LLM services. This is not public data, but neither is it of critical value: for example, an analytical report built on open information, or the results of a staff survey. It is important to aggregate and depersonalize the data, removing or encrypting names and titles, and it is best to use a corporate version of the neural network for such work, which rules out the information becoming publicly available.
  3. Confidential information. This must stay strictly within your business: personal data of clients and employees, all financial statements, source code, product information, and everything you classify as a trade secret. No external LLM service, even one that advertises high security, should be used here; the risk is too high, and leaks of such information can be catastrophic for the business. If you cannot do without AI, connect a neural network deployed in your internal infrastructure.
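
The three-tier scheme above is easy to turn into an enforceable rule. Below is a small sketch of such a check; the category names and the policy table are illustrative assumptions, not a standard.

```python
# Route each data tier only to the processing targets it is allowed to reach.
from enum import Enum

class DataClass(Enum):
    PUBLIC = 1        # open sources, no value to third parties
    INTERNAL = 2      # non-critical, aggregated and depersonalized first
    CONFIDENTIAL = 3  # trade secrets, personal data, source code

ALLOWED_TARGETS = {
    DataClass.PUBLIC: {"public_llm", "enterprise_llm", "local_llm"},
    DataClass.INTERNAL: {"enterprise_llm", "local_llm"},
    DataClass.CONFIDENTIAL: {"local_llm"},  # never leaves the perimeter
}

def may_send(data_class: DataClass, target: str) -> bool:
    """True only if policy permits sending this tier to this target."""
    return target in ALLOWED_TARGETS[data_class]

assert may_send(DataClass.PUBLIC, "public_llm")
assert not may_send(DataClass.CONFIDENTIAL, "public_llm")
```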

Remember: you are doing all this work to ensure a high level of security for your own business, so approach the implementation as professionally and comprehensively as possible.

How to protect data when working with LLM services: practical recommendations

Even if you comply with the standards and requirements for working securely with LLM services, classify your documentation, and choose the tool itself deliberately, it never hurts to provide a number of additional security measures. They will give you the highest achievable level of protection against the various risks of working with artificial intelligence in a corporate environment.

Now let's consider a number of measures worth implementing in practice on top of the basics.

Pay due attention to privacy settings

Practice shows that many modern AI-based platforms provide a fairly wide range of tools for protecting user data, yet many people, whether out of ignorance or unwillingness to delve into the details, simply ignore them. So, first of all, we recommend disabling data saving wherever possible.

As noted above, ChatGPT offers such an option: all your correspondence is deleted automatically after 30 days, and its contents are not used for training. Something similar can be configured in Perplexity, and in Anthropic Claude this behavior is the default. If you work via an API, it is important to opt out of training and logging; in code assistants, disable telemetry if it is currently active.
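
For illustration, here is a hedged sketch of an API call that keeps retention to a minimum, assuming the current official OpenAI Python SDK; the store flag controls whether the completion is kept among the platform's stored completions. Always verify your provider's current data-usage terms, since defaults and flags change over time.

```python
# Minimal API call with response storage explicitly disabled.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this generic, non-sensitive text."}],
    store=False,  # do not retain this exchange in stored completions
)
print(response.choices[0].message.content)
```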

Minimize the specifics of the information

It is very important to depersonalize all data you plan to feed into a neural network and strip out the specifics. Generalize your questions and omit unnecessary details; it is best to build your communication with the LLM service around conditional data. Remove all personal names present in the documents, along with the names of cities and companies. If a sensitive fragment cannot be removed, rename it or hide it behind markers like [CONFIDENTIAL] or "asterisks". The AI model will still understand the context and give you a reasonable answer, but it will not receive specific information that could later be exploited by bad actors or absorbed by the model itself during training.
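
One workable pattern here is reversible pseudonymization: placeholders go out, and the real names are restored locally in the answer. The sketch below uses a hand-made entity list for illustration; a real pipeline would detect entities automatically, for example with an NER model.

```python
# Swap real names for neutral placeholders before sending, restore after.
REPLACEMENTS = {
    "Acme GmbH": "[COMPANY_1]",
    "John Smith": "[PERSON_1]",
    "Berlin": "[CITY_1]",
}

def depersonalize(text: str) -> str:
    """Replace every known entity with its placeholder."""
    for real, placeholder in REPLACEMENTS.items():
        text = text.replace(real, placeholder)
    return text

def restore(text: str) -> str:
    """Map placeholders in the model's answer back to the real entities."""
    for real, placeholder in REPLACEMENTS.items():
        text = text.replace(placeholder, real)
    return text

prompt = depersonalize("Draft a letter from John Smith at Acme GmbH in Berlin.")
print(prompt)  # the LLM only ever sees the placeholders
# answer = restore(llm_answer)  # applied locally once the reply comes back
```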

Take into account technical barriers

Large corporations can also erect technical control barriers. For instance, neural networks can be integrated through your own internal systems, with every outgoing request additionally scanned for confidential information; if any is detected, the transfer of that data to the external network is blocked.

Modern DLP systems deserve particular attention here. They can monitor traffic around the clock and block attempts to transfer files and documents marked "secret" to external neural networks. Moreover, such solutions prevent third-party confidential data from entering your corporate environment, eliminating information distortion.
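
In simplified form, such a barrier is just a pre-flight check sitting in front of the external API. The sketch below is a toy version of that idea; the labels and exception type are assumptions, and real DLP systems inspect content far more deeply than a label lookup.

```python
# Refuse to forward documents whose classification label forbids it.
class DataLeakError(Exception):
    """Raised when a request would carry restricted data outside."""

BLOCKED_LABELS = {"secret", "confidential"}

def forward_to_llm(document: dict) -> str:
    """Forward a document to an external LLM only if its label allows it."""
    label = document.get("label", "").lower()
    if label in BLOCKED_LABELS:
        raise DataLeakError(f"blocked: document is labelled '{label}'")
    # ... the actual call to the external service would go here ...
    return "forwarded"

print(forward_to_llm({"label": "public", "text": "press release draft"}))  # ok
# forward_to_llm({"label": "secret", "text": "Q3 financials"})  # raises DataLeakError
```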

Choose only reliable providers for work

The functionality of most modern LLM services is quite similar, which means you have a real choice, and you should bet on the solutions that guarantee higher levels of confidentiality and security. For example, OpenAI's enterprise versions undergo SOC 2 audits and exclude the use of customer data for training. Anthropic is likewise open about its retention policies, emphasizing the regular deletion of correspondence.

Avoid new neural networks whose privacy policy is vague or whose compliance you doubt. You have every right to ask suppliers for documentation on the security measures they have implemented; if no such documentation exists, or they refuse to provide it, walk away.

Connect local solutions

The most reliable way to secure your business when working with large language models is to deploy your own internal neural network. This is especially relevant for companies that routinely process large volumes of confidential information. You can develop your own AI model, adapted in every detail to the specifics of your business yet completely safe to use.

You can take any open-source LLM as a basis, modify it to suit your needs, and train it further on your own information. In this case, you get absolute control over the data and can process even confidential information with the AI model: all actions are performed in a closed local environment and never leave it. You decide where the information is stored and who can access it.
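
As a minimal illustration, here is a local-inference sketch using the open-source Hugging Face transformers library. The model name is just an example of a small open-weight checkpoint; any locally stored model works the same way.

```python
# Run generation entirely on local hardware: no external API, no third-party logs.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # example open-weight model, runs on CPU
)

# The prompt never leaves the machine.
result = generator(
    "Summarize our internal meeting notes: ...",
    max_new_tokens=120,
)
print(result[0]["generated_text"])
```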

The functionality of a local neural network may well be lower than that of the large models, but leaks are excluded; the main thing is to set up the administration correctly.

Protecting intellectual property when working with LLM services

Today, virtually any user connection to the Internet carries serious risks, and since artificial intelligence came into wide practical use, the situation has become even worse. Now, almost any request sent over the network can end up in machine learning. What does that mean? Your original text, your research results, and your developments can become part of the neural network's responses to other users. Can this be avoided? Here are some additional tips:

  • Choose no-training modes. This option is now implemented in many corporate and paid AI models: it disables training on user data and is worth using whenever you work with sensitive content. You can run 2 different versions of a neural network side by side: one free, for publicly available information, and a second, such as ChatGPT Enterprise, for processing classified data. Remember: banning the use of your information for machine learning improves your odds but does not guarantee that it will never happen in practice.
  • Build the principle of irreversibility into your work. As soon as you send your own text to a neural network, be prepared for it to be saved on the service's servers, and unless you set the appropriate bans, all this data will in all likelihood be used to train the next version of the AI model. You will not be able to revoke this information, even with legal rights in hand. So prefer LLM services where you can prohibit the use of your data, and better yet, don't share valuable information with the neural network at all.
  • Train an AI model locally on your information. This lets you process intellectual property without risk: a fairly narrow neural network built exclusively for your own work, taking into account the specifics of the business and the knowledge base at your disposal. As discussed above, there are many open-source models you can take as a basis to reduce the cost of launching your own LLM service.
  • Use traces, watermarks, or other special markers. Adding such invisible symbols lets you recognize a leak: if text with the same set of symbols surfaces somewhere, you will know where it came from (see the sketch after this list). It is also wise to break large volumes of material into separate fragments and shuffle them, so that the chain of consistent presentation is broken and the full substance is never revealed at once.
  • Monitor for leaks regularly. Make it a habit to check whether your content has turned up in open AI models by monitoring generated texts freely available on the Internet. If you spot one of your unique phrases somewhere, that should alert you and prompt additional security measures. And bet on reliable services, those that maintain high security standards; only then will the neural network be an assistant rather than a source of problems.
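
To show the watermarking idea from the list above in miniature, here is a sketch that hides an identifier in zero-width characters. The encoding scheme is deliberately naive and easy to strip; real watermarking tools are considerably more robust.

```python
# Hide an integer id in invisible zero-width characters inside the text.
ZW0, ZW1 = "\u200b", "\u200c"  # zero-width space / zero-width non-joiner

def embed_marker(text: str, marker_id: int, bits: int = 8) -> str:
    """Append marker_id as an invisible bit pattern after the first word."""
    pattern = "".join(ZW1 if (marker_id >> i) & 1 else ZW0 for i in range(bits))
    first, _, rest = text.partition(" ")
    return first + pattern + " " + rest

def extract_marker(text: str, bits: int = 8) -> int:
    """Recover the embedded id from the zero-width characters, if present."""
    stream = [c for c in text if c in (ZW0, ZW1)][:bits]
    return sum(1 << i for i, c in enumerate(stream) if c == ZW1)

marked = embed_marker("Original research abstract continues here.", marker_id=42)
print(extract_marker(marked))  # -> 42, though `marked` looks identical on screen
```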

In any case, your approach to LLM services should be as comprehensive and balanced as possible. Keep up with current trends and adopt only advanced but proven solutions.

Summing up

The security of using large language models is currently far from perfect. Yes, many measures are being taken in this direction, but not all of them deliver good results, so when working with AI models you need to be extremely careful and attentive. Mobile proxies from the MobileProxy.Space service will help minimize potential risks while letting you use truly advanced and functional neural networks: they provide confidentiality and security of work on the Internet, protection from unauthorized connections, effective bypassing of regional blocks, and more.

You can find out more about mobile proxies here. You can also take advantage of 2 hours of free testing and see for yourself how functional and convenient this solution is. If technical difficulties arise, the support service is available around the clock.

