Data leakage risk
In recent reports, employees of a global industrial conglomerate inadvertently leaked sensitive data by using ChatGPT to check source code for errors and to summarize meeting minutes, exactly the kinds of tasks Large Language Models (LLMs) like ChatGPT excel at. Although entering the sensitive data into ChatGPT did not lead to direct public disclosure, the data could be used by ChatGPT's creator OpenAI to train future models, which could in turn disclose it indirectly in replies to future prompts.
In the specific case of ChatGPT, prompts are retained for 30 days. The use of your prompts to train future models is enabled by default for free accounts and disabled by default for fee-based accounts. OpenAI also recently introduced a feature that lets free accounts turn off chat history for specific conversations. Conversations started while chat history is turned off won't be used to train the models, nor will they appear in the history sidebar.
There is, of course, also the immediate risk of accidental disclosure by ChatGPT itself. For a brief period, a bug caused ChatGPT to expose the prompts of other users in its interface.
To mitigate the risks of data leakage when using LLMs with critical or sensitive information, use self-hosted copies of the models or cloud-hosted ones whose terms of use and privacy policy more closely match your organization's risk appetite. If that is not an option, enforcing limits on the amount of data fed into public models can lower the risk of accidental leakage. This still allows the use of LLMs like ChatGPT for most tasks while preventing a user from copying and pasting large amounts of proprietary data into the web form for summarization or review by the model.
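To illustrate the last point, here is a minimal sketch of a prompt-size guard placed in front of the OpenAI chat completions API. The character limit, the helper name guarded_completion and the choice of model are assumptions made for the example, not an official or recommended configuration; in practice such a guard would typically live in a gateway or proxy your organization controls.

```python
# Minimal sketch of a prompt-size guard in front of a public LLM API.
# MAX_PROMPT_CHARS, guarded_completion and the chosen model are
# illustrative assumptions, not an officially recommended setup.
import os
import requests

MAX_PROMPT_CHARS = 2_000  # illustrative limit; tune to your risk appetite

OPENAI_CHAT_URL = "https://api.openai.com/v1/chat/completions"


def guarded_completion(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Forward a prompt to the model only if it stays under the size limit."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(
            f"Prompt is {len(prompt)} characters; the limit is {MAX_PROMPT_CHARS}. "
            "Large pastes of proprietary data are blocked by policy."
        )

    response = requests.post(
        OPENAI_CHAT_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```

A blunt character limit like this will not stop a determined insider, but it does make the most common failure mode, pasting an entire source file or meeting transcript into a public model, fail loudly instead of silently.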