Generative artificial intelligence (GenAI) creates specific new challenges around data privacy and regulatory compliance.
Such challenges include the inability of models to honor compliance requirements such as the right to be forgotten or the right to erasure, as well as difficulty meeting data localization regulations.
The large language models (LLMs) that power GenAI solutions require vast amounts of data, and many have been trained on raw, unfiltered data from the internet. This real data, while often superior for training, is likely to contain personally identifiable information (PII).
Complicating matters, once real data is fed into a model, the model retains that data and may reveal personal information in ways that are not fully predictable or understood. In other words, once real data is used to train a model, the damage from a privacy standpoint may be irreversible. Unlearning is technically possible by retraining the model, but doing so is prohibitively expensive.
Those who work with large amounts of data face increasing compliance obligations related to the use and protection of personal information.
The most prominent regulatory regime is the EU's General Data Protection Regulation (GDPR). These rules address the circumstances under which an organization can collect, store, and use personal information. Further, the rules establish a right to be forgotten, allowing individuals in some circumstances to have personal data deleted or removed from a database.
There are few clear answers yet to questions about how rules such as the GDPR will be applied to powerful LLMs and to the use of personal data in training these models. Privacy and GDPR compliance were cited by Italy’s data protection authority when it briefly banned OpenAI’s ChatGPT in 2023.
The adoption of the EU’s AI Act this year introduced more specific rules for the development, deployment, and risk management of advanced AI systems. This regulation is designed to complement the GDPR, which is the primary framework for personal data protection and regulation.
Outside of the EU, a range of new laws and evolving regulations may be relevant to AI adoption, including several that aim to harmonize with the GDPR on data privacy matters. Canada has introduced the Digital Charter Implementation Act, and the US has unveiled a Blueprint for an AI Bill of Rights with data privacy provisions.
Given this regulatory landscape, every organization should anticipate greater scrutiny of how generative AI solutions affect data privacy and security.
As legislation and rules are refined, and as regulators become more skilled at identifying potential misuse of personal data in LLM-powered applications, expect a higher bar for compliance.
Any business implementing new AI solutions should address potential privacy regulation up front, not as something to figure out after these new tools are developed and deployed. This may run counter to what's happening in many companies, where there is a rush to find the right use case and get new AI technology in place ahead of competitors.
To proactively address data privacy concerns, organizations should ensure that the data used to train a model does not contain any personal information.
This scrubbing can be done offline on a large data set prior to its use. Increasingly, there are also tools that can scrub data in real time as it is being ingested by a model, as in the sketch below.
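As an illustration, here is a minimal Python sketch of a regex-based redaction pass. The patterns and placeholder labels are illustrative assumptions, not an exhaustive PII taxonomy; production pipelines typically combine pattern matching with named-entity recognition tools rather than relying on regexes alone.

```python
import re

# Minimal sketch: regex-based redaction of common PII patterns.
# These patterns are illustrative, not exhaustive; names, addresses, and
# other free-text identifiers require NER-based detection in practice.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace matched PII with a typed placeholder before the text
    reaches the training pipeline."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Contact Jane at jane.doe@example.com or 555-123-4567."))
# -> Contact Jane at [EMAIL] or [PHONE].
```

A filter like this can sit directly in front of the ingestion step, so no raw record reaches the training corpus unscrubbed.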
There are tradeoffs to this approach: the power and effectiveness of a model depend on the data used to train it, and real data is often best for this purpose. It is possible, however, to scrub data of personal information while preserving the logic and relationships that make it useful, as sketched below.
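One common way to preserve those relationships is consistent pseudonymization: each distinct identifier maps to the same stable token every time, so the raw value disappears while cross-record links survive. The sketch below assumes a salted hash; the salt and token format are hypothetical.

```python
import hashlib

# Minimal sketch: deterministic pseudonymization. The same input always
# yields the same token, so joins and frequency statistics are preserved
# even though the raw identifier never enters the training corpus.
SECRET_SALT = b"replace-with-a-managed-secret"  # hypothetical; store securely

def pseudonymize(value: str) -> str:
    digest = hashlib.sha256(SECRET_SALT + value.encode("utf-8")).hexdigest()
    return f"user_{digest[:12]}"

# Two records for the same customer still link to one token.
print(pseudonymize("jane.doe@example.com"))    # stable token for this user
print(pseudonymize("jane.doe@example.com"))    # identical to the line above
print(pseudonymize("john.smith@example.com"))  # a different token
```

Note that anyone holding the salt can re-derive tokens for known identifiers, so the salt must be managed as a secret, and under the GDPR pseudonymized data may still count as personal data.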
Organizations should prioritize building and training their own models. By tailoring its own model, an organization has far greater oversight of the training data and more control over potential data privacy compliance issues. By contrast, generic or off-the-shelf generative AI models offer less transparency and control over the training data set.
Another benefit for an organization that tailors its own model is that the resulting application, trained for the specific problem it is meant to solve, may be more accurate and more likely to meet expectations than a generic model trained on a broader data set.
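For a sense of what tailoring a model can look like in practice, here is a minimal fine-tuning sketch using the open-source Hugging Face transformers and datasets libraries. The base model (distilgpt2) and the data file (scrubbed_corpus.jsonl) are placeholder assumptions; the point is that every training record passes through the organization's own scrubbing and approval steps first.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder base model; any small open model works the same way here.
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 family has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical file of records that have already been scrubbed and approved.
dataset = load_dataset("json", data_files="scrubbed_corpus.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because the organization assembles scrubbed_corpus.jsonl itself, it can document exactly what the model was trained on, which is difficult to do for an off-the-shelf model.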
Faced with growing obligations related to data security and data privacy, organizations need a governance framework that encourages proactive regulatory compliance.
The compliance team needs to work hand in hand with technology and business leaders from the start, because new AI solutions cannot easily be modified after they are created. Once a model is deployed, remediation efforts will be frustrated by the very design of generative AI models, which have their own opaque ways of storing and learning from data.
The concept of security by design for technology implementations needs to be complemented by privacy by design, applied fully to the generative AI solutions that are becoming ever more important for every business and organization.