Organizations across industry verticals rely heavily on their data to derive relevant insights and make informed predictions. These, in turn, drive sound business decisions that sustain growth. From predicting the inventory of merchandise a retail store should hold, to estimating the probability that a customer will default on a loan, to determining the size of the workforce a company will need over the next few years, to more sophisticated use cases in medical sciences, pharmaceuticals, finance, insurance, telecommunications, retail, travel, tourism, and hospitality, virtually no domain today is untouched by data analytics.
Data analytics requires data, and vast volumes of it. Generally, the more data available for analysis, the more accurate the predictions and the less bias introduced into the outcome. While organizations may be sitting on piles of data gathered or created over the years, with more accumulating routinely, they should be cognizant that some of this data could be sensitive, confidential, or personally identifiable. Some of it may pertain to children.
Globally, many countries are creating their own data protection regulations, and countries that already have such regulations are tightening them to make them more stringent and to accommodate modern data-handling practices. These regulations require organizations to ensure the privacy of data when it is collected, generated, stored, processed, or shared. Some place specific restrictions on processing the data of individuals below a certain age, and most impose hefty penalties on organizations found to be non-compliant. Organizations therefore cannot afford to handle data in an unbridled manner at their convenience, and the ubiquitous presence of data analytics makes the situation more challenging still.

Added to data protection regulations is a new breed of regulations focused on the trustworthiness of AI. In some geographies, efforts are underway to introduce focused legislation requiring more control over how Machine Learning (ML) and Artificial Intelligence (AI) algorithms work on data. Examples include the EU AI Act and the US Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. The Global AI Legislation Tracker, published by the International Association of Privacy Professionals (IAPP), sheds light on the AI strategies envisioned by various countries. Among other concerns, such as the trustworthiness and ethical aspects of data analytics, these proposed regulations pay attention to the kind of data fed into analytics. They are in varying stages of drafting and debate; however, it is safe to anticipate that many mature economies will enforce some such regulation on organizations in their respective geographies.
Thus, organizations must use “the right kind” of input data for building/training analytical models. Such data should not only enable ethical and trustworthy AI but also be privacy-conscious.
Privacy-safe data
An approach for enabling privacy-safe data for data analytics
This paper proposes a five-step approach to building privacy-safe data for data analytics. The fundamental premise, described below, is to combine the strengths of several privacy-enhancing technologies to develop and use a privacy-safe training dataset that mirrors the original production dataset to a great degree.
In this way, multiple privacy-enhancing technologies, such as data subsetting, pseudonymization/static data masking, differential privacy, and tabular generative adversarial networks (TGANs), are brought together to craft realistic training data for building analytical models.
Here is a fictitious example to illustrate the idea:
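Consider a hypothetical retail bank that wants to train a loan-default model on customer records. Rather than feeding production data to the model, the bank takes a representative subset of the data, pseudonymizes direct identifiers, perturbs a sensitive numeric attribute with differentially private noise, and then trains a tabular GAN on the protected subset to generate a fully synthetic training set. The Python sketch below illustrates the idea under these assumptions; the dataset, column names, salt, and privacy parameters are all hypothetical, and the tabular GAN step uses the open-source SDV library's CTGAN synthesizer (API as of SDV 1.x).

```python
import hashlib
import numpy as np
import pandas as pd

# --- Fictitious production data (stand-in for a real extract) -----------
rng = np.random.default_rng(42)
n = 1000
prod = pd.DataFrame({
    "customer_id": [f"C{i:05d}" for i in range(n)],
    "age": rng.integers(21, 70, n),
    "annual_income": rng.lognormal(11, 0.5, n).round(2),
    "loan_amount": rng.lognormal(10, 0.7, n).round(2),
    "defaulted": rng.binomial(1, 0.08, n),
})

# Step 1: data subsetting - work with a representative slice only.
subset = prod.sample(frac=0.5, random_state=42)

# Step 2: pseudonymization / static masking of the direct identifier
# via salted hashing (salt value is a placeholder).
SALT = "replace-with-a-secret-salt"
subset["customer_id"] = subset["customer_id"].map(
    lambda v: hashlib.sha256((SALT + v).encode()).hexdigest()[:12]
)

# Step 3: differential privacy - Laplace noise on a sensitive numeric
# attribute (sensitivity and epsilon here are purely illustrative).
sensitivity, epsilon = 1000.0, 1.0
subset["annual_income"] += rng.laplace(0.0, sensitivity / epsilon, len(subset))

# Steps 4-5: train a tabular GAN on the protected subset and sample a
# fully synthetic training set for the analytical model.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(subset)
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(subset)
synthetic_training_data = synthesizer.sample(num_rows=len(prod))
```

In practice, each step would rely on vetted, governed implementations, and the synthetic output would be validated for both utility (does it preserve the statistical properties the model needs?) and privacy (does it leak any individual record?) before use.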
The above approach offers the following benefits to an organization seeking to ensure the privacy of data used for data analytics:
Considerations for the approach
As with any data analytics process, the proposed approach is subject to the following considerations:
Remediation options
Some nascent technologies aim to mitigate the privacy risk of analytical models built on actual production data, e.g., machine unlearning and model disgorgement. These technologies seek to remove the effect of a given set of records from an existing machine-learning model. They are still evolving, however, and have yet to pass the tests of scalability, robustness, and cost-effectiveness. Until then, organizations would do well to ensure that, at a minimum, newly created models are trained on privacy-safe data.
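As a point of reference, the conceptual baseline against which unlearning techniques are measured is exact retraining: drop the records to be forgotten and fit the model again from scratch on what remains. The sketch below illustrates that baseline with scikit-learn; the dataset and the records to be forgotten are hypothetical, and practical unlearning research aims to approximate this result far more cheaply than full retraining.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: 1,000 records, 5 features, binary label.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Suppose records 100-199 must be "forgotten" (e.g., an erasure request).
forget = np.arange(100, 200)
keep = np.setdiff1d(np.arange(len(X)), forget)

# Exact unlearning baseline: retrain on the retained records only.
# The retrained model is, by construction, free of any influence from
# the forgotten records; practical unlearning methods try to reach an
# equivalent state without paying the full retraining cost.
unlearned_model = LogisticRegression().fit(X[keep], y[keep])
```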
Bad actors can misuse the power of mathematics to render analytical models susceptible to data privacy exposure. However, it is mathematics again that provides counter-tools to "remove the traces" of dataset records that may have been used to build an analytical model. Technologies such as machine unlearning and model disgorgement are being explored to eliminate the effect of specific data records that contributed to a given model. Undeniably, these potent antidotes help reduce the risk of data privacy exposure emanating from analytical models. However, a more "privacy-by-design" approach is to address data privacy in the initial stages of data analytics. The five-step approach presented in this paper, though certainly not a panacea, could help in this direction.