Organizations across industry verticals rely heavily on their data to derive relevant insights and make informed predictions. These, in turn, drive sound business decisions that sustain growth. From predicting the inventory of merchandise a retail store should hold, to estimating the probability that a customer will default on a loan, to determining the size of the workforce a company will need over the next few years, to more sophisticated use cases in medical sciences, pharmaceuticals, finance, insurance, telecommunications, retail, travel, tourism, and hospitality, virtually no domain today is untouched by data analytics.
Data analytics requires data, and vast volumes of it. Generally, the more data available for analysis, the more accurate the predictions and the less bias introduced into the outcome. While organizations may be sitting on piles of data gathered or created over the years, with more accumulating routinely, they should be cognizant that some of this data could be sensitive, confidential, or personally identifiable. Some of it may pertain to children.
Globally, many countries are creating their own data protection regulations, and countries that already have such regulations are tightening them to make them more stringent and to accommodate modern data-handling practices. These regulations require organizations to ensure the privacy of data when it is collected, generated, stored, processed, or shared. Some place specific restrictions on processing the data of individuals below a certain age, and most impose hefty penalties on organizations found to be non-compliant. Organizations therefore cannot afford to handle data in an unbridled manner at their convenience, and the ubiquitous presence of data analytics makes the situation more challenging still.

Added to data protection regulations is a new breed of regulations focused on the trustworthiness of AI. In some geographies, efforts are underway to introduce focused legislation requiring more control over how Machine Learning (ML) and Artificial Intelligence (AI) algorithms work on data. Examples include the EU AI Act and the US Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. The Global AI Legislation Tracker, published by the International Association of Privacy Professionals (IAPP), sheds light on the AI strategies envisioned by various countries. Among other concerns, such as the trustworthiness and ethical aspects of data analytics, these proposed regulations pay attention to the kind of data fed into analytics. They are in varying stages of drafting and debate; however, it is safe to anticipate that many mature economies will enforce some such regulation on organizations in their respective geographies.
Thus, organizations must use “the right kind” of input data for building/training analytical models. Such data should not only enable ethical and trustworthy AI but also be privacy-conscious.
Privacy-safe data
An approach for enabling privacy-safe data for data analytics
This paper proposes a five-step approach to building privacy-safe data for data analytics. The fundamental premise, described below, is to combine the strengths of several privacy-enhancing technologies to develop and use a privacy-safe training dataset that mirrors the original production dataset to a great degree.
In this way, multiple privacy-enhancing technologies, such as data subsetting, pseudonymization/static data masking, differential privacy, and tabular generative adversarial networks (TGANs), are brought together to craft realistic training data for building analytical models.
Here is a fictitious example to illustrate the idea:
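Consider a hypothetical retail bank that wants to train a loan-default model on customer records. Rather than feeding production data to the model, the bank takes a representative subset of the data, pseudonymizes direct identifiers, perturbs a sensitive numeric attribute with differentially private noise, and then trains a tabular GAN on the protected subset to generate a fully synthetic training set. The Python sketch below illustrates the idea under these assumptions; the dataset, column names, salt, and privacy parameters are all hypothetical, and the tabular GAN step uses the open-source SDV library's CTGAN synthesizer (API as of SDV 1.x).

```python
import hashlib
import numpy as np
import pandas as pd

# --- Fictitious production data (stand-in for a real extract) -----------
rng = np.random.default_rng(42)
n = 1000
prod = pd.DataFrame({
    "customer_id": [f"C{i:05d}" for i in range(n)],
    "age": rng.integers(21, 70, n),
    "annual_income": rng.lognormal(11, 0.5, n).round(2),
    "loan_amount": rng.lognormal(10, 0.7, n).round(2),
    "defaulted": rng.binomial(1, 0.08, n),
})

# Step 1: data subsetting - work with a representative slice only.
subset = prod.sample(frac=0.5, random_state=42)

# Step 2: pseudonymization / static masking of the direct identifier
# via salted hashing (salt value is a placeholder).
SALT = "replace-with-a-secret-salt"
subset["customer_id"] = subset["customer_id"].map(
    lambda v: hashlib.sha256((SALT + v).encode()).hexdigest()[:12]
)

# Step 3: differential privacy - Laplace noise on a sensitive numeric
# attribute (sensitivity and epsilon here are purely illustrative).
sensitivity, epsilon = 1000.0, 1.0
subset["annual_income"] += rng.laplace(0.0, sensitivity / epsilon, len(subset))

# Steps 4-5: train a tabular GAN on the protected subset and sample a
# fully synthetic training set for the analytical model.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(subset)
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(subset)
synthetic_training_data = synthesizer.sample(num_rows=len(prod))
```

In practice, each step would rely on vetted, governed implementations, and the synthetic output would be validated for both utility (does it preserve the statistical properties the model needs?) and privacy (does it leak any individual record?) before use.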
The above approach offers the following benefits to an organization seeking to ensure the privacy of data used for data analytics:
Considerations for the approach
As with any data analytics process, the proposed approach is subject to the following considerations:
Remediation options
Some nascent technologies aim to mitigate the privacy risk of analytical models built on actual production data, e.g., machine unlearning and model disgorgement. These technologies seek to remove the effect of a given set of records from an existing machine-learning model. They are still evolving, however, and have yet to pass the tests of scalability, robustness, and cost-effectiveness. Until then, organizations would do well to ensure that, at a minimum, newly created models are trained on privacy-safe data.
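As a point of reference, the conceptual baseline against which unlearning techniques are measured is exact retraining: drop the records to be forgotten and fit the model again from scratch on what remains. The sketch below illustrates that baseline with scikit-learn; the dataset and the records to be forgotten are hypothetical, and practical unlearning research aims to approximate this result far more cheaply than full retraining.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: 1,000 records, 5 features, binary label.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Suppose records 100-199 must be "forgotten" (e.g., an erasure request).
forget = np.arange(100, 200)
keep = np.setdiff1d(np.arange(len(X)), forget)

# Exact unlearning baseline: retrain on the retained records only.
# The retrained model is, by construction, free of any influence from
# the forgotten records; practical unlearning methods try to reach an
# equivalent state without paying the full retraining cost.
unlearned_model = LogisticRegression().fit(X[keep], y[keep])
```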
Bad actors can misuse the power of mathematics to render analytical models susceptible to data privacy exposure. However, it is mathematics again that provides counter-tools to "remove the traces" of dataset records that may have been used to build an analytical model. Technologies such as machine unlearning and model disgorgement are being explored to eliminate the effect of specific data records that contributed to a given model. Undeniably, these potent antidotes help reduce the risk of data privacy exposure emanating from analytical models. However, a more "privacy-by-design" approach is to address data privacy in the initial stages of data analytics. The five-step approach presented in this paper, though certainly not a panacea, could help in this direction.