In 2020 alone, more than 155.8 million individuals in the US were adversely affected by data exposures or unintentional leaks of sensitive information. Any information that must be protected from unauthorized access is classified as sensitive data. This can include personally identifiable information (PII) such as Social Security numbers, protected health information (PHI), financial details like bank account numbers, cardholder data (PCI), customer data, trade secrets and patent-worthy information.
According to the Federal Information Processing Standards, the sensitivity of data can be measured by its confidentiality or privacy, integrity or accuracy, and availability for use at any point in time. Sensitive data needs to be protected throughout its life cycle of discovery, monitoring, masking and de-identification.
The COVID-19 pandemic has triggered disruptive trends such as borderless workspaces, clinical trials, telemedicine, supply chain globalization, return to work, the need for real-time access to data and an overall acceleration of digital transformation. These trends are driving the generation of massive volumes of structured and unstructured data, including logs, that must be discovered and managed efficiently to derive useful insights. Other trends such as mergers and acquisitions and product launches also contribute to the 16.1% CAGR of the data discovery market, which is likely to grow to $12.4 billion by 2026. Common challenges in this process include disparate sources of data, different kinds of structured and unstructured data, lack of data democratization, organizational silos, poor data quality, absence of catalogs, and large amounts of relevant data existing outside the organization in the larger ecosystem.
Data management in the age of big data, data lakes and self-service is challenging. Data catalogs help organize sensitive data from various sources. They provide context about source, structure, quality, lineage and usage by linking sensitive data with its metadata. Cataloging also helps with data classification and with identifying the specific fields that contain sensitive data and need to be masked or encrypted. While a number of paid and open source data catalog tools are available, several hurdles can arise: lack of expertise in choosing the right tool for your business, lack of knowledge of best practices for deploying a data catalog, scalability of the tool, the need for additional plug-ins, license terms and lock-in periods. Security measures such as identity and access management policies are also needed to govern access to sensitive data.
Monitoring sensitive data is important for understanding it and deriving insights. Once existing data across various sources is discovered and cataloged, it is essential to monitor new and incoming data for sensitive content, so that its integration with the existing data and the application of governance controls such as masking and encryption are seamless. Exfiltration by stealth attackers also calls for close monitoring of sensitive data by enterprises. Expanding borderless work and ecosystem perimeters challenge sensitive data security and call for stringent monitoring.
Sensitive data needs to be de-identified before it enters the data lake, so that you can derive insights or new business opportunities from it safely. If an enterprise uses sensitive data without de-identification, it can be penalized for non-compliance. An architecture such as a de-identified data lake by design ensures that enterprises remove sensitive information from data before it enters their data lakes. This also enables data sets to be shared and reused in a selective and controlled manner.
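To make the idea of de-identifying records before they reach the data lake concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any specific product: the field names, the `DIRECT_IDENTIFIERS` and `PSEUDONYMIZE` sets, and the `deidentify` helper are all hypothetical. Direct identifiers are dropped, and linkable identifiers are replaced with a keyed hash (pseudonymization), which on its own may not satisfy every regulatory definition of de-identification.

```python
import hashlib
import hmac

# Hypothetical field lists; a real pipeline would take these from the data catalog.
DIRECT_IDENTIFIERS = {"name", "email"}   # removed entirely
PSEUDONYMIZE = {"ssn", "patient_id"}     # replaced with a keyed hash


def pseudonymize(value: str, key: bytes) -> str:
    """Return a stable, keyed SHA-256 digest of the value (HMAC)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict, key: bytes) -> dict:
    """Return a copy of the record that is safe to write to the data lake."""
    clean = {}
    for field_name, value in record.items():
        if field_name in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers
        if field_name in PSEUDONYMIZE:
            clean[field_name] = pseudonymize(str(value), key)
        else:
            clean[field_name] = value  # non-sensitive fields pass through
    return clean


record = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "flu"}
safe_record = deidentify(record, key=b"demo-key-keep-real-keys-in-a-secret-store")
```

Because the hash is keyed and deterministic, the same input always maps to the same token, so de-identified data sets can still be joined on the pseudonymized field without exposing the original value.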
AI-driven and automated tools can help protect large volumes of data through these four stages of its life cycle.
Automate discovery of sensitive data: Techniques such as machine learning and pattern matching help data discovery tools prepare detailed reports of sensitive data using built-in criteria. On AWS, Amazon Macie automates sensitive data discovery, logging and reporting for Amazon S3 storage buckets.
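As a rough illustration of the pattern-matching side of discovery, the following Python sketch scans text with a few regular expressions and validates candidate card numbers with the Luhn checksum. This is not how Amazon Macie is implemented; the `PATTERNS` table and the `discover` function are hypothetical, and production tools use far richer detection criteria.

```python
import re

# Illustrative patterns only; category names are hypothetical labels.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){12,18}\d\b"),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:      # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def discover(text: str) -> list[tuple[str, str]]:
    """Return (category, matched_text) pairs for sensitive values in text."""
    findings = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            # The checksum filters out digit runs that only look like card numbers.
            if category == "CARD_NUMBER" and not luhn_valid(value):
                continue
            findings.append((category, value))
    return findings


sample = "Contact jane@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
findings = discover(sample)
```

Pairing a loose pattern with a checksum is a common way to reduce false positives, since many long digit strings are not valid card numbers.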
Simplify classification and management of data: A data catalog is a key component of data governance, data quality and analytics. The AWS Glue Data Catalog contains references to data and acts as an index to the location, schema and runtime metrics of that data, which simplifies the classification and management of sensitive data.
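The role a catalog plays as an index of location, schema and classification can be sketched with a small in-memory model. This is a conceptual illustration only and does not use or mirror the AWS Glue API; the `Column` and `CatalogEntry` classes, the classification labels and the example bucket path are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    data_type: str
    classification: str = "NONE"  # hypothetical labels: NONE, PII, PHI, PCI


@dataclass
class CatalogEntry:
    table: str
    location: str  # where the data lives, e.g. an object-store prefix
    columns: list[Column] = field(default_factory=list)

    def sensitive_columns(self) -> list[str]:
        """Return the names of columns that need masking or encryption."""
        return [c.name for c in self.columns if c.classification != "NONE"]


# The catalog itself is just an index from table name to metadata.
catalog: dict[str, CatalogEntry] = {}


def register(entry: CatalogEntry) -> None:
    catalog[entry.table] = entry


register(
    CatalogEntry(
        table="patients",
        location="s3://example-bucket/raw/patients/",  # hypothetical path
        columns=[
            Column("patient_id", "string"),
            Column("ssn", "string", "PII"),
            Column("diagnosis", "string", "PHI"),
            Column("visit_date", "date"),
        ],
    )
)

to_mask = catalog["patients"].sensitive_columns()
```

Once classifications live next to the schema, downstream masking or encryption steps can ask the catalog which fields to handle instead of hard-coding them.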
Monitor data logs and events: Enterprises must define identity and access management (IAM) roles and policies for user accounts in the cloud. The permission policies for these roles allow discovery tools to monitor resource access. In AWS environments, Amazon Macie generates and maintains a complete inventory of your Amazon Simple Storage Service (Amazon S3) buckets. Amazon Macie also produces logs for sensitive data discovery jobs, and the resulting events can be monitored using Amazon CloudWatch.
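As an example of what a narrowly scoped permission policy might look like, the sketch below builds an IAM policy document in Python that grants read-only access to a single S3 bucket. It is an assumption-laden illustration: the bucket name and the `read_only_bucket_policy` helper are hypothetical, and this is not the policy that Amazon Macie itself uses. The S3 actions shown (`s3:ListBucket`, `s3:GetBucketLocation`, `s3:GetObject`) are standard IAM actions.

```python
import json


def read_only_bucket_policy(bucket: str) -> dict:
    """Build a least-privilege, read-only IAM policy document for one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": f"arn:aws:s3:::{bucket}",  # bucket-level actions
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",  # object-level actions
            },
        ],
    }


# Hypothetical bucket name, used for illustration only.
policy = read_only_bucket_policy("example-sensitive-data-bucket")
policy_json = json.dumps(policy, indent=2)
```

Keeping the policy to list and read actions on a named bucket means a scanning role can inspect data without being able to modify or delete it.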
Protect sensitive data: AWS has designed a de-identified data lake (DIDL), an architectural approach that helps solve the data privacy problem by de-identifying and protecting sensitive information before it enters the data lake. A DIDL on AWS helps discover, identify, catalog, monitor and protect your data. Amazon Comprehend Medical and Amazon Rekognition are two native services used in the life sciences industry to de-identify medical images.
To protect sensitive data, enterprises need to leverage automated tools to discover data, implement catalogs to classify and manage data, define IAM policies to monitor data and adopt a de-identified data lake to protect data.
Learn more about how to manage and derive meaningful insights from data on the AWS cloud.