Data is a key growth enabler in today's fast-paced digital transformation space. In the current digital era, information is accessible to all, and data is pumped into systems daily. Every hour, we generate petabytes of data. This volume of available data brings considerable challenges in maintaining data quality. "If you can't measure it, you can't manage it" is an oft-quoted admonition on data management attributed to the late William Edwards Deming, an American statistician known as the guru of quality control.
Data is the most impactful lever for an organization's growth and transformation strategy and a catalyst for the enterprise to become future-ready regarding its operational maturity. However, just having data is not enough; you need quality data. Let's see why.
Assessing data quality is important. This assessment needs to consider both the technical attributes of data, such as consistency, accuracy, completeness, timeliness, and relevance, and the business implications for the industry to which the organization belongs. Therefore, a holistic view of data is required to arrive at recommendations to improve data quality. This amalgamation of technical and business attributes of data can be used to arrive at a data quality score.
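As a rough illustration of how technical and business attributes might be combined into a single score, here is a minimal Python sketch. It is not the DQART formula: the equal weights, the `business_weight` parameter, and the function name `data_quality_score` are all assumptions made for this example.

```python
from typing import Dict

# Hypothetical, equal weights for the five technical attributes named above.
# The article does not publish its weighting scheme; these are assumptions.
TECHNICAL_WEIGHTS: Dict[str, float] = {
    "consistency": 0.2,
    "accuracy": 0.2,
    "completeness": 0.2,
    "timeliness": 0.2,
    "relevance": 0.2,
}


def data_quality_score(
    technical: Dict[str, float],
    business_compliance: float,
    business_weight: float = 0.5,
) -> float:
    """Blend technical attribute scores with business-rule compliance.

    technical: one score in [0, 1] per attribute in TECHNICAL_WEIGHTS.
    business_compliance: share of records in [0, 1] that pass business rules.
    business_weight: assumed share of the final score given to business rules.
    """
    if not 0.0 <= business_weight <= 1.0:
        raise ValueError("business_weight must be between 0 and 1")
    missing = set(TECHNICAL_WEIGHTS) - set(technical)
    if missing:
        raise ValueError(f"missing technical scores: {sorted(missing)}")

    # Weighted average of the technical attributes.
    technical_score = sum(
        weight * technical[name] for name, weight in TECHNICAL_WEIGHTS.items()
    )
    # Linear blend of the technical and business dimensions.
    return (1.0 - business_weight) * technical_score + business_weight * business_compliance
```

In practice, the weights would be tuned to the industry in question, since a timeliness gap may matter more for a retailer's stock data than for a slowly changing reference table.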
In an enterprise ecosystem, data tells a story: the lifecycle of a tuple (data record). The record undergoes multiple transformations and enhancements, producing information that tells the user the purpose of this dataset in the enterprise resource planning (ERP) system.
A classic example is the retail ERP system. It maintains a transactional record of the stock in hand of an SKU (Stock Keeping Unit) at any given instant. This information, in turn, helps in planning the replenishment of correct levels of stock. It helps buyers procure on time, helps merchandisers set the right price for their inventory, and enables the supply chain manager to move goods across locations in a timely manner.
If we listen closely, the SKU data story helps answer questions like: What is the stock at the store at this point? When was the inventory updated? Where is it being kept, and how is the stock moving?
But what happens if the data parameters in this journey get corrupted? Will the data narrate the 'correct' story? Will the planners interpret it correctly, and will they be able to plan the inventory accurately?
Our research found that the existing data quality assessment tools talk only about the technical parameters. There is a need to view data holistically, incorporating both technical and business aspects.
We also see that the data quality journey doesn't end with recommendations. It has to be a closed loop in which the recommendations are fed back into the system, thereby improving the overall data quality.
Based on our holistic assessment of data, we arrived at a set of parameters to interpret data quality from two dimensions -- technical attributes and business implications.
We have devised a 12-point formula (see table below) for our data quality assessment method, which we call the Data Quality Assessment and Recommendations Tool (DQART).
To test our 12-point DQART formula, we applied it to the merchandising system of a leading retailer. The objective of the exercise was to investigate the data quality of the enterprise. Though the use case here is specific to the retail industry, the foundational precepts of the formula are also relevant to other sectors.
Retail organizations depend on multiple systems to deliver goods from the supplier to the customer. Items are the foundational building block for a retailer. Therefore, data quality principles must be rigorously applied while creating an item. For instance, item descriptions often carry special characters (%, ^, &, *, $, #), which violate the item creation protocol. Such deviations could have a ripple effect across the system and cause delays in data processing, impacting customer experience and the reporting output of the decision support systems.
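A check for this kind of violation can be sketched in a few lines of Python. The forbidden set below is simply the six characters listed above. The function name and the sample descriptions are hypothetical, and a real item creation protocol may forbid a different set.

```python
import re
from typing import List

# Assumed forbidden set: the characters named in the text (%, ^, &, *, $, #).
FORBIDDEN_CHARS = re.compile(r"[%^&*$#]")


def find_special_characters(description: str) -> List[str]:
    """Return the distinct forbidden characters found in an item description."""
    return sorted(set(FORBIDDEN_CHARS.findall(description)))


# Hypothetical item descriptions, for illustration only.
descriptions = {
    "1001": "Kids T-Shirt 100% Cotton #1",
    "1002": "Plain Blue Jeans",
}

# Keep only the items that violate the rule, with the offending characters.
violations = {
    item_id: chars
    for item_id, text in descriptions.items()
    if (chars := find_special_characters(text))
}
```

Here `violations` holds only item `1001`, flagged for `#` and `%`. Grouping such violations by department would give the kind of inference the assessment reports.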
Retailers generally choose uniform prices across multiple differentiators (color, size, etc.). Our research shows that uniform pricing across differentiators gets compromised during item maintenance. The result is an item with multiple prices across its variants, which undermines the item's data integrity and affects the overall customer experience.
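The uniform-pricing rule can be checked by grouping variants under their parent item and flagging any parent with more than one distinct price. The sketch below assumes each variant arrives as a (parent item, differentiator, price) tuple. That record layout, the function name, and the sample data are all hypothetical.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, Set, Tuple

# Assumed record shape: (parent_item_id, differentiator, price).
Variant = Tuple[str, str, Decimal]


def items_with_inconsistent_prices(variants: Iterable[Variant]) -> Dict[str, Set[Decimal]]:
    """Return parent items whose variants carry more than one distinct price."""
    prices_by_item: Dict[str, Set[Decimal]] = defaultdict(set)
    for parent_item_id, _differentiator, price in variants:
        prices_by_item[parent_item_id].add(price)
    # A parent item breaks the uniform-pricing rule if it has several prices.
    return {item: prices for item, prices in prices_by_item.items() if len(prices) > 1}


# Hypothetical sample data, for illustration only.
sample_variants = [
    ("TSHIRT-01", "red/S", Decimal("9.99")),
    ("TSHIRT-01", "blue/M", Decimal("11.99")),
    ("JEANS-02", "blue/32", Decimal("29.99")),
    ("JEANS-02", "black/34", Decimal("29.99")),
]
inconsistent = items_with_inconsistent_prices(sample_variants)
```

With this sample, only `TSHIRT-01` is flagged. `Decimal` is used rather than `float` so that prices compare exactly.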
Figure 1: Reference data quality inferences for the Item functional area
From the above figure, we can conclude that the 'kids wear' department is creating items with 'special characters' and violating a business rule of setting consistent prices across multiple differentiators. To mitigate any future deviations, the IT team should advise the department to avoid using special characters and to keep prices consistent across variants. Alternatively, the IT team should eliminate all special symbols during the initial upload/item creation process.
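The second option, eliminating special symbols at upload time, could look like the following sketch. It assumes the same six characters named earlier (%, ^, &, *, $, #) make up the forbidden set. The function name is hypothetical, and replacing each symbol with a space is one possible policy, not one the source prescribes.

```python
import re

# Assumed forbidden set: the characters named earlier (%, ^, &, *, $, #).
FORBIDDEN_CHARS = re.compile(r"[%^&*$#]")


def sanitize_description(description: str) -> str:
    """Strip forbidden characters from an item description before upload."""
    # Replace each forbidden character with a space so adjacent words stay apart.
    cleaned = FORBIDDEN_CHARS.sub(" ", description)
    # Collapse the whitespace left behind by the substitution.
    return " ".join(cleaned.split())
```

Silently rewriting descriptions can change their meaning ("100% Cotton" becomes "100 Cotton"), so a production upload step might log each change or reject the record for manual correction.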
The journey of a data set is fascinating. We mapped the journey of real data sets from their raw form to the quality output, giving us multiple insights into how data quality can be improved.
We measured data quality not only on accuracy, consistency, integrity, timeliness, and relevance but also on the importance of business violations of the multiple rules applied to the data set. Our 12-point formula drills down to the minutest details of the data across the five critical technical attributes and adds the business imperative of the data. The formula provides a detailed series of steps to assess data quality from a holistic point of view. The recommendations highlight the deviations across multiple data attributes and provide information on the leading practices to be followed. This tool can be deployed on-premises and on the cloud. An AI/ML framework can be used to implement the data recommendations and create a closed-loop system that points out the deviations and corrects them.