In travel, retail and banking alike, training AI on inaccurate data can produce disastrous results. In healthcare, one particularly poor outcome occurred when a model trained on inaccurate patient data returned erroneous cancer treatment advice.
Unsafe treatment advice may be an extreme example of the potential negative consequences of AI being trained on bad data, but it demonstrates the reality that biased outputs, inaccurate predictions, an inferior customer experience and reduced trust are all issues that may arise when starting out with inaccurate data.
A previous post in this series on the pillars of data readiness focused on the importance of data being complete. Here, the attention shifts to accuracy, the second of the six pillars of data readiness (complete, accurate, timely, actionable, trusted, compliant).
What is accurate data as it relates to data readiness? Accuracy means that data is free from input errors, matching errors, identity errors, miscalculations, duplicates, misclassifications, and a range of other data issues. In essence, accurate data correctly represents the individual, household or business that a business is trying to understand.
What It Means for Enterprise Data to Be Accurate
Completeness and accuracy are often linked in the data readiness process because both are foundational to making sure that data is right. It is of course possible to have one without the other. A simple example is collecting data of every type and from every source but being unable to extract value because of poor data quality processes. The converse is having perfectly clean (read: accurate) data that is narrow in scope and misses key data sources or transactional and behavioral information.
While either gap can hurt business and CX use cases, accuracy is what keeps the data readiness process moving forward. Data accuracy is a critical upstream activity in making sure the data is right: it is cleansed, matched and householded with precision so that you can confidently build out your high-value use cases.
Data Accuracy and Cleanliness
Accurate data reflects the truth about a customer or entity, having gone through standardization and normalization processes to remove errors, resolve conflicts, and ensure consistency. Ideally, the full scope of data quality processes takes place at data ingestion, when unified profiles are built. This ensures that the matching process accounts for all pertinent data and that data stays correct going forward. Common standardization and normalization tasks include correcting input and formatting errors, handling missing fields, removing duplicates, fixing incorrect values (pounds vs. ounces, etc.) and validating entries. Others include flagging or removing outliers and ensuring categorical consistency (Road vs. Rd.).
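A few of the tasks named above (categorical consistency, unit correction and duplicate removal) can be sketched in Python. This is a minimal illustration, not a production cleansing pipeline: the suffix map, the field names `name` and `address`, and all function names are assumptions made for the example.

```python
import re

# Hypothetical suffix map for illustration; a real reference table would be far larger.
STREET_SUFFIXES = {
    "road": "Rd", "rd": "Rd",
    "street": "St", "st": "St",
    "avenue": "Ave", "ave": "Ave",
}


def normalize_address(address: str) -> str:
    """Collapse whitespace, strip trailing periods/commas from each token,
    and map known street suffixes to one canonical form (Road vs. Rd.)."""
    tokens = re.sub(r"\s+", " ", address.strip()).split(" ")
    cleaned = []
    for token in tokens:
        bare = token.rstrip(".,")
        cleaned.append(STREET_SUFFIXES.get(bare.lower(), bare))
    return " ".join(cleaned)


def to_pounds(value: float, unit: str) -> float:
    """Convert a weight to pounds so all records share one unit."""
    unit = unit.strip().lower()
    if unit in ("lb", "lbs", "pound", "pounds"):
        return value
    if unit in ("oz", "ounce", "ounces"):
        return value / 16.0
    raise ValueError(f"Unrecognized weight unit: {unit!r}")


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each normalized (name, address) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (
            record["name"].strip().lower(),
            normalize_address(record["address"]).lower(),
        )
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Normalizing before comparing is the design point: "12 Main Road" and "12  Main Rd." only collapse into one record once both have been brought to the same canonical form.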
The Six Pillars of Data Readiness: Accurate
The data cleansing process is indispensable for precise matching. Data quality needs to be taken care of at data ingestion because that is when the enterprise begins to develop a comprehensive, contextual understanding of a customer (household, etc.) by building a Golden Record. If data cleansing is put off until activation, or left to various point solutions, what’s lost is a single view of the customer. If data is cleansed only at activation, recency becomes another problem: the customer may have new interactions and behaviors that never become part of a match. If cleansing is left to point solutions, it essentially becomes a free-for-all exercise without standards or integration.
Data Accuracy and Precise Matching
Identity resolution is about accurately detecting patterns to determine whether different facets or signals constitute the same customer, person, household or organization. By using the right combination of deterministic and probabilistic matching, an organization considerably improves its chances of engaging with the right customer or household and of training AI models on data representative of the company’s actual customers.
The better the matching process, the better a brand can accomplish business, CX and AI use cases that depend on having a solid customer understanding. Creating a unified profile with basic deterministic matching (exact matches of records with hard identifiers) is an important, albeit partial, element of an accurate match and an indispensable component of advanced identity resolution.
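Deterministic matching as described above can be sketched as grouping records on an exact, normalized hard identifier. The example below assumes email is the hard identifier and that each record carries a `record_id` field; both are illustrative choices, not details from any particular platform.

```python
from collections import defaultdict


def deterministic_match(records: list[dict], key: str = "email") -> dict[str, list]:
    """Group record IDs that share exactly the same hard identifier.

    Identifiers are trimmed and lowercased first, so formatting noise
    does not block an otherwise exact match. Records with no identifier
    are left unmatched rather than grouped together.
    """
    groups = defaultdict(list)
    for record in records:
        identifier = (record.get(key) or "").strip().lower()
        if identifier:
            groups[identifier].append(record["record_id"])
    return dict(groups)


records = [
    {"record_id": 1, "email": "J.Smith@example.com"},
    {"record_id": 2, "email": " j.smith@example.com"},
    {"record_id": 3, "email": None},
]
matches = deterministic_match(records)
```

Here records 1 and 2 fall into the same group, while record 3 has no hard identifier and stays unmatched, which is the gap that probabilistic matching is meant to address.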
Because the goal of identity resolution is to gain a better understanding of a customer or household by reconciling different records for a customer across different engagement systems – different channels, different identifiers and/or with anonymous-to-known journeys – deterministic matching must be combined with probabilistic matching.
Probabilistic Matching
Probabilistic matching is a key step in resolving various identity proxies for a customer across different systems, multiple data sources and various data types. Probabilistic matching is also instrumental in providing context around a customer, i.e., the customer’s relationship within a household or a business. An example is tying two identity proxies to the same physical address.
Another key benefit of probabilistic matching is to resolve conflicting or unclear signals to produce a higher degree of confidence in a match than using deterministic rules alone. One example in a householding context is to have two identities – say a wife and husband – tied to the same address. Probabilistic matching uses other signals to determine which of the two is connected to any one transaction – and then links that transaction to an individual profile.
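The householding example can be illustrated with a small scoring sketch that attributes a transaction to one household member based on supporting signals. The signal names (`email`, `card_last4`, `device_id`), the weights and the profile layout are all hypothetical, and a real system would typically estimate such weights from data rather than hard-code them.

```python
# Hypothetical signals: transaction field -> (profile field, weight).
SIGNALS = {
    "email": ("emails", 3),
    "card_last4": ("cards_last4", 2),
    "device_id": ("device_ids", 1),
}


def attribute_transaction(transaction: dict, household_profiles: list[dict]):
    """Return the profile_id with the strongest signal overlap.

    Returns None when there is no evidence or when the top two
    candidates tie, so an ambiguous transaction is not force-linked.
    """
    scored = []
    for profile in household_profiles:
        score = 0
        for txn_field, (profile_field, weight) in SIGNALS.items():
            value = transaction.get(txn_field)
            if value is not None and value in profile.get(profile_field, set()):
                score += weight
        scored.append((score, profile["profile_id"]))
    scored.sort(reverse=True)
    if not scored or scored[0][0] == 0:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]


wife = {"profile_id": "P-wife", "emails": {"a@example.com"},
        "cards_last4": {"1234"}, "device_ids": {"tablet-1"}}
husband = {"profile_id": "P-husband", "emails": {"b@example.com"},
           "cards_last4": {"9876"}, "device_ids": {"tablet-1"}}

owner = attribute_transaction(
    {"card_last4": "1234", "device_id": "tablet-1"}, [wife, husband]
)
```

In this made-up household the shared tablet alone cannot separate the two profiles, but the payment card tips the balance toward one of them. Declining to link on a tie is a deliberate choice in the sketch: a wrong link damages accuracy more than a missing one.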
Advanced identity resolution using probabilistic matching also corrects for human errors. A call center agent, for example, enters “Jon Smith” on a form instead of “John Smith.” If other signals indicate the customer belongs to the same record, probabilistic matching will correct for the error where a basic deterministic rule will likely not.
The Role of Diversity in an Accurate Dataset
Accuracy also demands that data is diverse as well as representative, particularly as it relates to AI use cases. A brand may have data specific to customers belonging to a loyalty program that it uses to draw inferences about those customers. But while that data is representative of loyalty customers, it is not representative of all customers. Likewise, if a brand wants to introduce a new product designed for Millennials and GenZ but its data skews toward GenX and Boomers, inferences made from the dataset will not be accurate for the intended demographic. This is the classic polling problem: phoning only landlines and using the response set to make presumptions about the population at large.
A lack of accurate data has the potential to derail almost any CX, business or AI use case that depends on a precise understanding of a customer, household or business. Skewed AI results and incorrect or incomplete matching put a brand at a disadvantage when it comes to extracting value from customer data. Prioritizing accuracy in a comprehensive approach to data quality helps a brand maximize one of its most valuable assets: customer data.
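One simple way to spot this kind of skew before training a model is to compare each segment’s share of the dataset with its share of the intended audience. The sketch below assumes each record has already been labeled with a segment; the segment names and target shares are made-up numbers for illustration only.

```python
from collections import Counter


def representation_gaps(sample_labels: list[str], target_shares: dict[str, float]) -> dict[str, float]:
    """Return (share in dataset - share in target audience) per segment.

    Negative values mean the segment is under-represented in the data.
    """
    total = len(sample_labels)
    if total == 0:
        raise ValueError("sample_labels must not be empty")
    counts = Counter(sample_labels)
    return {
        segment: counts.get(segment, 0) / total - share
        for segment, share in target_shares.items()
    }


# Made-up dataset that skews toward GenX and Boomers.
dataset = ["GenX"] * 6 + ["Boomer"] * 3 + ["Millennial"] * 1
# Made-up target audience for a hypothetical new product.
target = {"GenX": 0.25, "Boomer": 0.25, "Millennial": 0.25, "GenZ": 0.25}

gaps = representation_gaps(dataset, target)
under_represented = sorted(seg for seg, gap in gaps.items() if gap < 0)
```

With these numbers, Millennials and GenZ come out under-represented, which signals that inferences drawn from the dataset should not be trusted for that audience without collecting more data or reweighting.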
While accuracy gets the ball rolling on the process of making data right and fit-for-purpose, the entire data readiness process consists of six distinct pillars. The next part of our series on data readiness will focus on the importance of data being timely.