Aug 12, 2020

How to Avoid Data Distortion & Machine Learning Bias

While there are many sources and types of errors in your marketing data (data biases or distortions), one common trait is that the data inaccurately reflect either the customer you’re engaging with or the market you are operating in. When data distortion seeps in, it can be hard to eradicate, in large part because analysis based on flawed data often just reinforces false assumptions, making it less likely that anyone will doubt the data’s veracity.

Data that is counter-factual, or not representative of the real world, has the potential to damage brand reputation, skew marketing campaigns or otherwise steer marketers down the wrong path. For an organization to rid itself of this pernicious pest, it’s important to understand what data bias is, why it needs to be removed, and what steps to take to remove it.

What is Data Distortion?

Data distortion is the deviation of data from its true or most accurate representation of the full picture; bad data may inject incorrect or misguided “facts” into useful information or, worse, into models and predictions about customers or the business. Internal sources of customer data, as well as data collected on the open market, have a history of being erroneous, incomplete or unrepresentative, omitting key variables and/or simply being out of date.

A customer may move or get married; using an old address or marital status after a life change is a common example of data distortion, one that will corrupt a customer record if the information is not updated. Similarly, a purchased data set might contain transactions and behaviors that no longer represent your customers. Imagine using online/offline purchase history from 2019 to predict customer purchase habits in the summer of 2020, when so many people have drastically changed shopping patterns due to Covid-19. Using data collected by some other group for some other purpose, or data collected from an unrepresentative subset of your customers (one timeframe, location, or customer cohort), to represent the breadth and depth of your customers and their affinities, desires and expectations is likely to give you poor results.
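One way to catch out-of-date records like these is to flag anything that has not been verified recently. The following is a minimal sketch, not a prescribed method: the field names (`customer_id`, `address_updated`) and the one-year threshold are assumptions made for illustration.

```python
from datetime import date

# Hypothetical customer records; the field names are illustrative assumptions.
records = [
    {"customer_id": 1, "address_updated": date(2020, 6, 1)},
    {"customer_id": 2, "address_updated": date(2017, 3, 15)},
    {"customer_id": 3, "address_updated": None},  # never verified
]

def stale_ids(records, as_of, max_age_days=365):
    """Return IDs whose address was never verified or is older than max_age_days."""
    stale = []
    for rec in records:
        updated = rec["address_updated"]
        if updated is None or (as_of - updated).days > max_age_days:
            stale.append(rec["customer_id"])
    return stale

flagged = stale_ids(records, as_of=date(2020, 8, 12))
print(flagged)  # [2, 3]
```

Flagged records can then be routed for re-verification rather than fed straight into a model.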

Categorization decisions made by humans can also cause distortion. Consider a marketing survey that wants to know the age of customers. If responses are presented as a range of options (25-34, 35-50, 51-64, etc.), how did the marketer arrive at those specific cut-offs? Are they random, or meaningful for the purpose of the data collection intended to form the basis of a campaign? If, for example, you’re trying to form meaningful correlations between income and age, it would be very important to know the age at which your customers are retiring for purposes of setting an age range that reflects that reality.
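To see how much the choice of cut-offs matters, the same set of ages can be bucketed two ways. This is only a sketch: the ages are made up, and the retirement age of 62 is an assumed value standing in for whatever your own customer data shows.

```python
from bisect import bisect_right
from collections import Counter

def bucket(age, edges, labels):
    """Map an age to a label; `edges` holds the lower bound of every bucket after the first."""
    return labels[bisect_right(edges, age)]

ages = [24, 33, 36, 49, 52, 58, 63, 66, 71]  # made-up sample

# Scheme A: survey-style ranges similar to those in the text.
edges_a = [25, 35, 51, 65]
labels_a = ["<25", "25-34", "35-50", "51-64", "65+"]

# Scheme B: a single split at an assumed typical retirement age of 62.
edges_b = [62]
labels_b = ["under 62", "62+"]

counts_a = Counter(bucket(a, edges_a, labels_a) for a in ages)
counts_b = Counter(bucket(a, edges_b, labels_b) for a in ages)

print(counts_a["51-64"])  # 3
print(counts_b["62+"])    # 3
```

In this sample, the 63-year-old lands in the "51-64" bucket under Scheme A alongside customers who are presumably still working, but in the "62+" bucket under Scheme B. Any income-by-age correlation would be computed over different groups depending on which scheme was chosen.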

Other human-introduced biases may occur when humans capture original data, including situations such as a customer filling out a form. If time constraints prevent the form from being filled out completely, populating the final unanswered questions with a default answer may introduce too many default values into the data collection, which doubles as a data hygiene problem. Similarly, if a call center associate is asked to rate the sentiment of a caller, ratings may be skewed by the associate’s own KPIs and measurements on customer satisfaction.
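A simple check for this kind of problem is to measure how often each field holds its default value. The sketch below uses hypothetical form fields and defaults, and the 60% threshold is an arbitrary assumption; a sensible threshold depends on the form and the audience.

```python
def default_rate(rows, field, default):
    """Share of rows whose `field` equals the form's default value."""
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row.get(field) == default)
    return hits / len(rows)

# Hypothetical form submissions and per-field defaults (assumptions for illustration).
submissions = [
    {"income_band": "50-75k", "contact_pref": "email"},
    {"income_band": "Prefer not to say", "contact_pref": "email"},
    {"income_band": "Prefer not to say", "contact_pref": "email"},
    {"income_band": "75-100k", "contact_pref": "phone"},
]
DEFAULTS = {"income_band": "Prefer not to say", "contact_pref": "email"}

rates = {field: default_rate(submissions, field, default)
         for field, default in DEFAULTS.items()}

# Fields where the default dominates deserve a closer look before modeling.
suspect = [field for field, rate in rates.items() if rate > 0.6]
print(suspect)  # ['contact_pref']
```

A high default rate does not prove the values are wrong, since many customers may genuinely prefer email, but it marks fields where the recorded answer may reflect the form rather than the customer.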

Filtering also causes bias if data is filtered to align with assumptions – the reinforcing of false assumptions mentioned above. If, for example, you’re trying to market to your highest value customers, the way you devise a campaign – in terms of channel, type of outreach, tone, message, measurement, etc. – may all be based on your preconceived notion of what a high-value customer looks like. If machine learning models are built to find what you’ve already defined as a high-value customer, they’ll find exactly what you want – whether your presuppositions are right or wrong. Similarly, if you send one group 10 emails and another group five emails, and there’s a greater response from the group that receives 10 and you therefore deem them higher-value, the question becomes how did you determine the groupings in the first place? In the same vein, if analysis reveals results that do not reinforce your expectations, you may re-work the data, change a model, or otherwise massage the results until they better align with the assumptions you were making.
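For the email example, one common safeguard is to assign customers to groups at random, so that group membership cannot reflect a preconceived idea of who is high-value. A minimal sketch, with hypothetical customer IDs and a fixed seed so the split can be reproduced:

```python
import random

def assign_groups(customer_ids, seed=42):
    """Shuffle IDs and split them in half, independent of any customer trait."""
    ids = list(customer_ids)
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

# Hypothetical customer IDs 1..100.
ten_email_group, five_email_group = assign_groups(range(1, 101))
print(len(ten_email_group), len(five_email_group))  # 50 50
```

With random assignment, a difference in response between the two groups is more plausibly due to the number of emails than to how the groups were picked.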

What are the Dangers of Data Distortion?

If we think about data distortion in terms of how it affects the customer, the organization and brand reputation from a public/regulatory standpoint, there are two key categories of reasons why biases or distortions should be removed: one consumer-oriented and one organization-oriented.

Customers clearly have an interest in a brand having an accurate, relevant picture of who they are. Any inaccurate data that has been collected will distort that picture, especially if it is used to build models, make decisions, create segments, target campaigns and otherwise form the basis for how you will engage with the customer.

From the organizational perspective, biased data will similarly skew results with potentially negative outcomes for corporate objectives, whether improving sales, efficiency, customer satisfaction, retention, loyalty or another KPI. Likewise, if campaign analysis introduces errors in measurement, attribution, grouping, or causation, intentional or not, marketers will have an inaccurate picture of a campaign’s effectiveness, and thus will not be acting in the best interest of the goals and KPIs they’re pursuing.

A second corporate interest in avoiding data bias pertains to brand reputation: corporations have a vested interest in collecting accurate data because they want to behave appropriately toward their customers. Any inaccurate representation of reality could hinder fair and equitable decisions by the company, which runs counter to making a corporation trustworthy in the eyes of the customers it is trying to serve.

How Do I Remove Data Distortion?

There are five things a marketer can do to help keep machine learning models free of the statistical and sociological biases described above.

The first, broad category for steering clear of data and machine learning bias is to build accurate and careful data collection processes. Make sure your ecommerce site, customer surveys, loyalty memberships, and call center forms all collect data that are consistent, reasonable, and easy for customers and agents to enter. This can help your organization avoid introducing errors or distortions into its own first-party data.

Second is to strive to identify and compensate for biases in third-party data (text analytics, facial recognition, credit scores, sociographics) that are used as variables in your models. In other words, do not take for granted that the judgments and values assigned to the data you’re collecting match your own understanding.

Third, where possible, build “explainability” into models by documenting training data and decisions, determining primary variables and weights for models, and using more transparent algorithms (such as decision trees).

Next, it’s important to experiment, measure and optimize models to ensure they make measurably good decisions, with a requirement for some independent assessment and tools to define “good” in a value-neutral and explainable way.
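One simple, explainable way to define “good” is to compare the conversion rate of a model-targeted audience against a randomly chosen control group. The sketch below uses made-up campaign numbers, and relative lift is just one possible metric, chosen here as an assumption.

```python
def conversion_rate(conversions, audience_size):
    """Fraction of the audience that converted."""
    if audience_size <= 0:
        raise ValueError("audience_size must be positive")
    return conversions / audience_size

def lift(treated_rate, control_rate):
    """Relative improvement of the treated group over the control group."""
    if control_rate == 0:
        raise ValueError("control rate is zero; relative lift is undefined")
    return (treated_rate - control_rate) / control_rate

# Made-up results: 1,000 customers picked by the model, 1,000 picked at random.
model_rate = conversion_rate(60, 1000)
control_rate = conversion_rate(40, 1000)
relative_lift = lift(model_rate, control_rate)

print(round(relative_lift, 2))  # 0.5, i.e. a 50% relative lift
```

Because the control group is chosen without the model, the comparison does not depend on the model’s own assumptions about who is a good prospect.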

Finally, you can put in place processes and technology to correct errors and respond to customer inquiries. GDPR and CCPA preference centers, Master Data Management Stewardship software, and Data Quality components for your customer data management all provide means to improve your customer data, help customers fix (and trust) your data, and help your organization turn privacy/compliance challenges into customer engagement opportunities.

Don’t Let Data Distortion be Intimidating

Removing errors and biases from machine learning models, and from the overall way business is conducted, aligns an organization with its internal needs and with its expectations of being both profitable and a “good citizen”.

Being cognizant of data collection pitfalls and being proactive in addressing them has many positive payoffs. Because data bias is not usually intentional and is hard to detect, it is often perceived as impossible to avoid or to remove completely. But putting appropriate mitigation measures in place makes creating a complete, accurate and representative data set a far less intimidating problem than it appears at first glance; it’s just one of many data management tasks that are necessary to align strategy with execution. Putting those measures in place may be difficult, but it’s certainly less of a hardship than allowing bad data to go unchecked.
