How confident are you about your modeled data?

If you answer honestly, your confidence probably rests on little more than sticking a finger in the air to see which way the wind is blowing. The problem with modeled data is its very nature: it is modeled. Errors and inaccuracies can therefore creep in, making the data useless at best and, at worst, a dangerous tool in business decision-making. That is why confidence scores are crucial to today’s modeled data attributes.

In order to trust and use data science and modeled data, both the science and the data need to be transparent and explainable.

If brands are to make important decisions around pricing, qualification, risk and more using data science, they have to be able to understand how models came to the scores they have and how accurate the models themselves are. It is vital for communicating with customers and regulators alike.

Let’s take the insurance industry as an example. Confidence-scored data gives insurers the autonomy to set their own thresholds when making nuanced judgements around pricing or the customer journey. Companies can decide for themselves between a more disruptive but thorough customer journey and automated form fill when creating policies. Specialty services can tailor models to these variables with full transparency into the quality of the data and the risk they are facing.

However, there are two main problems in creating accurate confidence scores on modeled insurance data. The first is when there isn’t very much training data available. The second is when there is an abundance of training data available, but it is skewed or not representative of the data to be predicted. If this is the case, there is a significant risk that the model will produce high confidence scores for inaccurate predictions because the scoring population is inconsistent with the training population. It’s like creating a model to identify oranges and using it to predict apples. That the model has good confidence in its ability to predict oranges simply isn’t applicable.

To mitigate the risk of small training data, the key is to use statistical methods and tests, along with explicit distribution assumptions, to select upper and lower confidence bounds that reflect how volatile the data is. The solution to the second issue is more complex than it might seem at first. To combat it, it is crucial to create a process that ensures the test data is representative of the training data and vice versa.
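As an illustration of the small-sample case, the sketch below computes upper and lower bounds for an observed accuracy rate using the Wilson score interval. This is one possible method among many, not necessarily the one used in practice here; the function name and the example counts (18 of 20 versus 1,800 of 2,000 correct predictions) are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence under a normal approximation.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical validation results: the same 90% observed accuracy,
# measured on a small sample and on a large one.
small = wilson_interval(18, 20)
large = wilson_interval(1800, 2000)

print("n=20:   lower=%.3f upper=%.3f" % small)
print("n=2000: lower=%.3f upper=%.3f" % large)
```

Both samples show the same observed accuracy, but the bounds for the small sample are much wider, which is the volatility the confidence score needs to reflect.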

In recent times the flood of data has removed the need to be strict with confidence scores and boundaries. When modeling on skewed data, however, this discipline is still imperative. Training and test data must be compared and reconciled to remove bias.
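One way to check whether two populations resemble each other is a population stability index (PSI) computed on a single numeric attribute. The sketch below is a minimal version under stated assumptions: equal-width bins taken from the training range, a hypothetical attribute drawn from synthetic normal distributions, and a small floor on empty bins. The 0.1 and 0.25 cut-offs often quoted for PSI are rules of thumb, not guarantees.

```python
import math
import random


def population_stability_index(train, score, n_bins=10, eps=1e-6):
    """PSI between two numeric samples, binned on the training range."""
    lo, hi = min(train), max(train)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the training range go into the first or last bin.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = bin_shares(train), bin_shares(score)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Synthetic, hypothetical data: one scoring population that matches
# the training population and one that has drifted.
rng = random.Random(0)
train = [rng.gauss(50, 10) for _ in range(2000)]
similar = [rng.gauss(50, 10) for _ in range(2000)]
shifted = [rng.gauss(70, 10) for _ in range(2000)]

psi_similar = population_stability_index(train, similar)
psi_shifted = population_stability_index(train, shifted)
print("similar population PSI: %.4f" % psi_similar)
print("shifted population PSI: %.4f" % psi_shifted)
```

A large PSI on an important attribute suggests that confidence scores learned on the training population may not carry over to the population being scored.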

Given these challenges, creating confidence scores can often be just as complex as creating your predictive model.

It requires judgment, statistics and experience. Moreover, accurate confidence scores are vital when providing data that will underpin business processes, and they are an important part of building trust with both consumers and regulators.

This article was published in Analytics magazine, September 2018.
