Anyone in business knows that the accuracy and latency of the data driving their decisions are critical, and that trust in the process that presents that data is essential. Yet many data specialists, analysts, data scientists and data quality managers are not confident they can deliver trusted data at speed to their organisation.
So what do you need to understand in order to get the best performance from data and to be confident that the data you use and touch is up to the standards required?
Will combining data help me get better outcomes?
It all depends on the purpose for which the data is going to be used. Part of the problem comes from the volume of data that organisations are dealing with and the fact that this data has been acquired and collected in different places, at different times, with varying degrees of accuracy.
For instance, data which is very fleeting, like clicks or social media interactions, is very valuable for making an instant decision online, but is less predictive in the long term when merged with more stable structured or personal data held by the organisation. It is tempting to give undue significance to online data, as the volume of data points can be so much higher than for more persistent data, such as income or where someone lives. However, the half-life of a click is much shorter, and a click represents only a single snapshot decision rather than a major lifestyle descriptor.
As growing numbers of disparate data sets are combined, confusion as to how much weight should be given to the different sources grows and the overall level of confidence diminishes. It gets even worse once modelled data is thrown into the mix.
Predictive models can be immensely useful, often making very accurate predictions or guiding knotty optimisation choices. But if confidence in the underlying data is low, the likelihood of benefiting from a decision being made is also low.
How do I predict with confidence?
When used in the right way, big data and predictive models can help overcome bias, make great predictions or guide difficult optimisations. The best course of action is to be confident of the data you are using in the first place, and to reduce the time your teams spend defending or re-validating it.
Applying confidence scores to increase trust
A confidence score is a simple figure that indicates the confidence level of a piece of data, i.e. how accurate it is. Traditionally, confidence scores described the provenance of the data – whether it is known where the data came from and whether it is fully verified. As data has become more complex and unstructured, confidence scores have also become more sophisticated, incorporating several factors that help to establish the reliability of the data, such as:
- System integrity – how many systems have the same data value?
- Governance – is the data compliant?
- Completeness – is the data complete or incomplete?
- Security – is the data safe from breach and loss?
- Timeliness – how old is the data?
- Accuracy – what is the probability that the model output is accurate?
These factors help to paint a more detailed picture about the data subject and there are many more factors that can be applied. Additionally, each can be weighted in terms of their relative importance to the overall score.
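The weighting idea above can be sketched in a few lines of Python. The factor names, scores and weights below are purely illustrative assumptions, not a standard formula; each factor is presumed to have already been normalised to a 0–1 scale.

```python
# A minimal sketch of a weighted confidence score. Every factor name,
# score and weight here is illustrative, not a prescribed standard.

def confidence_score(factors: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-factor scores, each assumed to be in [0, 1]."""
    total_weight = sum(weights[name] for name in factors)
    weighted = sum(score * weights[name] for name, score in factors.items())
    return weighted / total_weight

factors = {
    "system_integrity": 0.9,   # share of systems agreeing on the value
    "governance": 1.0,         # compliant: 1, non-compliant: 0
    "completeness": 0.8,       # fraction of fields populated
    "timeliness": 0.6,         # decays with the age of the record
    "accuracy": 0.75,          # model's estimated probability of being right
}
weights = {
    "system_integrity": 1.0,
    "governance": 2.0,         # weighted higher: compliance matters most here
    "completeness": 1.0,
    "timeliness": 1.5,
    "accuracy": 2.0,
}

print(round(confidence_score(factors, weights), 3))
```

Because the result stays on a 0–1 scale, scores remain comparable across attributes even as factors are added or re-weighted.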
The problem with modelled data is that, by its very nature, it is modelled. Errors and inaccuracies can creep in making it at best useless and, at worst, a dangerous tool in business decision-making. That is why confidence scores are crucial to today’s modelled data attributes.
The impact of training data on confidence scores
There are two main pitfalls when creating model accuracy and confidence scores, both stemming from training data. The first arises when there is very little training data available. The second arises when there is an abundance of training data, but it is skewed or not representative of the data to be predicted.
In the second case, there is a significant risk that the model will overfit and produce high confidence scores for inaccurate predictions, because the scoring population is inconsistent with the training population. It is like building a model to identify oranges and using it to predict apples.
To mitigate the risk of small training data volumes, sound statistical knowledge is key to selecting an approach that sets upper and lower confidence levels wide enough to reflect volatile data. The solution to the second issue, however, is more complex than it might seem at first.
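As one concrete option for the small-data case (our assumption of a suitable technique, not a method prescribed here), the Wilson score interval is a common statistical choice for putting honest upper and lower bounds on an accuracy estimate from a small sample:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (95% confidence by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - half, centre + half)

# 9 correct out of 10: the point estimate is 0.9, but with so little
# data the plausible range is wide, roughly (0.60, 0.98).
print(wilson_interval(9, 10))
# 900 correct out of 1000: same point estimate, far tighter bounds.
print(wilson_interval(900, 1000))
```

The same 90% accuracy claim carries very different confidence depending on sample size, which is exactly what a well-built confidence score should reflect.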
It is crucial to create a process that ensures the test data is representative of the training data and vice versa. In recent times, the sheer abundance of data has tempted teams to relax their discipline around confidence scores and boundaries, but when modelling on skewed data that discipline remains imperative: training and test data must be calibrated to remove bias.
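One simple way to check that two populations line up is a drift measure such as the population stability index (PSI), a common industry heuristic; using it here is our assumption, not something the text mandates. A minimal sketch for a single numeric feature:

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of a numeric feature."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def shares(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Floor each share to avoid log(0) on empty bins.
        return [max(c / len(sample), 1e-4) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]                 # uniform on [0, 1)
scoring_same = [i / 100 + 0.001 for i in range(100)]  # essentially identical
scoring_skewed = [(i / 100) ** 3 for i in range(100)] # piled up near zero

# A common rule of thumb: PSI < 0.1 is stable, > 0.25 is a significant shift.
print(psi(train, scoring_same) < 0.1)
print(psi(train, scoring_skewed) > 0.25)
```

A large PSI on a key feature is a signal to recalibrate or re-sample before trusting the model's confidence scores on the new population.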
How do I explain what my data does?
Having a clear overview of how data is being used and what value each attribute is delivering helps people across the organisation to measure the likely impact of the data to drive outcomes and make more informed choices.
The role of confidence scores in explainability
For many businesses, trust is just half the battle. Explainability is also key. If brands are to make important decisions around pricing, qualification and risk using data science, they have to be able to understand how models arrived at the scores they produced and how accurate the models themselves are.
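For a simple linear model, explainability can be as direct as decomposing a score into per-feature contributions. The feature names and coefficients below are hypothetical, chosen only to illustrate the idea:

```python
# A minimal sketch of explaining a linear model's score as per-feature
# contributions. The coefficients and features are illustrative only.

coefficients = {"income": 0.4, "tenure_years": 0.25, "recent_clicks": 0.05}
intercept = 0.1

def explain(features: dict[str, float]) -> dict[str, float]:
    """Contribution of each feature to the final score (coefficient * value)."""
    return {name: coefficients[name] * value for name, value in features.items()}

customer = {"income": 0.8, "tenure_years": 0.5, "recent_clicks": 0.9}
contributions = explain(customer)
score = intercept + sum(contributions.values())

for name, c in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {c:+.3f}")
print(f"score: {score:.3f}")
```

More complex models need more sophisticated attribution techniques, but the goal is the same: a ranked, human-readable account of why the model produced the score it did.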
Under GDPR, companies are now required to know the provenance of the data they hold and process. Furthermore, consumers have the right to know exactly what their personal information is being used for and why decisions have been made as a result. Given that customer data is a primary source of fuel for the algorithms constructed using machine learning, organisations have a legal responsibility to understand these models.
Explainability is a case of plain and simple ethics. Understanding a model to anticipate any unintended consequences and potential bias that could impact on vulnerable customers (or indeed any customer) is morally the right thing to do. Many organisations have come under fire for biased models, such as Amazon’s advanced AI hiring algorithm which was discovered to favour men heavily for technical positions. As models are becoming more complex with larger numbers of features and increased feature engineering, explainability becomes more of an issue, which is why confidence scores are business-critical.
Clearly, the creation of confidence scores can often be just as complex as building a predictive model.
How do I bolster trust in data?
When done well, combining vast amounts of data from multiple sources to create increasingly sophisticated algorithms will improve corporate performance. The key is to make sure that the methodology and approach taken within your organisation is robust. This means:
- Being acutely aware of the provenance and purpose for which disparate data is collected and used.
- Attributing confidence scores to all data attributes, as this improves the organisation's ability to build more predictive models.
- Ensuring the data models are fully explainable to underpin a business process and build trust with both the consumer and regulators.
This will all involve greater rigour and investment. But in the long term it will have an important impact on your business.
Published by DataIQ
If you want more information on how and why you should use confidence scores to aid decision making, our white paper will be an interesting read for you.