Just one-third of the 400 CEOs surveyed for KPMG’s 2016 CEO Outlook said they trust their data analytics. That is an astonishingly low rate, and the decision science industry should take it as a wake-up call to shine a light on its processes. Auditors should also take note, because most of these CEOs lead companies that have invested heavily in data analytics.
I set out to write this post on the top reasons I have seen that analytics may not be reliable, and it quickly turned into a multi-page paper on IT auditing. This blog is intended to be brief commentary, not a forum for white papers. Therefore, we will split the overall topic into a series of articles, each focused on a single component that is easier to digest.
To start the conversation, we will explore one of my favorite topics to discuss with data scientists: clean data is not reliable data. The ugly truth is that the term “clean data,” as it is used in the analytics field, refers to a process that completely neglects any evaluation of data completeness. Wikipedia defines data cleaning as the process of detecting and correcting (or removing) corrupt or inaccurate records from a data set, table, or database. It is hard to have a complete set of records if any are removed. Let’s explore the data cleaning process in more detail. The Certified Analytics Professional exam study guide (summarizing page 41) defines “data cleaning” to include:
- Identifying and filtering out invalid values;
- Identifying inconsistent data encodings and recoding them using standard values;
- Identifying and removing outliers and suspicious values;
- Identifying null values and deleting the record or imputing a new value; and
- Identifying suspicious interrelationships between fields to remove duplication.
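To make these steps concrete, here is a minimal pandas sketch of a typical cleaning script. The column names and toy values are hypothetical, invented purely to illustrate each step; note how steps 1, 3, 4, and 5 all silently drop records:

```python
import pandas as pd

# Hypothetical order data: an invalid state code, a duplicate row,
# an impossible month, and a null amount.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "state": ["TX", "tx", "tx", "CA", "ZZ"],
    "amount": [199.99, 99.99, 99.99, 149.99, None],
    "order_date": ["2017-01-05", "2017-01-06", "2017-01-06",
                   "2017-13-01", "2017-01-09"],
})

# 1. Filter out invalid values (unknown state codes).
df = df[df["state"].str.upper().isin(["TX", "CA"])]

# 2. Recode inconsistent encodings to a standard value.
df["state"] = df["state"].str.upper()

# 3. Remove suspicious values (dates that fail to parse, e.g. month 13).
df["order_date"] = pd.to_datetime(df["order_date"], format="%Y-%m-%d",
                                  errors="coerce")
df = df.dropna(subset=["order_date"])

# 4. Remove duplicate records identified across fields.
df = df.drop_duplicates()

# 5. Delete records with null values (imputation is the alternative).
df = df.dropna(subset=["amount"])
```

Of the five input rows, only two survive this pipeline, and nothing in the output records that three were discarded.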
Again, how can one have assurance in an analytics dashboard or report if it is built on data with records removed by some data geek in a complex script written in Python, R, SAS, etc.? Let’s explore with an example. Below is a table of data from a hypothetical customer database:
As you can see, the last two rows have problematic purchase dates. A typical data cleansing process allows both of these records to be deleted from calculations, resulting in the following summary table, which omits $249.98 in orders:
A second analytics professional may treat the “dirty” values differently because the third row includes a date in 2017, despite its invalid month, producing an aggregation such as the following:
The erroneous calculations could also extend into percentages, measures of variation, and predictive models about the success or failure of Promo A or B. In practice, analytics professionals deal with this issue on a much larger scale: when analyzing millions of records, it is often not feasible to individually examine erroneous data entries. Regardless of the size of the data, filtering out ANY transactions will affect calculations to some extent.
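The two-analyst divergence above can be sketched in a few lines of pandas. The row amounts and malformed date strings here are assumptions reconstructed from the totals discussed in this post, not the actual table:

```python
import pandas as pd

# Hypothetical promo orders: the last two purchase dates are malformed
# (an impossible month, and an unparseable string).
orders = pd.DataFrame({
    "promo": ["A", "B", "A", "B"],
    "amount": [199.99, 199.99, 99.99, 149.99],
    "purchase_date": ["2017-01-03", "2017-01-04", "2017-13-15", "01/0?/2017"],
})

parsed = pd.to_datetime(orders["purchase_date"], format="%Y-%m-%d",
                        errors="coerce")

# Analyst 1: drop every row whose date fails to parse.
strict_total = orders.loc[parsed.notna(), "amount"].sum()

# Analyst 2: also keep the third row, since its year is clearly 2017.
looks_like_2017 = orders["purchase_date"].str.startswith("2017")
lenient_total = orders.loc[parsed.notna() | looks_like_2017, "amount"].sum()
```

Both scripts are defensible cleaning choices, yet they report different revenue totals ($399.98 versus $499.97) from the same source data, which is exactly the consistency problem.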
So what is a better solution that different analysts can consistently apply in both these scenarios? Let’s explore!
When considering the quality of data, the term “reliable data” refers to data that is relevant, complete, accurate, and timely. IT auditors have wrestled with this question for decades, and the Government Accountability Office published a report titled “Assessing the Reliability of Computer-Processed Data” that we can use in this situation. The report describes a methodology for evaluating the internal controls environment surrounding the source data. If the controls surrounding the data are not reliable, all calculations and conclusions made using that data should be communicated to executives and managers as unreliable. For specific calculations, we can easily quantify the extent of the problem.
We know the count and amount of observations before data cleaning, so we can easily add a metric describing the amount of known error. Extending the customer promo example above, we could report the following:
The reliability metric is calculated as the percent of all recorded values included in the analysis. In the example above, this is ($199.99 + $199.99) / ($199.99 + $199.99 + $99.99 + $149.99) = 399.98 / 649.96 = 0.6154, consistent with the $249.98 in omitted orders.
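The metric is simple enough to compute alongside any aggregation. This sketch assumes, per the example, that the two $199.99 orders survive cleaning while the $99.99 and $149.99 orders are filtered out:

```python
# All amounts recorded before cleaning, and those surviving cleaning
# (assumed values mirroring the worked promo example).
recorded = [199.99, 199.99, 99.99, 149.99]
included = [199.99, 199.99]

# Reliability: share of recorded dollars actually used in the analysis.
reliability = sum(included) / sum(recorded)
print(f"Reliability: {reliability:.1%}")  # → Reliability: 61.5%
```

Publishing this one extra number with every dashboard tells executives how much of the recorded activity their report actually reflects, without requiring them to audit the cleaning script itself.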