The process of turning data into information that can be presented simply is often incredibly complex. I believe this irony arises primarily because most available data is not formatted for analysis. Building a large, custom data set with the exact list of features you want to analyze (Design of Experiments) can be very expensive. If you have pockets as deep as Big Pharma or are ready to dedicate years to a PhD, it’s certainly a great way to go.
Our last blog on trusting data analytics explored how the industry practice of “data cleaning” can spoil the reliability of an entire analysis. But problems can also occur with perfect, clean, complete, and reliable data. In this post we will explore the topic of data provenance and how the complexities of data storage can sabotage your data analytics.
The truth is… business data is structured and formatted for business operations and efficient storage. Observations are usually:
- Recorded when it is convenient to do so, resulting in time increments that may not represent the events we actually want to measure;
- Structured efficiently for databases to store and recall, resulting in information on real world events being shattered across multiple tables and systems; and
- Described according to the IT department’s naming conventions, resulting in the need to translate each coded observation.
These factors really complicate our job as decision scientists. To compile information on a real-world issue, we have to reverse engineer how our data is stored and recorded. (A quick shout-out to OLAP data cubes, which certainly alleviate the hurdles of data prep if your objective is to slice, dice, aggregate, filter, or pivot the data. In my experience, though, OLAP cubes still require significant manipulation before you can apply other analytic methods like regression modeling, decision trees, or neural networks.)
Let’s assume all our data is complete, accurate, and clean. Let’s also assume all the technical work is done correctly. I’d like to highlight a very common problem with the process of finding, extracting, and joining data, using a very simple example to introduce the concept of “data provenance”. In this example we have a customer table and we want to count the number of new customers per day. We find the relevant table (in our example there is just one), summarize only the data we need, and create a histogram showing the number of new customers per day. This is illustrated below:
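The summary step above can be sketched in a few lines. This is a minimal illustration, not the actual query from the example; the table and column names are assumptions.

```python
import pandas as pd

# Hypothetical customer table; column names are assumptions for illustration
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "date": ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-05"],
})

# Count new customers per day -- the summary behind the histogram
new_per_day = (
    pd.to_datetime(customers["date"])
    .value_counts()
    .sort_index()
)
print(new_per_day)
```

From here, `new_per_day.plot(kind="bar")` would produce the bar chart described below. Note that everything hinges on what the `date` column actually represents, which is exactly the provenance question this post explores.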
Data provenance refers to the process of tracing the origin of data back to its source. The word “provenance” stems from the French word “provenir”, which means “to come from”. The primary purpose of tracing the provenance of a variable is to provide evidence of its origin. Therefore, the data provenance process is central to the validation of data in an analytic product. Both decision scientists and auditors should incorporate some version of these steps into their work.
Refer again to the last image with our common understanding of data provenance. Notice that the Customer Table has multiple date fields. Which of these did the analyst select and extract as the “Date” field? The chosen field would almost certainly change our results as visualized in the bar chart.
Let’s assume we selected the “Last Purchase” field and renamed it AS “Date” in a database query. None of the dates are named “Customer acquisition date”. Did we select the correct date? Would this information be important to know when interpreting our simple graph?
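To make the problem concrete, here is a minimal sketch of that rename, using an in-memory SQLite table. The schema and column names are assumptions standing in for the example’s Customer Table; the point is that the alias erases the field’s origin.

```python
import sqlite3

# Toy stand-in for the example's Customer Table (schema is assumed)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER,
        last_login    TEXT,
        last_purchase TEXT,
        last_update   TEXT
    )
""")
conn.execute("INSERT INTO customer VALUES (1, '2024-01-02', '2024-01-02', NULL)")

# Renaming "last_purchase" AS "date": the alias carries no hint of its origin
rows = conn.execute(
    "SELECT customer_id, last_purchase AS date FROM customer"
).fetchall()
print(rows)
```

Anyone downstream who sees only the `date` column has no way of knowing it was really `last_purchase`, which is why the graph’s interpretation depends on information the query itself discarded.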
Validating data provenance means confirming the origin and transformation of every variable that influences the model results. In our hypothetical example, we did the research to learn that all new-customer records have 1) a null value for “Last Update” and 2) the same date recorded in “Last Login” and “Last Purchase”. This makes sense: a new user would not have any need to update his or her records, and a first-time customer would make a purchase and log in on the same date.
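That business rule can be written down as an explicit, testable filter. A minimal pandas sketch, assuming the same hypothetical column names as above:

```python
import pandas as pd

# Assumed schema mirroring the example's Customer Table
customers = pd.DataFrame({
    "customer_id":   [1, 2],
    "last_login":    ["2024-01-02", "2024-01-03"],
    "last_purchase": ["2024-01-02", "2024-01-06"],
    "last_update":   [None, "2024-01-07"],
})

# New customers: no record update yet, and first login == first purchase
is_new = customers["last_update"].isna() & (
    customers["last_login"] == customers["last_purchase"]
)
new_customers = customers[is_new]
print(new_customers["customer_id"].tolist())  # → [1]
```

Encoding the rule in code (rather than in an analyst’s head) means an auditor can rerun it against the source table and confirm the provenance claim directly.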
A better solution is to embed the source, documentation, and date of extraction into the final format that you will use to report. I term this a “data birth certificate”. With so many new terms in the field of analytics, why not? OK, a quick internet search shows that “data citation” might be an applicable term, so I’ll move forward with it instead. This data citation provides executives, analysts, auditors, or any other interested party with critical information to validate the relevance of the fields and sources on which an analytic product is based. Let’s update our example accordingly.
Now our cleaned query includes comments with the date we extracted the data, its source(s), and the file name of the script/procedure/functions with details of the entire data provenance. Now our model-building steps have sufficient detail to find our source, verify the names we used, and validate our transformations.
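One possible way to embed such a data citation is to template it into the extraction query itself, so the provenance travels with the result. All of the names, paths, and the comment format below are illustrative assumptions, not a standard:

```python
# A "data citation" embedded in the extraction step (format is an assumption)
CITATION = {
    "extracted_on": "2024-01-15",
    "source": "crm.customer (production replica)",
    "provenance_script": "etl/new_customers_per_day.sql",
}

query = """
-- extracted_on: {extracted_on}
-- source: {source}
-- provenance_script: {provenance_script}
SELECT customer_id, last_purchase AS date
FROM customer
""".format(**CITATION)

print(query)
```

Because the comments ride along inside the query text, anyone who finds the saved query (or a log of it) can trace the result back to its source table, extraction date, and the script that produced it.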
To explore the topic of data provenance more, I suggest: