Why Audit and Data Science Should be Married

I read a highly rated question and answer on Quora about Key Performance Indicators for a data science team (https://www.quora.com/What-are-the-best-KPIs-for-Data-Science-team) and couldn’t help but respond out of empathy and frustration.  Many respondents recommend KPIs related to one of the biggest obstacles in many decision science projects… getting access to the data.

It boggles my mind why organizations put so many resources and strategic focus on data-driven decisions, but don’t give their own teams access to the data they need to do their job.  The openly-biased opinion of The Analytic Auditor is that the audit function is a perfect solution.  Auditors (particularly Internal Auditors) hold a permanent spot in the organization that demands several factors that are pivotal to starting any decision science project.  These include:

  1. Access to executive sponsors that drive data projects,
  2. Established processes for handling sensitive and protected data, and
  3. Durable access to data and systems across the organization.

No other unit across the organization enjoys all these benefits.  Not directors, not consultants, and not even database administrators!  Those of us in an internal audit or similar role should use every one of these factors to our advantage!  We can get started by listening to executive concerns, requesting access to every system we can, diving deep into data with every tool at our disposal, and delivering insights!  This article focuses on auditors’ advantages for starting decision science projects, but our advantages continue into the data science process.  More to come; please subscribe to make sure you receive future posts on how that occurs!

Beyond Desktop Databases

We recently spoke with an experienced auditor about how his organization has made analytics its top audit-related priority.  This is a very reasonable decision given that work can be automatically executed, documented, and continuously recycled as a “continuous audit” procedure.  The efficiencies are quite appealing to auditors who repeat much of their work on a quarterly or annual cycle.  Ironically, my colleague noted their primary tool is Microsoft Access (R), which delivers none of these most valuable benefits.  Let’s examine… Access (R) cannot natively:

  • Produce sharable analytic procedures to import, clean, and transform data;
  • Automatically execute analytic procedures to perform work without human action;
  • Produce logs of multiple processing steps that serve as audit documentation; and
  • Restrict access to private information.

So where is an aspiring analytic auditor to start?

sql

What tools can an experienced decision scientist recommend to developing analysts?  We feel SQL is a valuable first step in any analytics career.  SQL (Structured Query Language) is so pervasive that the International Organization for Standardization (ISO) has codified it.  In today’s digitized world, with massive amounts of data being gathered every day and stored in databases, knowing how to query and program with SQL is the most useful skill we can imagine for an analytic auditor.  Lots of people use it, so it’s a transferable skill.  Furthermore, SQL solutions are strong in many performance areas that are key to analytic auditing, including:

  • Connecting to multiple SQL databases, a popular platform for operational data;
  • Producing scripts that perform multiple processing actions and can be shared among different individuals and retained as audit documentation; and
  • Providing access controls for databases, tables, and individual records.
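To make the second point concrete, here is a minimal sketch of a scripted procedure that cleans and summarizes data and keeps a log of each step.  It uses Python’s built-in sqlite3 module only so the SQL can run anywhere; the table names, column names, sample rows, and the run_procedure helper are all hypothetical, and your organization’s SQL platform will differ in the details.

```python
import sqlite3


def run_procedure(conn, steps):
    """Run named SQL steps in order and return a simple log of what ran."""
    log = []
    for name, sql in steps:
        cur = conn.execute(sql)
        # rowcount is only meaningful for INSERT/UPDATE/DELETE statements
        log.append((name, cur.rowcount))
    conn.commit()
    return log


# Hypothetical raw data standing in for an extract from a payment system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_payments (invoice_no TEXT, vendor TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_payments VALUES (?, ?, ?)",
    [("A-100", " acme ", 1200.0), ("A-101", "ACME", 300.0),
     ("B-200", "Globex", None), ("B-201", "globex", 4500.0)],
)

# The procedure itself: a shareable, repeatable list of named steps
steps = [
    ("remove rows with no amount",
     "DELETE FROM raw_payments WHERE amount IS NULL"),
    ("standardize vendor names",
     "UPDATE raw_payments SET vendor = UPPER(TRIM(vendor))"),
    ("summarize by vendor",
     "CREATE TABLE vendor_totals AS "
     "SELECT vendor, COUNT(*) AS n_payments, SUM(amount) AS total "
     "FROM raw_payments GROUP BY vendor"),
]

audit_log = run_procedure(conn, steps)
totals = conn.execute(
    "SELECT vendor, n_payments, total FROM vendor_totals ORDER BY vendor"
).fetchall()

for name, rows_affected in audit_log:
    print(name, rows_affected)
print(totals)
```

Because the steps live in a script rather than in someone’s mouse clicks, the same file can be handed to a colleague, scheduled to run unattended, and filed with the work papers.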

There are multiple “flavors” of SQL; it is used by Microsoft SQL Server, Oracle, MySQL, Amazon’s Redshift, and many other popular platforms.  Each of these solutions uses a slightly different version of the SQL language because each vendor has developed custom functions to differentiate its product.  But the good news is that these functions are not necessary to perform the basic steps in the analytic process.  If your organization uses a type of SQL, then we suggest you begin using it; almost all of the skills you learn will be transferable to the other solutions!  The most important decision is the decision to begin using SQL if you are pursuing a career in analytics.  Learning is not supposed to be comfortable, so just get started! To help you on this journey, we’ve compiled some useful resources:

For more free and valuable content, subscribe to this blog on the top right.

About the Certified Analytics Professional (CAP)

About four months ago I decided to take my passion for decision science to a new level by pursuing the Certified Analytics Professional (CAP) certification.

CAP Logo

Coming from a non-technical background, some people (particularly those with computer science backgrounds) were skeptical of my knowledge and abilities working with large amounts of data and writing predictive models.  (Ironically, one of the same data scientists with a heavy CS background inspired a separate post on the pitfalls of common data cleaning procedures.)  I feel a relevant certification is a great way to give others confidence in my foundation of knowledge in data analytics.

The CAP seems to be the best branded, most well recognized, and best sponsored option for data science related certifications.  In a July 2014 article titled 16 big data certifications that will pay off in CIO magazine, the CAP exam was listed as the first item on the list.

Should You Trust Analytics II: Data Provenance

The process of turning data into information to present it in a simple manner can be incredibly complex.  I believe this irony is primarily because most available data is not formatted for analysis.  Building a large, custom data set with the exact list of features you desire to analyze (Design of Experiments) can be very expensive.  If you have pockets as deep as big Pharma or are ready to dedicate years to a PhD, it’s definitely a great way to go.

Our last blog on trusting data analytics explored how the industry practice of “data cleaning” can spoil the reliability of an entire analysis.  But problems can also occur with perfect, clean, complete, and reliable data.  In this post we will explore the topic of data provenance and how the complexities of data storage can sabotage your data analytics.

Data Provenance 2

The truth is… business data is structured and formatted for business operations and efficient storage.  Observations are usually:

  • Recorded when it is convenient to do so, resulting in time increments that may not represent the events we actually want to measure;
  • Structured efficiently for databases to store and recall, resulting in information on real world events being shattered across multiple tables and systems; and
  • Described according to the IT departments’ naming conventions, resulting in the need to translate each coded observation.
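The second and third points can be sketched with a small example.  Below, a single real-world event (an order) is split across a header table and a line table, and its status is stored as a code that only makes sense with a lookup table.  Every table name, column name, and code here is hypothetical, and Python’s built-in sqlite3 module is used only so the SQL is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical operational tables, named the way an IT department might name them
conn.executescript("""
CREATE TABLE ord_hdr  (ord_id INTEGER, cust_cd TEXT, ord_dt TEXT);
CREATE TABLE ord_ln   (ord_id INTEGER, stat_cd TEXT, amt REAL);
CREATE TABLE stat_lkp (stat_cd TEXT, stat_desc TEXT);
INSERT INTO ord_hdr  VALUES (1, 'C01', '2015-03-01'), (2, 'C02', '2015-03-02');
INSERT INTO ord_ln   VALUES (1, 'SH', 100.0), (1, 'BO', 40.0), (2, 'SH', 75.0);
INSERT INTO stat_lkp VALUES ('SH', 'Shipped'), ('BO', 'Backordered');
""")

# Reassemble the event and translate the coded values into readable names
rows = conn.execute("""
SELECT h.ord_id    AS order_id,
       h.ord_dt    AS order_date,
       s.stat_desc AS line_status,
       l.amt       AS amount
FROM ord_hdr AS h
JOIN ord_ln AS l ON l.ord_id = h.ord_id
LEFT JOIN stat_lkp AS s ON s.stat_cd = l.stat_cd
ORDER BY h.ord_id, l.amt DESC
""").fetchall()

for row in rows:
    print(row)
```

The LEFT JOIN to the lookup table is a deliberate choice in this sketch: a status code with no translation still shows up in the result (with an empty description) instead of silently dropping the record.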


Should You Trust Analytics III: Analytics Process

Lack of trust in source data is a common concern with data analytic solutions. A friend of mine is a product manager for a large software company that uses analytics for insights into product sales. He told me the first thing executives and managers do when new analytic products are released in his NYSE-traded, multi-billion dollar company is… manually recalculate key metrics.  Why would a busy manager or executive spend valuable time opening up a spreadsheet to recalculate a metric? Because he or she has been burned before by unreliable calculations.

I’ve been exploring the subject of unreliable data since a recent survey of CEOs revealed that only one-third trust their data analytics.  I have also been studying for an exam next week to earn a Certified Analytics Professional designation to formalize my knowledge on the subject.  While studying each step in the analytics process defined by INFORMS, the sponsoring organization for the Certified Analytics Professional exam, I’ve considered how things could go wrong and result in an unreliable outcome.  In the flavor of Lean process improvement (an area I specialized in earlier in my career), I pulled those potential pitfalls together in a fishbone diagram:

Analytic Errors Fishbone


Audit Standards and Data Analytics?

While I was giving a presentation on Analytics at a recent event, one of the meeting participants asked how the Audit industry felt about data products created using Analytic processes.  My first thought was that Analytics is a form of “analytical procedures”.  That was my response, but I had to qualify it by acknowledging that I wasn’t sure how different auditing standards addressed the topic.  Over the last few days I’ve been able to do some research and pull together a quick synopsis of how the most commonly used Audit standards define the work behind Analytics.  In summary, my initial impression was pretty close… several major Audit standards define this type of work and emphasize the reliability of the data that underpins Analytic data products.

Analytic Procedures Graphic

Quit Sampling!

This post may disturb some “old school” auditors.  In fact, I used to be a self-described old school auditor.  If I couldn’t find a work paper in the permanent file, someone was going to get an earful about their ability to keep reliable documentation.  But the business sector has evolved at a frightening rate since those days.  Surprisingly, many audit professionals still consider sampling and testing to be their go-to procedure.

I’m not knocking the old student’s t-test.  It’s still the right tool for some situations.  But it’s been harder and harder for me to find those situations in recent years.  Most of the subjects I audit now can easily be scrutinized using data mining, or even scripted into an automated monitoring report.
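As a rough illustration of why testing the whole population can beat sampling, here is a minimal sketch.  The population of 10,000 payments, the five planted exceptions, the approval limit, and the sample size of 60 are all made-up numbers, and find_exceptions is a hypothetical helper, not a procedure from any audit standard.

```python
import math
import random


def find_exceptions(payments, limit):
    """Return every payment whose amount exceeds the approval limit."""
    return [p for p in payments if p["amount"] > limit]


# Hypothetical population: 10,000 routine payments with 5 over-limit exceptions
payments = [{"id": i, "amount": 100.0} for i in range(10_000)]
for i in (17, 2048, 5555, 7301, 9999):
    payments[i]["amount"] = 25_000.0

# Old school: test a random sample of 60 items
random.seed(0)
sample = random.sample(payments, 60)
sample_hits = find_exceptions(sample, limit=10_000.0)

# Data mining: run the same test against every record
population_hits = find_exceptions(payments, limit=10_000.0)

# Chance that a sample of 60 contains none of the 5 exceptions
p_miss_all = math.comb(10_000 - 5, 60) / math.comb(10_000, 60)

print(len(sample_hits), "found in the sample")
print(len(population_hits), "found in the full population")
print(f"probability the sample misses every exception: {p_miss_all:.2f}")
```

Under these assumptions the sample misses every exception about 97% of the time, while the full-population test finds all five every time and costs only a few more milliseconds of computing.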

Journal of Accountancy on Analytics

While trying to get this blog to appear higher in internet searches, I ran into an interesting article in the Journal of Accountancy discussing Analytic Auditing.  The article is written from an external audit perspective and focuses on two benefits.  First, analytics provides auditors with greater insights into their clients’ business that help them quickly get up to speed on the external audit customer’s business model.  Second, the article mentions how analytics provides better service to clients.

The article also states a position that I’ve had since learning analytics as an external auditor… that external auditors’ use of analytics lags far behind that of internal auditors.  I think a key reason for this is access to and familiarity with data.  As an external auditor it took several weeks for me to gain access to a new client system. Once the client granted me access, I didn’t have much time to pull something useful together.  Rarely did my projects have more than one or two models.  As an internal auditor, I’ve had the same difficulties getting initial access to the system.  But once granted access I can continue to develop models as long as my results are useful to the organization.  On certain projects/systems this period lasted for several years and allowed for deep exploration and understanding.

To me, one of the most impactful sentences from this article is “The profession [external auditing] needs to achieve a ‘quantum leap’ to redesign audit processes using today’s technology, rather than using information technology to computerize legacy audit plans and procedures.”

See more at: http://www.journalofaccountancy.com/issues/2015/apr/data-analytics-for-auditors.html

Visualized Correlations

One interesting approach to root cause analysis is to correlate descriptive variables about errors with one another.  I created this correlogram to visualize every possible combination of correlation coefficients among observations from a large information system.  At the intersection of two numbers is a square that represents the correlation of those two variables across hundreds of observations.

2015-05-12.correlogram2

Blue shows a positive correlation, red represents a negative, and darker saturation signifies a stronger relationship.  What trends might give insight into the root causes?  I chose to explore variables 14 (vertical blue trend), 25 (horizontal), and 27 (horizontal).

The analysis was performed in Excel and also in R using the correlogram package.
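For readers who want to see the mechanics behind a correlogram, here is a minimal sketch in plain Python rather than the Excel and R used for the chart above.  The four variables and their values are made up, and the 0.3 and 0.7 cutoffs for “light”, “medium”, and “dark” shading are my own assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlate every variable with every other variable."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def shade(r):
    """Blue for positive, red for negative, darker for stronger (assumed cutoffs)."""
    hue = "blue" if r > 0 else "red" if r < 0 else "none"
    strength = "dark" if abs(r) >= 0.7 else "medium" if abs(r) >= 0.3 else "light"
    return hue, strength


# Hypothetical observations for four descriptive variables
data = {
    "var1": [1, 2, 3, 4, 5],
    "var2": [2, 4, 6, 8, 10],
    "var3": [5, 4, 3, 2, 1],
    "var4": [2, 1, 3, 1, 2],
}

matrix = correlation_matrix(data)
for a in data:
    print(a, [round(matrix[a][b], 2) for b in data])
```

Each cell of the matrix corresponds to one square in the chart, and shade() turns a coefficient into the color and saturation described above.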

Using Regression to Predict Duplicate Payments

I recently used logistic regression on supersamples from 400,000,000 paired invoices in a payment system to identify the factors that best predict if an invoice was submitted more than once.  Some less scrupulous business partners do this in hopes of getting paid twice for the same job.  Positive values in the graph increase the probability of an erroneous payment, negative values decrease that probability, and the width of the line surrounding each point shows a 95% confidence interval based on the observations.

Duplicate_Pmt_Diagnostics2

I expected the invoice number to be a much larger coefficient, but it looks like that number is a popular one to “fudge” among those who are trying to squeeze an extra payment out of a business partner.  It also looks like questionable invoices are more often submitted at values less than $5K, so businesses aren’t willing to take the same risks on high value invoices.  Is this consistent with what your company has experienced?  Has your company used methods other than logistic regression to get different results?  I’d love to hear about it!
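If you have never seen where a coefficient and its 95% confidence interval come from, here is a minimal sketch of a logistic regression with a single predictor, fitted by Newton-Raphson in plain Python.  This is not the model behind the graph above: the 200 invoice pairs are made up, the lone under_5k predictor is a stand-in for the real factors, and the interval is a simple Wald interval (coefficient plus or minus 1.96 standard errors).

```python
from math import exp, sqrt


def score_and_information(b0, b1, xs, ys):
    """Gradient of the log-likelihood and entries of the information matrix."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + exp(-(b0 + b1 * x)))  # predicted probability of a duplicate
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return g0, g1, h00, h01, h11


def fit_logistic(xs, ys, iterations=25):
    """Fit y ~ intercept + slope * x; return (estimate, standard error) pairs."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = score_and_information(b0, b1, xs, ys)
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard errors come from the inverse information matrix at the estimate
    _, _, h00, h01, h11 = score_and_information(b0, b1, xs, ys)
    det = h00 * h11 - h01 * h01
    return (b0, sqrt(h11 / det)), (b1, sqrt(h00 / det))


# Hypothetical data: 100 invoice pairs at $5K or more (10 duplicates)
# and 100 invoice pairs under $5K (30 duplicates)
under_5k = [0] * 100 + [1] * 100
duplicate = [0] * 90 + [1] * 10 + [0] * 70 + [1] * 30

(intercept, intercept_se), (slope, slope_se) = fit_logistic(under_5k, duplicate)
ci_low = slope - 1.96 * slope_se
ci_high = slope + 1.96 * slope_se

print(f"under_5k coefficient: {slope:.3f}")
print(f"95% confidence interval: ({ci_low:.3f}, {ci_high:.3f})")
```

In this made-up data the coefficient is positive and its interval sits entirely above zero, which is how a point on the graph would look for a factor that raises the probability of a duplicate payment.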