Test Driven Development in Business Intelligence

It has been my experience that testing is always overlooked or poorly done within business intelligence environments. The consequence of this is inefficient designs, significant data integrity and validation issues, as well as defects within ETL and reporting scripts. This, in turn, often leads to expensive consultants who want to start the whole process from the start again. What I want to propose is the use of an agile quality assurance method, called Test-driven development, in the business intelligence context.

Developed by Kent Beck, Test-driven development (TDD) is a agile software development process which aims to improve software quality by focusing on early and regular testing. The core premise is that automated or “unit” tests are written before each software component. So, the development process should look something like this;

TDD
  1. Write the (failing) test
  2. Run all tests (ensuring that the new test fails)
  3. Write the code
  4. Run all tests
  5. Repeat steps 3-4 until the test passes
  6. Refactor the new code to acceptable standards.
Traditional
  1. Write the code
  2. Write the test
  3. Run the test
  4. Repeat step 1 until the test passes
Usual
  1. Write the code

This works in a software engineering context because features can be easily componentised (I may have just made that word up) with a specific and consistent output which makes unit testing simple (in theory anyway). The question becomes, how do we apply this methodology in a BI context.

Firstly we need to define our test scope, which at it's simplest is an ETL script, which given a known input will output data that is reasonable and within expected norms. This scope can include;

  • The number of results – e.g. we know there are 1,000 records in the data source which change daily, so our ETL should extract approximately 1000 records each day.
  • The scale of results – e.g. financial transactions must be between $1 and $1000
  • The relationship of results – e.g. the average or standard deviation should be similar each day
  • Logical tests – e.g. the sum of all transactions for a customer should equal the customers balance
  • Known good tests – e.g. customer 123 exists and has a known form
  • Connection tests – e.g. that the database is available and can be connected to.
  • Relationship tests – e.g. that each transaction has an associated customer

Under the TDD methodology, these tests should be written before any ETL scripts (though usually after the model/schema has been designed). It has been shown that this encourages developers to create simpler designs and inspires confidence in their work (Beck: Test-Driven Development by Example, Addison Wesley, 2003)

As my final comment, to get the largest benefit out of unit testing and to ensure a robust testing process, all tests should be setup and run through good continuous integration environment such as DbFit, Cruise Control or Bamboo. This will automatically run your tests on a scheduled (or triggered) basis and alert the BI administrators of any failing tests, which can forewarn of issues in the data sources and data warehouse.

Attribution

Image is CC BY - Z17R0 (Flickr)