Continuous Integration in Business Intelligence

"... continuous integration (CI) implements continuous processes of applying quality control — small pieces of effort, applied frequently" - Wikipedia

In my blog post on test-driven development I discuss the benefits of writing automated unit tests for business intelligence systems. If you are unfamiliar with the concept of unit tests, please read that post first, or one of the many articles online or on Wikipedia on the subject.

Continuous integration is the capability to automatically and regularly run all unit tests across the entire data warehouse. This alerts the data warehouse administrators to any failing tests, which can in turn forewarn of issues in the data sources or in the data warehouse itself. Integrating this with a version control system (VCS), such as Subversion, is even better. Using a repository to store all the DDL, ETL, and report code gives you a complete, searchable and linked history of the data warehouse. A VCS will also tell you who changed what, and when. With a CI environment, each commit to the VCS automatically triggers a run of the unit tests, so developers know immediately if they have introduced any defects.
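As a rough illustration of the kind of test a CI server could run on every commit, here is a minimal Python sketch. Everything in it is an assumption made for the example: the tables dim_customer and fact_sales, the two checks, and the use of an in-memory SQLite database in place of a real data warehouse.

```python
import sqlite3


def build_sample_warehouse():
    """Create a tiny in-memory stand-in for a warehouse (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, customer_key INTEGER, amount REAL);
        INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
        INSERT INTO fact_sales VALUES (10, 1, 99.5), (11, 2, 20.0);
        """
    )
    return conn


def count_orphaned_sales(conn):
    """Count fact rows whose customer_key has no matching dimension row."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM fact_sales f
        LEFT JOIN dim_customer d ON d.customer_key = f.customer_key
        WHERE d.customer_key IS NULL
        """
    ).fetchone()
    return row[0]


def count_negative_amounts(conn):
    """Count fact rows with a negative sale amount."""
    return conn.execute(
        "SELECT COUNT(*) FROM fact_sales WHERE amount < 0"
    ).fetchone()[0]


def run_checks(conn):
    """Run every check and report pass/fail by name."""
    return {
        "no_orphaned_sales": count_orphaned_sales(conn) == 0,
        "no_negative_amounts": count_negative_amounts(conn) == 0,
    }


def exit_code(results):
    """Return 0 if all checks passed, 1 otherwise."""
    return 0 if all(results.values()) else 1


if __name__ == "__main__":
    results = run_checks(build_sample_warehouse())
    for name, passed in results.items():
        print(("PASS" if passed else "FAIL"), name)
```

A CI tool would typically treat a non-zero exit status as a failed build, so a wrapper script could pass the value of exit_code(results) to sys.exit. In practice the connection would point at a test copy of the warehouse rather than at sample data.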

Another BI function which I use continuous integration for is the automatic and regular generation of user and system documentation. Assuming the data dictionary is kept up to date each time the data source or ETL is modified, we can use this to automatically generate the data warehouse help files and user manuals. With sufficient planning we could also generate the data dictionary automatically from appropriate markup in the ETL scripts themselves.
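To make the idea of generating the data dictionary from markup concrete, here is a small Python sketch. The markup convention is purely an assumption for illustration (a SQL comment of the form "-- @column table.column: description"), as are the function names and the sample ETL script.

```python
import re

# Hypothetical markup: "-- @column <table>.<column>: <description>"
DOC_PATTERN = re.compile(r"^\s*--\s*@column\s+(\w+)\.(\w+)\s*:\s*(.+?)\s*$")


def extract_data_dictionary(etl_sql):
    """Collect column descriptions from annotated comments in an ETL script."""
    dictionary = {}
    for line in etl_sql.splitlines():
        match = DOC_PATTERN.match(line)
        if match:
            table, column, description = match.groups()
            dictionary.setdefault(table, {})[column] = description
    return dictionary


def render_dictionary(dictionary):
    """Render the dictionary as plain text, tables and columns sorted by name."""
    lines = []
    for table in sorted(dictionary):
        lines.append(table)
        for column in sorted(dictionary[table]):
            lines.append("  {}: {}".format(column, dictionary[table][column]))
    return "\n".join(lines)


SAMPLE_ETL = """\
-- @column fact_sales.amount: Sale value in local currency, tax included
-- @column fact_sales.customer_key: Surrogate key into dim_customer
INSERT INTO fact_sales (customer_key, amount)
SELECT customer_key, amount FROM staging_sales;
"""

if __name__ == "__main__":
    print(render_dictionary(extract_data_dictionary(SAMPLE_ETL)))
```

A scheduled CI job could run something like this over every ETL script in the repository and publish the result alongside the help files, so the documentation is rebuilt whenever the code changes.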

Lastly, the CI environment can control, automate and maintain the build and deployment processes between your development, test and production environments. Continuous integration systems provide other capabilities, such as code coverage and code standards, which are less useful in a BI context.

Examples of CI tools include:

  • CruiseControl (Open Source)
  • Bamboo (Proprietary, integrated with Jira)
  • Team Foundation Server (Proprietary, Microsoft)

There are several benefits to using a continuous integration system in a BI context. If you are writing unit tests, it will improve the management and execution of these tests. The tests help identify new issues early and, because all tests are run regularly, they can also reveal issues that the latest changes introduce in older ETL code. Lastly, as a general rule, appropriate automation of repetitive tasks frees your BI staff for higher-level tasks such as information analysis.

However, as with all things, there are disadvantages as well. There is an overhead to this process, as time needs to be invested in developing the unit tests and maintaining the CI environment. This is also an ongoing process: if the tests and documentation are not kept up to date, the original investment of time and effort is wasted. Still, as the data warehouse becomes more complex, this upfront cost yields significant long-term savings; the investment in testing reduces the time spent debugging and enhancing.

Attribution

Image is CC BY-NC-ND - Jim Moran