11 Questions to Ask Yourself Before ETLing a New Data Source
byProper analysis and integration of foreign data sources into a Business Intelligence system can be a time consuming and complicated process. For each new data source I always recommend that a brief business case, made up of the following 11 questions, be put together. This helps prioritise the ETLing and identify problem areas for each source prior to beginning the process.
- What quantity of data is being dealt with? - e.g. a CMS with 1,000 records or a financial system with 100,000,000 records
- How quickly does the data grow? - A CMS which adds 100 new pages a day will be easier to manage than a financial management system which add 1,000,000 records
- How frequently does this data change? - Compare the financial system in which the data never changes (and so there is no need to manage data revisions) to a CRM in which the data changes daily
- When does this data change? - We can optimise the ETL process if we know the data is updated at the end of the month, rather than try and ETL the same data every day.
- How much of this data is duplicated elsewhere? - If the data is duplicated in another system, and the other system is authoritative, we can simplify the ETL process.
- How much of this data is obsolete or irrelevant? - This comes down to GIGO. If the data is obsolete or not relevant to current decision making, including it in the data warehouse would be counter productive
- How is this data used? - This question is used to inform the necessity/benefit of the data and should be used as part of the cost-benefit analysis.
- What reports currently use this data, and are they satisfactory? - Often the transactional system (e.g. finance system) provides it's own reports. If these are satisfactory and do not need to strength of a BI system behind it, it de-prioritises the data source for extraction.
- What access is available to the data, and how can we extract it? - This is a purely technical question which informs the effort involved in the ETL process. Is it a database, can we connect to it directly, via ODBC, do we need software engineers to write extraction scripts etc?
- Is external or third party permission required before extracting this data, and how much will that cost? - Sometimes there will be additional license costs to extract the data. This is used as part of the cost-benefit analysis of the data source.
- Does the organisation have the skills in-house to perform the analysis and extraction or will we have to out-source? - Complex data sources may need specialised skillsets. This will generally increase the cost of extraction and is used as part of the cost-benefit analysis.