To often I have seen organisations build a data warehouse and populate it with data from 2-3 corporate databases. Is this valuable, probably, but could we improve the value by combining it with data from outside the organisation, definitely. Consider what a data warehouse actually is – I would describe it as “an accessible repository of valuable data from disparate sources”. The keyword here is disparate. So before reading on, have a think about where you would usefully extract data from.
|Social Networks||Facebook, Twitter||Consumer trends, informal feedback||These services usually provide an API through which to search and extract data.||Hard (mostly qualitative data)||Software Engineers|
|Web Services||BoM, Stock Prices||External quantitative data, which can have specific relationships to your data||Though a standardised (w3c) web service||Moderate||Software Engineers or ETL app|
|Documents||Word or Powerpoint Documents, Annual Reports||Low benefit, only useful if the data cannot be found anywhere else||PHPDocX, JavaDocX, MSFilesConverter, Apache Poi||Hard (mostly qualitative data)||Software Engineers|
|Websites||Australia Post, Government Websites, Research Papers, Annual Reports, Forums||Best for quantitative data located in tables. Contact details, addresses, grants and funding information||Bixo, DeiXTO, TESS, Beautiful Soup, TagSoup||Hard||Software Engineers|
|Emails||Info@, sales@, helpdesk@||Customer feedback, product trends, country of origin information||IMAP, POP protocols||Hard (mostly qualitative data, except for email headers)||Software Engineers|
|Log Files||Apache/IIS Webserver logs, LDAP/Active Directory logs||Country of origin, product trends, lead conversion||Highly structured formats, easy to extract from||Moderate||Software Engineers or ETL app|
|Databases||HR, CRM, CMS, Workload management||Specific organisational information||SQL or ETL app||Easy||ETL app|
|Spreadsheets||Financial spreadsheets, Risk/Issues logs||Specific business unit information||SQL (can be treated as a single database table) or ETL app||Moderate (Easy to extract from, Hard to know the source of truth)||ETL app|
|Source Code Repositories||CVS, SVN, Git||Workload management, defect management, issue resolution rates||Application Specific||Hard (No standard API)||Software Engineers|
Cost v Benefit
There is one major limitation in everything I have discussed above. Cost. It is critical that any new data source goes through a formal return on investment or benefits analysis review.
Is the value of the information worth the effort undertaken to analyse and ETL it?
Quite often the answer is no. Word documents and emails are a great example of high complexity (thus high cost) data sources with minimal value return. Excel spreadsheets and log files are often the exact opposite, low cost and high return.
Websites are always the interesting case; in general the more standardised a website is, the lower the cost and higher the value. Good examples include the Australia post website (for postcode and post office information) and the Australian government online directory (for client contact details)
Image is CC BY - David Fulmer (via Flickr)