More Than Just Databases

To often I have seen organisations build a data warehouse and populate it with data from 2-3 corporate databases. Is this valuable, probably, but could we improve the value by combining it with data from outside the organisation, definitely. Consider what a data warehouse actually is – I would describe it as “an accessible repository of valuable data from disparate sources”. The keyword here is disparate. So before reading on, have a think about where you would usefully extract data from.

Potential sources

Potential Sources
WhatExamplesBenefitExtractionComplexitySkills Required
Social NetworksFacebook, TwitterConsumer trends, informal feedbackThese services usually provide an API through which to search and extract data.Hard (mostly qualitative data)Software Engineers
Web ServicesBoM, Stock PricesExternal quantitative data, which can have specific relationships to your dataThough a standardised (w3c) web serviceModerateSoftware Engineers or ETL app
DocumentsWord or Powerpoint Documents, Annual ReportsLow benefit, only useful if the data cannot be found anywhere elsePHPDocX, JavaDocX, MSFilesConverter, Apache PoiHard (mostly qualitative data)Software Engineers
WebsitesAustralia Post, Government Websites, Research Papers, Annual Reports, ForumsBest for quantitative data located in tables. Contact details, addresses, grants and funding informationBixo, DeiXTO, TESS, Beautiful Soup, TagSoupHardSoftware Engineers
EmailsInfo@, sales@, helpdesk@Customer feedback, product trends, country of origin informationIMAP, POP protocolsHard (mostly qualitative data, except for email headers)Software Engineers
Log FilesApache/IIS Webserver logs, LDAP/Active Directory logsCountry of origin, product trends, lead conversionHighly structured formats, easy to extract fromModerateSoftware Engineers or ETL app
DatabasesHR, CRM, CMS, Workload managementSpecific organisational informationSQL or ETL appEasyETL app
SpreadsheetsFinancial spreadsheets, Risk/Issues logsSpecific business unit informationSQL (can be treated as a single database table) or ETL appModerate (Easy to extract from, Hard to know the source of truth)ETL app
Source Code RepositoriesCVS, SVN, GitWorkload management, defect management, issue resolution ratesApplication SpecificHard (No standard API)Software Engineers

Cost v Benefit

There is one major limitation in everything I have discussed above. Cost. It is critical that any new data source goes through a formal return on investment or benefits analysis review.

Is the value of the information worth the effort undertaken to analyse and ETL it?

Quite often the answer is no. Word documents and emails are a great example of high complexity (thus high cost) data sources with minimal value return. Excel spreadsheets and log files are often the exact opposite, low cost and high return.

Websites are always the interesting case; in general the more standardised a website is, the lower the cost and higher the value. Good examples include the Australia post website (for postcode and post office information) and the Australian government online directory (for client contact details)


Image is CC BY - David Fulmer (via Flickr)