More Than Just Databases

Article by Evan Leybourn on 04 Feb 2011 2 minutes read

To often I have seen organisations build a data warehouse and populate it with data from 2-3 corporate databases. Is this valuable, probably, but could we improve the value by combining it with data from outside the organisation, definitely. Consider what a data warehouse actually is – I would describe it as “an accessible repository of valuable data from disparate sources”. The keyword here is disparate. So before reading on, have a think about where you would usefully extract data from.

Potential sources

Potential Sources
What	Examples	Benefit	Extraction	Complexity	Skills Required
Social Networks	Facebook, Twitter	Consumer trends, informal feedback	These services usually provide an API through which to search and extract data.	Hard (mostly qualitative data)	Software Engineers
Web Services	BoM, Stock Prices	External quantitative data, which can have specific relationships to your data	Though a standardised (w3c) web service	Moderate	Software Engineers or ETL app
Documents	Word or Powerpoint Documents, Annual Reports	Low benefit, only useful if the data cannot be found anywhere else	PHPDocX, JavaDocX, MSFilesConverter, Apache Poi	Hard (mostly qualitative data)	Software Engineers
Websites	Australia Post, Government Websites, Research Papers, Annual Reports, Forums	Best for quantitative data located in tables. Contact details, addresses, grants and funding information	Bixo, DeiXTO, TESS, Beautiful Soup, TagSoup	Hard	Software Engineers
Emails	Info@, sales@, helpdesk@	Customer feedback, product trends, country of origin information	IMAP, POP protocols	Hard (mostly qualitative data, except for email headers)	Software Engineers
Log Files	Apache/IIS Webserver logs, LDAP/Active Directory logs	Country of origin, product trends, lead conversion	Highly structured formats, easy to extract from	Moderate	Software Engineers or ETL app
Databases	HR, CRM, CMS, Workload management	Specific organisational information	SQL or ETL app	Easy	ETL app
Spreadsheets	Financial spreadsheets, Risk/Issues logs	Specific business unit information	SQL (can be treated as a single database table) or ETL app	Moderate (Easy to extract from, Hard to know the source of truth)	ETL app
Source Code Repositories	CVS, SVN, Git	Workload management, defect management, issue resolution rates	Application Specific	Hard (No standard API)	Software Engineers

Cost v Benefit

There is one major limitation in everything I have discussed above. Cost. It is critical that any new data source goes through a formal return on investment or benefits analysis review.

Is the value of the information worth the effort undertaken to analyse and ETL it?

Quite often the answer is no. Word documents and emails are a great example of high complexity (thus high cost) data sources with minimal value return. Excel spreadsheets and log files are often the exact opposite, low cost and high return.

Websites are always the interesting case; in general the more standardised a website is, the lower the cost and higher the value. Good examples include the Australia post website (for postcode and post office information) and the Australian government online directory (for client contact details)

Attribution

Image is CC BY - David Fulmer (via Flickr)