Humble Trader

Sunday, January 01, 2006

Database Infrastructure Design Part 1

Introduction:

You can't just throw this stuff together. Before setting up your database(s) you need some idea of what the overall system is going to look like.

Aim:

Design a high-level Data Warehouse infrastructure that will keep our ETL jobs and data in order.

Requirements:

None, really. You can design this stuff without ever deploying it, I guess.

Discussion:

I have seen a lot of Data Warehouse infrastructures. Some have been great steaming piles of poo (a certain large Australian bank comes immediately to mind) and some are very good (my client at the time of writing this has REALLY got it's shit together). The structure I am proposing here is just that; a proposal - and therefore an opinion. You are, of course, free to implement whatever you like. You should keep in mind the following though:

  • Never, ever, trust a source system:
    • I have written interfaces for just about every type of system you can think of and they are all crap. Even the good ones were just a little less crappier than the really crap ones. Examples of these are:
      • Mainframe applications that allow the user to enter control codes into text fields.
      • Relational systems where relational integrity is only implemented in the application code - and I'm talking basic master-detail, here. And this was quite recently. Yes, people are still building this stuff.
      • Businesses where two seperate applications handle different parts of the business but don't validate order codes against each other (try putting THOSE records back together!).
      • And so it goes on.
  • 80% of the time, effort and cost of a Data Warehouse goes into ETL. Anything that can manage this better is worth doing.
  • You will need some sort of audit trail. Sooner or later a bean-counter is going to ask how a particular value was arrived at. You need to have a good answer.
  • Time-lag. Target records can be built up from data from a number of sources. Sometimes these don't arrive on time so you have to hold up processing until the sources are complete.
  • ETL scheduling should, as far as possible, be dependency-driven.
Sooo, with this in mind, here is a logical database model (click the picture to enlarge it):



(Note: The TLAs in brackets in the following discussion are the names assigned to the components when they are implemented)

External:

Despite all our efforts data, not subject to our stringent quality criteria, will soon contaminate our pristine system. Managers call this 'source system data'. We call it poo.

There isn't much we can do about this, however. Sooner or later, the client will decide that, as s/he has spent a large lump of wonga on the data warehouse (and commercial systems are VERY expensive), it had better do something useful. We would rather it would just sit there looking pretty but, sadly, the client usually wins this one.

Poo comes from many places and arrives in a variety of formats. The two most common are flat-files and direct database links to source tables.

Staging Database (sta):

This is where you will spend almost all your time because this is the ETL (extract, transformation and load) engine. This is delopyed on its own Oracle instance, the main driving components house within three seperate schemas:

  • Raw Staging Schema (sta): The role of this schema is to acquire data. This takes data from its source format and loads it into Oracle tables. These present a common format to the next stage. There is usually 1 table for each source data delivery object (flatfile, external table, etc.).
  • Validated Staging Schema (val): The role of this schema is to validate incomming data. This is where the poo is turned into gold. Checks include datatypes, dates, complete records, etc. Reports of invalid data can be fed back to the source system developers to shame their blackened hearts, the sons of motherless goats.
  • Operational Data Store (ods): I really hate the name of this, mainly because it has been so abused in the past. It can also send all the wrong signals to the sort of managers who have read 'Data Warehousing for Dummies' and actually think they know something. Some people believe that building this is the point of data warehousing, others believe that it MUST be relational. Ignore them. The role of this schema is to present to the data warehouse proper a paradigm source - paradigm, in this instance, meaning 'model' or 'perfect'. We build a system that looks like the organisation we are modelling has gone out and bought one application to do everything, got it right first time, and done it with the data warehouse in mind. Had the organisation actually done this we would have to go off and scrape a living as VB code monkeys or something equally distasteful. Fortunately, in the entire history of commerce, no organisation has ever done this. Nor is one likely to. Life ain't like that. Our ODS MAY be relational in part - or it may not. It is what it needs to be and it's ours so we can make it look like whatever we want.
Warehouse Database (whs):

This is the Data Warehouse proper and is housed in its own instance (whs). Why seperate the staging and warehouse instances? Because they have very different tuning requirements and under-performing data warehouses are not what we do.

Conclusion:

Armed with this infrastructure, we can go on to build our databases.

0 Comments:

Post a Comment

<< Home