Data Modeling Made Simple and Data Quality

So, over the past week or so, while I was at VSLive!, I was also trying to get a few more technical books read in the hope that they will help me out with my Business Intelligence / Data Warehousing projects.  I had recently gotten through The Data Warehouse ETL Toolkit by Kimball and Caserta, and in it was a pretty darn good chapter on cleansing and conforming data that the authors said had a strong basis in a few books, one of which was Data Quality: The Accuracy Dimension; the other escapes me at the moment (something by Larry English, I believe).  During this time I have also been looking at picking up some Data Modeling and Meta Data Model books, and was fortunate enough to get Data Modeling Made Simple by Steve Hoberman for my birthday (May 10th).  It had been my intention to read through both books at the conference (and yes, I still have to finish the SQL Querying book… I’ll get to it soon), but suffice it to say that just didn’t happen.

I was able to make it through the Hoberman book, as it came in at a light 140 pages or so.  This book was written as an introduction, geared more toward a non-tech-savvy business analyst or someone without a good grasp of modeling concepts (I probably should have guessed this from the title).  Even so, I was still able to pick up a few hints, tips and tricks from the quick read-through.  I would highly recommend it for its intended audience: those with little to no working experience with data modeling, and/or those who are on the fringes of such activities but don’t really have to do the dirty work themselves.

Now I will try to make it through the Data Quality book… I am currently just past the first part, wherein the authors describe what bad data is and how it might come to be.  I suppose next I’ll be learning about how to correct it 🙂

Anyway, this is all coming about because I am trying to figure out the best practices for use at work.  More than that, I am trying to find a good way to work with an aging mainframe system that has seen better days.  I believe that this is a project which will show quite a few inadequacies, although I am not 100% certain of this (so far I have only found half a dozen or so errors in the way things are or have been handled in the past).  It seems that my development for this has come to a near halt as I dig deeper and deeper into the subject and realize how far I really have to go.  Today, I spent a good portion of my time writing up a naming standards and best practices document for the project (and I’m the only one on the project!).  Over the next week I will have to go back and refactor a lot of the code that I had already written to abide by these standards… But, as far as I can tell, I’ll be better off doing this now than later.  Next I will have to determine how best to set up the meta data model and begin to populate it, as well as start investigating how we will score the system as a whole and, more importantly, the data that I am retrieving from the mainframe.  All in all, I’m not sure how quickly I’ll be able to advance through this project, and as a result I’m not sure how willing management will be to do it right, or if they will instead want to "see results" (meaning get it done now, rather than correctly).  Hopefully this will all work itself out…

Here is my review of The Data Warehouse ETL Toolkit (which I also posted to Amazon):

In my estimation The Data Warehouse ETL Toolkit is a good source of
information for the topic that covers the majority of your Data
Warehouse efforts, the ETL process (or ECCD if you prefer, which you
probably will after finishing this volume). I took away some good ideas
on items that I probably would not have considered, mostly due to my
own ignorance, relating to Meta Data, QA and Error Corrections, Data
Lineage and Scoring, etc.

The authors (Kimball and Caserta) do a good job of pointing out
other books on topics that the reader will probably want to look
at in depth.

There is also a pretty good section explaining how to manage your
ETL project, the different roles of people who should be involved and a
pretty good project plan / checklist to use as you are getting started.

My only complaint is that I did not read this prior to starting my
own project and am instead having to correct items as I try to
implement these best practices.

