It is said that a fair majority of ETL processing time is spent dealing with indexes. So, how do we decide when to drop an index and then recreate it after the table is loaded? Do you maintain these indexes manually? And what about when foreign key constraints require that an index stay in place?
My first attempt at this has been to dynamically generate the index DDL from the system tables. I know that if I manage the indexes manually, indexes will be added or dropped in the database without those changes ever making their way down into the ETL packages. If there are foreign key constraints, you will probably want to disable them before touching the indexes (or leave the indexes that the foreign keys reference intact).
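The shape of that generation step can be sketched in Python. The metadata layout below is hypothetical (in SQL Server you would pull it from catalog views like sys.indexes and sys.index_columns), but the idea is the same: script out a matched DROP/CREATE pair for every index, and skip any index that a foreign key depends on.

```python
# Sketch: build DROP/CREATE statements from index metadata rows.
# The dict shape is an assumption -- adapt it to your catalog query.

def script_index_rebuild(indexes, fk_referenced):
    """indexes: list of dicts with name, table, columns, unique.
    fk_referenced: names of indexes that back foreign keys and must stay."""
    drops, creates = [], []
    for ix in indexes:
        if ix["name"] in fk_referenced:
            continue  # leave FK-backing indexes intact
        drops.append(f'DROP INDEX {ix["name"]} ON {ix["table"]};')
        unique = "UNIQUE " if ix["unique"] else ""
        cols = ", ".join(ix["columns"])
        creates.append(
            f'CREATE {unique}INDEX {ix["name"]} ON {ix["table"]} ({cols});'
        )
    return drops, creates

# Hypothetical metadata for illustration only.
indexes = [
    {"name": "IX_Claim_Date", "table": "dbo.Claim",
     "columns": ["ClaimDate"], "unique": False},
    {"name": "UX_Claim_Nbr", "table": "dbo.Claim",
     "columns": ["ClaimNbr"], "unique": True},
]
drops, creates = script_index_rebuild(indexes, fk_referenced={"UX_Claim_Nbr"})
```

Because the DDL is generated at run time, an index someone adds next month is picked up automatically instead of silently surviving the load with stale pages.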
After gathering the information about the indexes themselves, I determine the approximate number of changed or deleted rows. I do this by loading the deltas (changed records) into a schema separate from the full table. If the delta count is more than 5% of the records in the destination table, I go ahead and spend the time dropping and rebuilding the indexes. If not, it is probably less costly to run the inserts and updates with the indexes intact.
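That decision boils down to a one-line comparison. A minimal sketch, using the 5% rule of thumb from above (the right threshold will vary by workload, so treat it as a starting point):

```python
def should_rebuild_indexes(delta_rows, destination_rows, threshold=0.05):
    """Drop and rebuild indexes only when the delta is a large enough
    share of the destination table; otherwise load with indexes intact."""
    if destination_rows == 0:
        return True  # empty target: rebuild cost is trivial
    return delta_rows / destination_rows > threshold

should_rebuild_indexes(70_000, 100_000_000)      # -> False: keep indexes
should_rebuild_indexes(10_000_000, 100_000_000)  # -> True: drop first
```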
All of this came up this week because I am working on the second phase of a project that joins a few of our source systems together and conforms the records into one version to pass off to a claims tracking system. In the first phase, the source system was such that any incremental load would cost far more time than a simple truncate and reload (the tables have no audit columns). For this second database, however, we have the ability to load incrementally. In fact, given the size of the system, it is a requirement.
While processing one of our larger tables, I noticed that the majority of the time was spent first dropping and then recreating the indexes. After investigating a little further, it was obvious that I had forgotten to put one of my checks into the package: I was only processing about 70k rows, but the table itself has over 100 million, and I was still running the index procedures. There was no threshold check in place. Another issue I had not really thought about in the past was the difference between a count(*) on a table and a query against the system tables to determine the number of records. The count(*) took about two minutes to complete, while the call to the system tables returned in under one second. Same result either way, but valuable time saved while dealing with the audit trail. In all, I brought the time to process these records down from about 7 minutes to under 20 seconds. Not to mention that a truncate and full reload would take about an hour.
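For the record-count comparison, assuming SQL Server, the fast path reads the per-partition row counts the engine already tracks in sys.dm_db_partition_stats instead of scanning the table. A sketch of the two queries, wrapped in Python string builders purely for illustration:

```python
# Two ways to count rows in a table, assuming SQL Server.
# COUNT(*) scans the table; the DMV query reads stored metadata and
# returns in milliseconds even on a 100M-row table.

def fast_rowcount_sql(table_name):
    """Metadata-based count; index_id 0 = heap, 1 = clustered index."""
    return (
        "SELECT SUM(row_count) FROM sys.dm_db_partition_stats "
        f"WHERE object_id = OBJECT_ID('{table_name}') "
        "AND index_id IN (0, 1);"
    )

def exact_rowcount_sql(table_name):
    """Full scan -- the slow two-minute path on a large table."""
    return f"SELECT COUNT(*) FROM {table_name};"
```

One caveat worth knowing: the DMV reflects committed rows and is kept current by the engine, but older metadata sources such as sysindexes.rows can drift, so check which one your version maintains accurately.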
So, the moral of the story is: manipulate your indexes wisely, and make sure that the work you are doing with your sources and destinations is both efficient and necessary.