Design a table and index compression strategy.
This objective may include but is not limited to: row vs. page, update frequency, page compression implementation, compressing individual partitions
In this section we’ll take a look at the enterprise only data compression feature. We will be taking a look at some of the differences between Row and Page level compression, how different update patterns lead themselves to choosing one level of compression over the other, how it is that page compression is actually enacted and some of the bonuses of using partitions on your tables and the ability to use different compression levels for different partitions.
- MCM video on New Database Structures in SQL Server 2008.
- The first 15 minutes or so of the presentation explain this very effectively
- Data Compression: Strategy, Capacity Planning and Best Practices white paper
- In depth look at when to apply compression and some of the costs associated with both the use of compression and the effects for taking an existing table or index and applying compression
- SQL Server 2008 Internals – Chapter 07 Special Storage
- Looks closely at the format of the record for the compression as well as going into many of the other details of compression in detail.
Row vs Page
Row Compression essentially takes all fixed length data types and converts them into variable length columns. It is able to do so by using a different record structure, the column descriptor (CD) format. In this format the information necessary to point to a variable length column is reduced from two bytes in the standard format to four bits (1/2 byte). Row compression cannot be used with sparse columns and only works with the information which is stored in the regular portion of the record. This means that any information stored in the overflow or lob allocations will not be compressed. An interesting note is that zeroes or empty strings are not stored since the column metadata already contains information that the column is null or not. For more information about how the row-compressed records are actually stored (ex. short data region vs long data region) see the sql server internals book.
Page compression uses row compression as it’s first step when saving data. When a page of data is first filled, or when a page that was full hits the internal threshold for updates, the page will be evaluated for page compression. I’ll leave the bulk of this explanation for the page compression implementation below.
Page compression is best used when a table is often scanned and is not updated frequently. Row compression is best used when a table will be updated frequently enough that page compression does not make sense. Furthermore, row compression only really makes sense if the record contains fixed length columns that often do not contain a value which would use the full allotment of bytes. You can use the sp_estimate_data_compression_savings stored procedure to see what the estimated savings of using the specified compression would be.
Both row compression and page compression trade off the cost of I/O and memory saved by the compression with additional CPU costs to perform the compression and to retrieve the values. Compressed data is never stored in an uncompressed state, not even in memory, so it will incur the cost of retrieving the values each time that the data is needed. This cost is higher for page compression, although the benefits in memory and I/O are higher as well.
The article on Creating Compressed Tables and Indexes has some more good information on considerations for when to use row and page compression.
For non-heap structures, updates will cause the page level compression to be re-evaluated (after either 25 updates or when # updates / # records is >= 25%) causing a fairly high amount of overhead when the compression information is rebuilt and the page is reorganized. Thus, the update frequency, along with the frequency of scans, the estimated space savings of the row compression and the estimated space savings of page compression, should be used to help determine when to use row level compression versus page level compression. However, if a table is append only – or has a high percentage of appends vs. actual updates, the number of updates might be disregarded when considering page compression.
See the Application Workload section of the Data Compression white paper listed above.
Page Compression Implementation
It should be noted that all data stored with page compression uses row compression as above. This can be thought of as the first step of page compression. The next steps of page compression are only applied once the page is full, or once the page which already has page compression applied reaches the threshold of updates to be re-evaluated.
The second step of page compression uses column value prefix compression. With column value compression the compression information associated with the column stores the longest value that uses the common prefix, then uses a value to show how much of that value is appropriate for the current column + any additional values in that column.
The third step of page compression uses column value dictionary compression. With dictionary compression the compression information looks at binary data for all data types (so it is data type agnostic) to store repeating values in the dictionary of symbols. This dictionary compression looks at the repeating values in all columns, not just in the one like the prefix compression.
Not every page compression table has both information for the prefixes and the dictionary in the compress information (hidden) record ; if only one is applicable that is all that it stores. The data engine will only go to page compression when it crosses a cost threshold, otherwise it is stored as row-compressed. Also, only the leaf level pages of an index are compressed. The non-leaf level are, however, row-compressed. New pages allocated for a heap will not be paged compressed until the heap is rebuilt. Rebuilding a heap will force all non-clustered indexes to be rebuilt.
Compressing Individual Partitions
The compression level of a partitioned table, or partitioned index, can be set at the partition level independent of all other partitions. Thus, one partition might be set to no compression while another is set to page compression.
Recall that when switching between partitions the indexes must be the same, so the empty partition will need to be set with the same level. Also, when merging a partition, the partition will maintain the compression associated with the partition into which the data is being merged. For splitting a partition, the new partition will maintain the same compression associated with the partition from which it was split.