Twitter

Follow SQLQuill on Twitter

Subscribe

MN Pass

BINARY_CHECKSUM and CHECKSUM_AGG – Combo platter

Recently I got an email from a tester claiming that the IDs of values in static data tables are different in one database than they are in another database.  To check this out, I used a combo of the HandsBINARY_CHECKSUM and CHECKSUM_AGG.  Here’s a breakdown of each:
  • The BINARY_CHECKSUM function returns a checksum value for a row of a table or over a list of expressions.
    • A value is computed for each row, this makes it easy to find rows that are different.
    • This function returns a group of values.
    • Will return a different value for most, but not all, changes to the row, and can be used to detect most row modifications.
  • The CHECKSUM_AGG function returns a checksum of the values in a group.
    • If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change.

So, I figured that I could first find the BINARY_CHECKSUM of each row and then find the CHECKSUM_AGG of all the BINARY_CHECKSUMs. If the tables were different, most of the time (as noted in bold above) these values would be different.

There are two situations that I find this very useful for (I’ll post about these later):

  1. Static data  – data that needs to be the same from one environment to another
  2. Data migration – verifying all the columns get migrated correctly.

Note – there are a few data types that BINARY_CHECKSUM does not support.  They are text, ntext, image, cursor, xml, and noncomparable CLR user-defined types.  In my experience, some of the time update dates and insert dates are not important as well as timestamp data types, so I usually ignore these columns.

So to investigate the tester’s claims, I ran the following on each database:

Select CHECKSUM_AGG(BINARY_CHECKSUM(Col1, Col2, Col3)) from Table1

The values were different, so I removed Col1 from the criteria and ran the following on each database and the values were the same.

Select CHECKSUM_AGG(BINARY_CHECKSUM(Col2, Col3)) from Table1

Therefore, Col2 and Col3 had the same values for both Col2 and Col3 **most likely** but Col1 definitely had different values.  The tester was right.

I then removed the CHECKSUM_AGG to find and then compare the values of each row to find the ones that didn’t match.

Select Col1,
BINARY_CHECKSUM(Col1, Col2, Col3)
from Table1
Order by Col1

Here’s an example that shows that this method doesn’t always work – note that the same distinct values are in each table, but each row is not the same.

CREATE TABLE [dbo].[table1](
[col1] [int],
[col2] [varchar](200))

CREATE TABLE [dbo].[table2](
[col1] [int],
[col2] [varchar](200))

Insert table1 Values
(1, 'Row #1 in Table1, Row #2 in Table2'),
(2, 'Row #2 in Table1, Row #1 in Table2')

Insert table2 Values
(2, 'Row #1 in Table1, Row #2 in Table2'),
(1, 'Row #2 in Table1, Row #1 in Table2')

---These are equal
Select CHECKSUM_AGG(BINARY_CHECKSUM(col1, col2)) from table1
Select CHECKSUM_AGG(BINARY_CHECKSUM(col1, col2)) from table2

---These are NOT equal
Select BINARY_CHECKSUM(col1, col2) from table1
Select BINARY_CHECKSUM(col1, col2) from table2

Despite the small chance of a checksum being equal, this method makes it very easy to quickly identify data inconsistencies in different databases or environments.  If you’re ever need to quickly find inconsistent data, give it a try.

The Official SQLFAQ High Level Overview of Data Migration Testing

I’ve been tasked with the SQL portion of our data migration and conversion testing for our big release we’ve got coming up. What’s that mean?  Well, we’ve got an upgrade coming to the application we support.  We’ve got 15 databases on 7 servers.  All of them have database schema changes coming, data moving around from one table to another, new tables getting created, old ones getting dropped, new columns populated from other databases, etc.  I’ve got to make sure all the data that moves makes it there and is correct.  Here was how I attacked it:

The High Level Overview:

  1. Script production databases from our log shipping servers
  2. Create two copies of each database on a test server(s)
  3. Upgrade one database of each set
  4. Compare schema on Pre and Post upgraded databases
  5. List out all the diffences
  6. Catagorize how each new column or table is populated
  7. Ignore the new tables that are populated on use or populated with defaults – no data migration
  8. Create Checksum and row count scripts for all the migrated data
  9. Run the scripts and record the results on our Prod-Stage environment before the upgrade
  10. Upgrade Prod-Stage databases
  11. Run the scripts and compare the results to the Pre-upgrade.
  12. Address any issues

This approached seemed to work well, so my plan is to write a post on a bunch of these individual steps, so stay tuned.