When I first started writing about ETL and data warehousing, I wondered why testing was so complicated. After all, the procedures involve making a copy, doing a cleaning, and making another copy. It just does not sound that complex. Isn't it just a really big cookie-cutter database storage device? Isn’t ETL testing just copying and checking that "before" and "after" match up? Then I started wondering about that “cleaning” part, transformation, the T in ETL. What exactly happens there? You read about the E, T and L steps, and it all sounds very generic and does not go into detail about what actually happens during transformation. And then I figured out why.
The transformation is determined by the business needs, and the business needs vary wherever you are. Your industry determines the kind of data you deal with, and how it will need to be used. A medical database will need to find details of a patient’s medical records. A retailer may need to track all individual POS sales, both to watch the financial side, and to observe the impact on inventory. A voter registration database may need to give reports of people using various filters for marketing or polling. And each of these systems has different details of what kinds of source data it needs as input and how the output data needs to be queried, all of which need to be checked for data quality before anything gets stored.
Some codes have built-in checksum digits, so that a simple mathematical calculation can ensure the validity of a code. Imagine an inventory system where some new codes are keyed in manually. At ETL time, the system could throw a warning that a UPC code is invalid (for UPC-A, all single digit mistakes can be detected). A car dealer can discover that the VIN number for a vehicle was mis-keyed. A book vendor can know not only that the ISBN number is wrong, but which character the mistake is in, if it is a single digit typo. A broker or trader can detect an invalid CUSIP of a traded security.
For more complete validation, a more complex validator API must be used. The codes must map to real things, regardless of whether or not there’s a checksum. A Social Security Number in a patient or financial record must be an issued number, and should match up with the listed name. A VIN must match to a real manufactured vehicle, instead of just having a valid checksum. The UPC must match with a real product that is the one of current interest, and at POS, the cashier ensures that someone did not just switch the stickers.
But not all fields are identifier codes. You may have a text field, where you want to ensure that only characters are limited to part of the kind listed here: upper case letters, lower case letters, digits, spaces, punctuation. You may have a field that must have a non-NULL value for the report queries to work. A date field may need to have standard format, such as MM/DD/YYYY or DD-MM-YY. A patient’s birthdate must have a non-zero month that’s less than 13, a non-zero day that less than 32 (or less, depending on the month and if it is a leap year), and be less than or equal to the system date. Perhaps a mandatory field, like a person’s name or transaction date or email, was left blank.
An individual record may be perfectly accurate, but there needs to be a change to the columns. You may want to change the field order before putting things into storage. You may need ensure that certain columns are strings of a specific length, so spaces may need to be padded onto the end. You may need to standardize a field that can have multiple values meaning the same thing, such as TRUE, YES, Y, T, and 1, which may all translate to a positive response. Perhaps “unit count” and “price per unit” fields go away, to be replaced by “total spent”. Columns may need to be converted into CSV’s, or vice versa.
You may need to add, delete, or merge records before entry. You may need to deduplicate entries, so that you don’t end up mailing multiple catalogs to the same person who just happened to place more than one order. You may want to aggregate records (tracking daily sales or by-customer-per-month sales instead of individual sales, say). Going the other direction, records with a CSV webcart field may need to be split apart, so a separate record can be made for each individual item sale, for a database tracking individual item sales instead of just a single checkout record. In fact, the business need may do a combination that aggregates, splits and dedups.
Maybe business requires the new data set to be sorted by a particular field before the load, to cut down on sorting during report generation.
There is a lot more to ETL than just copying and storing. Transformation is an important factor in determining how ETL needs to work for you.
Logo | Title | Testing Objectives | Since | Price Quotes | Ping |
---|---|---|---|---|---|
QuerySurge | DataBase testing, Big Data Testing, Data Interface Testing, Data Migration Testing, Data Warehouse Testing, Database Upgrade Testing, ETL Testing | 2,004 | Active | ||
TestBench | DataBase testing, Data Migration Testing, Data Warehouse Testing, Database Upgrade Testing | 2,004 | Active | ||
Commercial Tools |
Theme by Danetsoft and Danang Probo Sayekti