Image CAPTCHA
Enter the characters shown in the image.

You are here

Transformation is the Differentiator for ETL

When I first started writing about ETL and data warehousing, I wondered why testing was so complicated.  After all, the procedures involve making a copy, doing a cleaning, and making another copy.  It just does not sound that complex.  Isn't it just a really big cookie-cutter database storage device?  Isn’t ETL testing just copying and checking that "before" and "after" match up?  Then I started wondering about that “cleaning” part, transformation, the T in ETL.  What exactly happens there?  You read about the E, T and L steps, and it all sounds very generic and does not go into detail about what actually happens during transformation.  And then I figured out why.

The transformation is determined by the business needs, and the business needs vary wherever you are.  Your industry determines the kind of data you deal with, and how it will need to be used.  A medical database will need to find details of a patient’s medical records.  A retailer may need to track all individual POS sales, both to watch the financial side, and to observe the impact on inventory.  A voter registration database may need to give reports of people using various filters for marketing or polling.  And each of these systems has different details of what kinds of source data it needs as input and how the output data needs to be queried, all of which need to be checked for data quality before anything gets stored.

Some codes have built-in checksum digits, so that a simple mathematical calculation can ensure the validity of a code.  Imagine an inventory system where some new codes are keyed in manually.  At ETL time, the system could throw a warning that a UPC code is invalid (for UPC-A, all single digit mistakes can be detected).  A car dealer can discover that the VIN number for a vehicle was mis-keyed.  A book vendor can know not only that the ISBN number is wrong, but which character the mistake is in, if it is a single digit typo.  A broker or trader can detect an invalid CUSIP of a traded security.

For more complete validation, a more complex validator API must be used.  The codes must map to real things, regardless of whether or not there’s a checksum.  A Social Security Number in a patient or financial record must be an issued number, and should match up with the listed name.  A VIN must match to a real manufactured vehicle, instead of just having a valid checksum.  The UPC must match with a real product that is the one of current interest, and at POS, the cashier ensures that someone did not just switch the stickers.

But not all fields are identifier codes.  You may have a text field, where you want to ensure that only characters are limited to part of the kind listed here: upper case letters, lower case letters, digits, spaces, punctuation.  You may have a field that must have a non-NULL value for the report queries to work.  A date field may need to have standard format, such as MM/DD/YYYY or DD-MM-YY.  A patient’s birthdate must have a non-zero month that’s less than 13, a non-zero day that less than 32 (or less, depending on the month and if it is a leap year), and be less than or equal to the system date.  Perhaps a mandatory field, like a person’s name or transaction date or email, was left blank.

An individual record may be perfectly accurate, but there needs to be a change to the columns.  You may want to change the field order before putting things into storage.  You may need ensure that certain columns are strings of a specific length, so spaces may need to be padded onto the end.  You may need to standardize a field that can have multiple values meaning the same thing, such as TRUE, YES, Y, T, and 1, which may all translate to a positive response.  Perhaps “unit count” and “price per unit” fields go away, to be replaced by “total spent”.  Columns may need to be converted into CSV’s, or vice versa.

You may need to add, delete, or merge records before entry.  You may need to deduplicate entries, so that you don’t end up mailing multiple catalogs to the same person who just happened to place more than one order.  You may want to aggregate records (tracking daily sales or by-customer-per-month sales instead of individual sales, say).  Going the other direction, records with a CSV webcart field may need to be split apart, so a separate record can be made for each individual item sale, for a database tracking individual item sales instead of just a single checkout record.  In fact, the business need may do a combination that aggregates, splits and dedups.

Maybe business requires the new data set to be sorted by a particular field before the load, to cut down on sorting during report generation.

There is a lot more to ETL than just copying and storing.  Transformation is an important factor in determining how ETL needs to work for you.

yes

About The Author: 

Scott Andery is an expert marketer and author who specialize in software testing tools and resources.


 

MANUFACTURERS Wallboard

Testing tool manufacturers world-wide list
10Levels ABID CONSULTING AccelQ Accord Software ActiMind AdaCore
AdaLog AgileLoad AgileWay Agitar Algorismi ALL4TEC
Andreas Kleffel Android Apache Apica Apollo Systems AppAssist.mobi
Applitools AppPerfect Appsee ApTest Assertible Assure
Atlassian AutoIt Consulti .. Automation Anyw .. Automation Cons .. Axosoft Aztaz Software
Backtrace I/O Badboy BlazeMeter Borvid BrowserStack BSQUARE
BStriker Intern .. CA Technologies Canonical Canoo Engineeri .. Catch Software CelestialTeapot
Chris Mallett Cleanscape ClicTest CloudQA Codeborne CodeCentrix
CodePlex projec .. Codoid Cogitek Compuware Configure IT Conflair
ConSol Core Services Coronys Ltd Countersoft CresTech Softwa .. CrossBrowserTes ..
Crosscheck Netw .. Crowdsourced Te .. Cucumber Ltd Cyara Cygnet Infotech DareBoost
Databene Datamatics Glob .. DevExpress DTM soft Dynatrace LLC EasyQA
Eclipse EkaTechserv Elvior Emmanuel Jorge Empirix EPAM Systems
Equafy Esterel Technol .. eXept Software .. Experitest Finaris Froglogic
FrontEndART Ltd GeneXus GitHub project gnoso Google Code Pro .. GrammaTech
Gurock Software HelpSystems HENIX Hewlett Packard .. Hexawise High-Tech Bridg ..
Hiptest Hitex IBM Rational imbus Shanghai Impetus Inflectra
informUp InTENSO - IT Ex .. Ipswitch Jamo Solutions Janova JAR Technologie ..
JBoss Developer jClarity Jellly.io JetBrains Jively jQuery foundati ..
JS Foundation Jspresso Kanoah KMS Technology Kualitee LDRA Limited
Litmus LoadFocus Loadster Perfor .. MarathonITE Marketcircle Marketcircle
Maveryx Meliora Ltd Micro Focus Sof .. Microsoft Mobile Labs Mobile1st
Mockaroo, LLC Monkop Mozila MSys Technologi .. Navicat NeoTys
Neowise Softwar .. NetCart NORIZZK.COM Novosync Mobili .. NRG Global NTT Resonant
OC Systems Odin Technology OpCord Oracle Orcanos Original Softwa ..
OW2 PANAYA Parasoft PassMark Patterson Consu .. Perfecto Mobile
Pivotal, Inc. Plutora Postman (API To .. PractiTest PrimaTest Process One
Programming Res .. Psoda PureLoad PushToTest Python Q-Assurance
QA Systems QACube QASymphony QAWorks QMetry Quali
Qualitia Softwa .. Quality First S .. Quotium RadView Softwar .. Ranorex RedLine13
Reflective Solu .. ReQtest RevDeBug Robotium Tech Rogue Wave Soft .. Rommana Softwar ..
RTTS ruby-doc.org Runscope Sandklef GNU La .. Sauce Labs Seapine Softwar ..
SeleniumHQ Sencha Sensiple Siemens PLM Sof .. SmartBear Softw .. SmarteSoft
SOASTA SoftLogica Softomotive Softsmith Solution-Soft SonarSource
Sourceforge Spirent Technol .. SQS Software Qu .. Square Stimulus Techno .. Swifting AB
Synopsys T-komp T-Plan TechExcel TechTalk Telerik By Prog ..
Tellurium Test Collab Test Goat Test Recon TestCaseLab testCloud.de Gm ..
TestCraft Techn .. Testenium Testim.io TestingBot TestLodge Testmunk
Testomato TestOptimal TestPlant TestPro Testsigma Techn .. Testuff
The Core Bankin .. The MathWorks The Open Group thePHP.cc Thoughtbot Thoughtworks
Tigris.org Time Simulator Top-Q Trace Technolog .. TrendIC TRICENTIS
Tritusa Pty Ltd TWD Solutions P .. TypeMock Tyto Software Ubertesters UniTESK
Universal Test .. Usetrace Ltd Utrecht Univers .. Validata Group Vanamco AG Vector Software
Veracode Verifaya Corpor .. Verit VersionOne Viewlike.us Vornex Inc.
Watir.com WcfStorm Soluti .. We Are Mammoth Web Performance .. Wintask Wireshark Found ..
Worksoft Xceptance XK72 Xpand IT XQual ZAPTEST
Zeenyx Software .. Zephyr Zeta Software zutubi pty

Theme by Danetsoft and Danang Probo Sayekti