Thursday, September 20, 2007

'Extract, Transform, Load' (ETL)

Contents
1 DEFINITION
1.1 Extraction
1.2 Transformation
1.3 Loading
2 OVERVIEW
3 ARCHITECTURES
3.1 Timeliness
3.2 Functions and Compatibility
3.3 Security and Compliance
4 COMMON STEPS
5 DEVELOPING APPROACHES
6 TOOLS
6.1 Enterprise ETL Tools
6.2 Premium ETL Tools
6.3 Open Source ETL Tools
7 ALPHABET SOUP
8 QUOTATIONS
8.1 Bill Inmon July 19, 2004
8.2 Claudia Imhoff April, 2005
8.3 Jill Dyché April, 2006
9 ARTICLES
10 REFERENCE
11 SEE ALSO
12 BOOK RECOMMENDATIONS
12.1 The Data Warehouse ETL Toolkit
12.1.1 Review March 22, 2005
12.1.2 Review March 3, 2005

DEFINITION
'Extract, Transform, Load' (ETL) is the technology that specializes in: 1) the Extraction of data from one or more sources; 2) the varied Transformations of the data (e.g. cleansing, reformatting, standardization, aggregation, or the application of any number of business rules); and 3) the Loading of the resulting data set into specified target systems or file formats.
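As a rough illustration of these three phases, here is a minimal sketch in Python, assuming a hypothetical delimited source file orders_raw.csv with customer_id and amount fields (none of these names come from the text above):

import csv

def extract(path):
    # Extract: read raw rows from a delimited source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: cleanse, standardize, and apply a simple business rule.
    out = []
    for row in rows:
        amount = float(row["amount"].strip() or 0)               # type conversion
        if amount <= 0:                                          # business rule: drop non-positive amounts
            continue
        out.append({
            "customer_id": row["customer_id"].strip().upper(),   # standardization
            "amount": round(amount, 2),                          # reformatting
        })
    return out

def load(rows, path):
    # Load: write the conformed data set to the target file format.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["customer_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    load(transform(extract("orders_raw.csv")), "orders_clean.csv")

The same three steps underlie the tools listed later in this article, which add metadata, data quality, and profiling capabilities around them.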

Extraction
Extraction may involve high volumes of data from a large number of heterogeneous sources, including relational, hierarchical, and object databases; files with structured or unstructured data; XML documents; web services; RFID systems; and packaged applications such as SAP or PeopleSoft. External data, often purchased from third-party sources, may also be included.
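As a hedged sketch of pulling from heterogeneous sources, the following Python fragment reads from a relational database and an XML document; the database path, query, and XML layout are illustrative assumptions rather than anything specified above:

import sqlite3
import xml.etree.ElementTree as ET

def extract_relational(db_path, query):
    # Pull rows from a relational source as plain dictionaries.
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        return [dict(row) for row in conn.execute(query)]

def extract_xml(xml_path):
    # Pull records from an XML document whose root contains <record> elements.
    root = ET.parse(xml_path).getroot()
    return [{child.tag: child.text for child in record} for record in root.findall("record")]

# Hypothetical usage:
# db_rows  = extract_relational("erp.db", "SELECT customer_id, amount FROM orders")
# xml_rows = extract_xml("partner_feed.xml")

Packaged applications, web services, and RFID feeds would each need their own connector, but the output of every extractor can be normalized to the same row-of-fields shape.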

Transformation
Transformation can range from simple format or type conversions to complex integration and logic operations. This is one of the major differentiators between ETL and both EAI (Enterprise Application Integration) and EII (Enterprise Information Integration): the latter typically get data from one place to another, but offer little in the way of transformation features.
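To make the contrast concrete, here is a small Python sketch pairing a simple format conversion with a slightly more involved integration step that merges two hypothetical feeds under a business rule (the field names and the merge rule are assumptions made for the example):

from datetime import datetime

def convert_date(value):
    # Simple format conversion: "MM/DD/YYYY" string -> ISO 8601 date string.
    return datetime.strptime(value, "%m/%d/%Y").date().isoformat()

def merge_customers(crm_rows, billing_rows):
    # Integration logic: combine two feeds keyed on customer_id,
    # preferring the CRM name and summing billed amounts.
    merged = {}
    for row in crm_rows:
        merged[row["customer_id"]] = {"name": row["name"].title(), "total_billed": 0.0}
    for row in billing_rows:
        record = merged.setdefault(row["customer_id"], {"name": None, "total_billed": 0.0})
        record["total_billed"] += float(row["amount"])
    return merged

An EAI or EII layer would typically move the two feeds as-is; the merge step is the kind of work that distinguishes an ETL transformation.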

Loading
Loading is the transfer of processed data to the target application. Targets are not limited to databases; they may include OLAP products, XML documents, packaged applications, flat files, and even desktop productivity applications such as spreadsheets. Loading can be performed in bulk or trickle modes, based on data acceptance criteria for each data element in every loading activity. A trickle load may require back-end support for vetting data on a near-real-time basis, or the continuously trickling stream of data may simply reflect an emphasis on minimizing the latency of the integration.
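A minimal Python sketch of the two loading modes, assuming a relational target with a hypothetical fact_orders table that already exists:

import sqlite3

def bulk_load(conn, rows):
    # Bulk mode: stage the full data set and commit once.
    conn.executemany(
        "INSERT INTO fact_orders (customer_id, amount) VALUES (:customer_id, :amount)",
        rows,
    )
    conn.commit()

def trickle_load(conn, row_stream):
    # Trickle mode: apply records as they arrive to keep latency low.
    for row in row_stream:
        conn.execute(
            "INSERT INTO fact_orders (customer_id, amount) VALUES (:customer_id, :amount)",
            row,
        )
        conn.commit()  # commit per record (or per micro-batch) for near-real-time visibility

Which mode is appropriate depends on the acceptance criteria and latency requirements of each target, as noted above.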

OVERVIEW
ETL is heavily used in data warehousing and business intelligence scenarios to get data from source applications into the data warehouse, but ETL practices increasingly apply to data movement, consolidation, correction, synchronization, and transformation, as well as to ongoing data import and export between OLTP systems. According to Forrester Research (2007)1, ETL has merged with data quality capabilities and is extending to subsume the features of EII and EAI to meet growing business requirements - e.g. real-time or near-real-time data integration needs. According to Rob Karel of Forrester Research, non-DW ETL integration scenarios include:
Data migrations: Legacy data migration is an operational application of ETL technology that is gaining popularity as traditional data replication software and custom code no longer provide the source system analysis, data profiling, metadata management, and data quality compliance now required. Common data migration examples include implementing a new ERP application from one or more legacy applications, upgrading to the latest version of an enterprise application, or integrating data from a merger or acquisition.
Compliance: In light of edicts like Basel II, organizations need to track and audit the movement of sensitive data, especially financial data, from capture through consumption. The metadata accumulated in the design and implementation of an ETL process flow can often provide the source-to-target insights necessary to meet such critical auditing demands. In such scenarios, more E and L is used than T - that is, extracting data from one standards-compliant data source and copying it, often with only simple transformations (or without any transformations at all), into another standards-compliant database (see the sketch after this list).
Master Data Management (MDM): ETL functions as a foundational component of the MDM system.
Operational BI: ETL technology extends out to offer context-rich, process-aware business intelligence reporting and analysis. The combination of ETL, EII, EAI and CDC (Change Data Capture) is increasingly leveraged to meet the needs of targeted operational BI capabilities.
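As referenced in the compliance item above, here is a hypothetical Python sketch of the "more E and L than T" pattern: rows are copied between two stores with no transformation while source-to-target metadata is recorded for later auditing. The table names, column handling, and etl_audit log are illustrative assumptions:

import sqlite3
from datetime import datetime, timezone

def copy_with_audit(src_db, tgt_db, src_table, tgt_table, columns):
    cols = ", ".join(columns)
    placeholders = ", ".join("?" * len(columns))
    with sqlite3.connect(src_db) as src, sqlite3.connect(tgt_db) as tgt:
        # Extract and load with no transformation at all.
        rows = src.execute(f"SELECT {cols} FROM {src_table}").fetchall()
        tgt.executemany(f"INSERT INTO {tgt_table} ({cols}) VALUES ({placeholders})", rows)
        # Record the movement itself: when, from where, to where, and how many records.
        tgt.execute(
            "INSERT INTO etl_audit (run_at, source, target, row_count) VALUES (?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(),
             f"{src_db}.{src_table}", f"{tgt_db}.{tgt_table}", len(rows)),
        )
        tgt.commit()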

ARCHITECTURES
An ETL architecture defines the process and sets the necessary standards for data integration, consistency, and a topology that supports dynamic change in the business environment. The hub-and-spoke concept, decentralized approaches, and the pertinent hardware/software components are not discussed here, since ETL architectures can be assembled in many different forms using an endless variety of technologies. In general, an appropriate architecture should take advantage of the software, skill sets, hardware, and standards already in place within an organization. Figure 1 shows a typical architecture where ETL is used at several stages:2
Figure 1. Operational Data / External Data --(ETL)--> Staging Area --(ETL)--> ODS --(ETL)--> DW --(ETL)--> Data Marts --> Data Consumers
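One way to read Figure 1 is as a chain of small ETL hops, each with its own extract, transform, and load step. The sketch below only enumerates those hops; the stage names mirror the figure, while the transformation labels are illustrative assumptions:

# Each hop in the figure is itself an extract/transform/load step.
pipeline = [
    # (extract from,              transform,                  load into)
    ("operational/external data", "standardize and cleanse",  "staging area"),
    ("staging area",              "integrate sources",        "operational data store (ODS)"),
    ("operational data store",    "conform dimensions",       "data warehouse (DW)"),
    ("data warehouse",            "aggregate per subject",    "data mart"),
]

for source, transform, target in pipeline:
    print(f"ETL hop: {source} --[{transform}]--> {target}")

The data marts then serve data consumers directly, without a further ETL hop in this particular topology.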

Timeliness
The timeliness of data integration is a significant factor for ETL architectures.

Functions and Compatibility

Security and Compliance
ETL architectures must fit in with the existing organization standards for security and legal compliance.

COMMON STEPS
Data Sourcing and Profiling;
Target Schema Design;
Data Mapping;
Data Extraction;
Data Validation and Cleaning;
Data Transformation;
Data Loading.
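As a small illustration of the profiling and validation/cleaning steps in this list, the Python sketch below profiles a column's completeness and splits rows into accepted and rejected sets under a cleaning rule; the sample rows and the rule itself are hypothetical:

def profile(rows, column):
    # Data profiling: basic completeness and cardinality statistics for one column.
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {"rows": len(values), "nulls": len(values) - len(non_null), "distinct": len(set(non_null))}

def validate(rows, rule):
    # Validation and cleaning: split rows into accepted and rejected sets.
    accepted, rejected = [], []
    for row in rows:
        (accepted if rule(row) else rejected).append(row)
    return accepted, rejected

rows = [{"customer_id": "A1", "amount": "10.5"}, {"customer_id": "", "amount": "-3"}]
print(profile(rows, "customer_id"))
accepted, rejected = validate(rows, lambda r: r["customer_id"] and float(r["amount"]) > 0)

The remaining steps (schema design, mapping, transformation, loading) follow the same pattern as the earlier sketches.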

DEVELOPING APPROACHES
Poor Person's Approach: ETL processes are not created to handle the complexities of data integration from heterogeneous sources. Instead, this task must be performed dynamically at query time, consuming significant server resources. This can reduce data warehouse usage by prolonging query response times. In addition, integration by users leaves the results of analysis open to interpretation.
Low implementation costs;
High utilization costs.
Iteration-Oriented Approach: ETL processes are centered on an "iterative" approach, whereby the focus is on analytical requirements. With each new analytical application, data sources, ETL processes and data models are rebuilt in a fresh "iteration," requiring more resources and lengthening deployment time for each phase.
Repetitive costs;
Latency of phase delivery.
Subject-Area-by-Subject-Area Approach:
Higher upfront costs;
Durable, flexible;
Low-latency application delivery.

TOOLS

Enterprise ETL Tools
With a variety of ETL architectures evolving in organizations, the ETL tools market is shifting from pure-play single-product offerings to data integration suites. Enterprise ETL tools are suites with integrated ETL, metadata, data quality and profiling capabilities, extended to new sources: mining output, federation, services, EAI, and semi-structured data.3 The listing of vendors/tools below, as well as the ranking, is based on Forrester's research (i.e. "Enterprise ETL, Q2 2007," see reference 1).
IBM: IBM Information Server (IIS), featuring DataStage, acquired from Ascential;
Informatica: PowerCenter;
Business Objects: Data Integrator;
Oracle: Warehouse Builder, Oracle Data Integrator (ODI, originally Sunopsis' Data Conductor);
Ab Initio: Co>Operating System;
SAS Institute: Data Integration Server, Data Integration Studio, Enterprise Integration Server;
Pervasive Software: Business Integrator Pro, Data Integrator;
Microsoft: SQL Server 2005 Integration Services (SSIS);
iWay Software: DataMigrator;
Sybase: ETL, Data Integration Suite.

Premium ETL Tools
The listing of vendors/tools below is based on a much longer list from Project Management Academy.
Barracuda Software: Barracuda Integrator;
Cognos: DecisionStream;
Corporater: Corporater Transformer;
CoSort Company: CoSORT ETL tools;
Crossflo Systems: DataExchange Server;
Embarcadero Technologies: DT Studio;
ETL Solutions: Transformation Manager;
Hummingbird Communications: Hummingbird ETL (formerly Genio Suite);
IKAN Solutions: ETL4ALL, MetaSuite;
Netik: Netik InterView;
Pitney Bowes Group 1 Software: DataFlow;
Syncsort Inc.: DMExpress;
WisdomForce Technologies: FastReader.

Open Source ETL Tools
The listing of vendors/tools below is based on a much longer list from Project Management Academy.
Clover: clover.ETL (supported by OpenSys Company);
Enhydra: Octopus;
Kinetic Networks Inc: KETL tools;
Pentaho Corp.: Kettle Project;
Talend: Talend Open Studio.

ALPHABET SOUP
The following definitions come from Teradata Magazine (September 2004).
EAI (Enterprise Application Integration) - the ability to link transactions across multiple systems through existing applications like SAP, PeopleSoft, I2 and Oracle Financials. EAI is typically a push system that is not utilized for batch transformation.
ETL (Extract-Transform-Load) - generally used to move large sets of data, transform it mid-stream and load it in the target system(s). ETL is usually a pull system; however, some vendors are heading toward push/pull ETL.
ELT (Extract-Load-Transform) - transformation occurs in the database after the loading cycle has been accomplished. ELT is generally a pull system with massive data sets.
ETLT (Extract-Transform-Load-Transform) - a combination of in-stream and in-database transformations. ETLT is generally a pull system utilized mostly in a batch-oriented or small-batch (timed) fashion.
ELR (Extract-Load in Real-time) - the ability to connect to any system, extract data and load it into another system in real-time, with some transformation occurring within the RDBMS. ELR could be a push or a pull system for sending data through the pipes.
ELRD (Extract-Load in Real-time with Dynamic restructuring capabilities) - automatically detects changes to structures and reacts to those changes automatically. ELRD will require a bit of engineering "magic" to develop.
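To illustrate the ETL/ELT distinction in the definitions above, the sketch below loads raw data first and then pushes the transformation down to the database as SQL, which is the ELT pattern; an ETL tool would instead perform the same conversions mid-stream before loading. The table and column names are illustrative assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer_id TEXT, amount TEXT)")
conn.execute("CREATE TABLE fact_orders (customer_id TEXT, amount REAL)")

# "E" and "L": extract from the source and load it unchanged into the target database.
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [("a1", "10.5"), ("A1", "2.0"), ("b2", "-1")])

# "T": the transformation runs inside the database after the load has completed.
conn.execute("""
    INSERT INTO fact_orders (customer_id, amount)
    SELECT UPPER(customer_id), CAST(amount AS REAL)
    FROM raw_orders
    WHERE CAST(amount AS REAL) > 0
""")
conn.commit()
print(conn.execute("SELECT * FROM fact_orders").fetchall())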

QUOTATIONS

Bill Inmon July 19, 2004
One of the essential components of the data warehouse is the ETL (extract/transform/load) component. The ETL component is the software and procedures that are in place to integrate and convert data from its application format into a truly agency-wide format.
- INTRODUCTION TO THE GOVERNMENT INFORMATION FACTORY (GIF), BI-BESTPRACTICES

Claudia Imhoff April, 2005
ETL is a framework that assimilates data, usually through batch processing, from the operational environment of heterogeneous technologies into integrated, consistent data suitable for consumption by the decision support processes. The target for ETL technology is a database such as a data warehouse, data mart or operational data store.
- UNDERSTANDING THE THREE E'S OF INTEGRATION EAI, EII AND ETL, DMReview

Jill Dyché April, 2006
Extract, transform, load (ETL) is typically dedicated to BI or data warehousing systems and relies on a defined set of rules--which are specified by a human being. Moreover, while ETL can mean record-at-a-time processing, it's more likely to involve bulk data migration. The purpose of ETL is to facilitate the one-way copy of data from one platform to another.
- ETL AND CDI, THE FUNDAMENTAL DIFFERENCES, DMReview

ARTICLES
"Building a Quality BI Framework Solution Starts with a Quality ETL Solution", Brian Swarbrick, dataWarehouse.com, 8/28/2007
"Beyond the Data Warehouse: Architectural Options for Data Integration", Evan Levy, bi-bestpractices.com, 5/30/2007
"The Forrester Wave: Enterprise ETL, Q2 2007", Rob Karel, Forrester Research, 5/2/2007
"Next On The Horizon: Parallel ETL", Sam Sterling, Teradata Magazine, 7/6/2004
"Is Hand-Coded ETL the Way to Go?", Gary Nissen, intelligententerprise.com, 5/31/2003
"Mastering Data Extraction", Ralph Kimball, DBMS, June 1996

REFERENCE
Karel, Rob. 2007. The Forrester Wave: Enterprise ETL, Q2 2007. Forrester Research (May).
Pillai, Ananth, and Sivaram Neelakantan. 2007. ETL Tool Evaluation. iGate Global Solutions LTD (June).
Madsen, Mark. 2007. Extract-Transform-Load (ETL) Market Overview and Directions. TDWI (June).

SEE ALSO
CDI (Customer Data Integration);
Data Integration Suite;
EAI (Enterprise Application Integration);
EII (Enterprise Information Integration);
ELT (Extract, Load and Transform);
ELTL (Extract, Load, Transform, Load);
ESB (Enterprise Service Bus);
ETL Software;
PDI (Product Data Integration);

BOOK RECOMMENDATIONS
The Data Warehouse ETL Toolkit
Subtitle: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data
Authors: Ralph Kimball, Joe Caserta
Publishers: Wiley
Publish Date: September 13, 2004
Paperback: 526 pages
ISBN-10: 0764567578
ISBN-13: 978-0764567575
Description:
Cowritten by Ralph Kimball, the world's leading data warehousing authority, whose previous books have sold more than 150,000 copies
Delivers real-world solutions for the most time- and labor-intensive portion of data warehousing: data staging, or the extract, transform, load (ETL) process
Delineates best practices for extracting data from scattered sources, removing redundant and inaccurate data, transforming the remaining data into correctly formatted data structures, and then loading the end product into the data warehouse
Offers proven time-saving ETL techniques, comprehensive guidance on building dimensional structures, and crucial advice on ensuring data quality.
From the Back Cover
The single most authoritative guide on the most difficult phase of building a data warehouse.
The extract, transform, and load (ETL) phase of the data warehouse development life cycle is far and away the most difficult, time-consuming, and labor-intensive phase of building a data warehouse. Done right, companies can maximize their use of data storage; if not, they can end up wasting millions of dollars storing obsolete and rarely used data. Bestselling author Ralph Kimball, along with Joe Caserta, shows you how a properly designed ETL system extracts the data from the source systems, enforces data quality and consistency standards, conforms the data so that separate sources can be used together, and finally delivers the data in a presentation-ready format.
Serving as a road map for planning, designing, building, and running the back-room of a data warehouse, this book provides complete coverage of proven, timesaving ETL techniques. Beginning with a quick overview of ETL fundamentals, it then looks at ETL data structures, both relational and dimensional. The authors show how to build useful dimensional structures, providing practical examples of techniques.
Along the way you'll learn how to:
Plan and design your ETL system;
Choose the appropriate architecture from the many possible options;
Build the development/test/production suite of ETL processes;
Build a comprehensive data cleaning subsystem;
Tune the overall ETL process for optimum performance.
Book Review:
Massimiliano Celaschi (Graffignano, Viterbo Italy)
Review March 22, 2005
This book takes almost all the issues in data warehouse design and presents them from an ETL perspective. Since ETL work spans more or less the whole data warehouse, this makes the book a self-contained work you can read without referring to Kimball's previous books.
Sasha Froyland (Northern California)
Review March 3, 2005
A survival guide and a must have for every data warehouse architect. This book is written for architects - not for ETL developers...
Looking for a comparison of ETL tools and which ones do what best? You will not find this here.
A great resource for DW architects who may have many years of experience working on data warehouse projects but have not had the opportunity to implement some of the more elaborate metadata-driven cleaning and conforming schemas - a truly interesting approach, yet I'm not sure Ralph Kimball's design with the `survivorship support metadata' schema could perform fast enough for the large data warehouse loading needs of larger organizations.