Thursday, September 20, 2007

'Extract, Transform, Load' (ETL)

Contents
1 DEFINITION
1.1 Extraction
1.2 Transformation
1.3 Loading
2 OVERVIEW
3 ARCHITECTURES
3.1 Timeliness
3.2 Functions and Compatibility
3.3 Security and Compliance
4 COMMON STEPS
5 DEVELOPING APPROACHES
6 TOOLS
6.1 Enterprise ETL Tools
6.2 Premium ETL Tools
6.3 Open Source ETL Tools
7 ALPHABET SOUP
8 QUOTATIONS
8.1 Bill Inmon July 19, 2004
8.2 Claudia Imhoff April, 2005
8.3 Jill Dyché April, 2006
9 ARTICLES
10 REFERENCE
11 SEE ALSO
12 BOOK RECOMMENDATIONS
12.1 The Data Warehouse ETL Toolkit
12.1.1 Review March 22, 2005
12.1.2 Review March 3, 2005

DEFINITION
'Extract, Transform, Load' (ETL) is the technology specialized in: 1) the Extraction of data from one or more sources; 2) the varied Transformations of the data (e.g. cleansing, reformatting, standardization, aggregation, or the application of any number of business rules); and 3) the Loading of the resulting data set into specified target systems or file formats.

Extraction
Extraction can involve high volumes of data from a large number of heterogeneous sources, including relational, hierarchical and object databases, files with structured or unstructured data, XML documents, web services, RFID systems, and packaged applications like SAP or PeopleSoft. External data, often purchased from third-party sources, may also be included.
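
As a concrete illustration of pulling from heterogeneous sources, the following Python sketch extracts the same logical records from a relational table and from an XML document into plain dictionaries. The table name "orders", the column names and the file paths are hypothetical examples, not part of any particular product.

import sqlite3
import xml.etree.ElementTree as ET

def extract_relational(db_path):
    # Pull rows from a relational source into plain dicts.
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    rows = [dict(r) for r in con.execute("SELECT order_id, amount FROM orders")]
    con.close()
    return rows

def extract_xml(xml_path):
    # Pull the same logical records from an XML source.
    root = ET.parse(xml_path).getroot()
    return [{"order_id": o.findtext("id"), "amount": o.findtext("amount")}
            for o in root.iter("order")]

# records = extract_relational("legacy.db") + extract_xml("partner_feed.xml")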

Transformation
Transformation can range from simple format or type conversions to complex integration and logic operations. This is one of the major differentiators between ETL and both EAI (Enterprise Application Integration) and EII (Enterprise Information Integration): the latter typically get data from one place to another, but offer little in the way of transformation features.
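
To make the kinds of transformations listed above concrete, here is a minimal Python sketch covering a type conversion, a reformatting step, a standardization rule and a simple aggregation. The record fields ("order_id", "amount", "country") and the code table are illustrative assumptions only.

from collections import defaultdict

# Hypothetical standardization rule: map free-form country names to ISO codes.
COUNTRY_CODES = {"United States": "US", "U.S.": "US", "Germany": "DE"}

def transform(records):
    cleaned = []
    for r in records:
        cleaned.append({
            "order_id": int(r["order_id"]),                      # type conversion
            "amount": round(float(r["amount"]), 2),              # reformatting
            "country": COUNTRY_CODES.get(r["country"].strip(),   # standardization
                                         r["country"].strip().upper()),
        })
    # Simple aggregation: total amount per country (a business rule might instead
    # flag orders above a threshold, derive margins, and so on).
    totals = defaultdict(float)
    for r in cleaned:
        totals[r["country"]] += r["amount"]
    return cleaned, dict(totals)

rows, totals = transform([{"order_id": "1", "amount": "10.50", "country": "U.S."},
                          {"order_id": "2", "amount": "7", "country": "Germany"}])
print(totals)   # {'US': 10.5, 'DE': 7.0}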

Loading
Loading is the transfer of processed data to the target application. The target is not just databases; it may include OLAP products, XML documents, packaged applications, flat files and even desktop productivity applications like spreadsheets. Loading can be performed in bulk or trickle modes, chosen on the basis of data acceptance requirements for each data element in every loading activity. Trickle loads may require back-end support for vetting data on a near-real-time basis, or may simply reflect an emphasis on minimizing the latency of the integration through a continuously trickling stream of data.
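
The difference between bulk and trickle modes can be sketched with a SQLite stand-in for the target: a bulk load commits a whole batch in one transaction for throughput, while a trickle load commits each record as it arrives to keep latency low. The target table "fact_orders" and its columns are hypothetical.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")

def bulk_load(rows):
    # One transaction for the whole batch: high throughput, higher latency.
    con.executemany("INSERT INTO fact_orders VALUES (:order_id, :amount)", rows)
    con.commit()

def trickle_load(row):
    # One row per transaction as records arrive: low latency, lower throughput.
    con.execute("INSERT INTO fact_orders VALUES (:order_id, :amount)", row)
    con.commit()

bulk_load([{"order_id": 1, "amount": 10.5}, {"order_id": 2, "amount": 7.0}])
trickle_load({"order_id": 3, "amount": 3.25})
print(con.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])   # 3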

OVERVIEW
ETL is heavily used in data warehousing and business intelligence scenarios to get data from source applications into the data warehouse, but ETL practices increasingly apply to data movement, consolidation, correction, synchronization and transformation, as well as to ongoing data import and export processing between OLTP systems. According to Forrester Research (2007)¹, ETL has absorbed data quality capabilities and is extending out to subsume the features of EII and EAI to meet growing business requirements, e.g. real-time or near-real-time data integration needs. According to Rob Karel of Forrester Research, non-DW ETL integration scenarios include:
Data migrations: Legacy data migration is an operational application of ETL technology that is gaining popularity where traditional data replication software and custom code no longer provide the necessary source system analysis, data profiling, metadata management, and data quality compliance now required. Common data migration examples include the implementation of a new ERP application from one or more legacy applications, an upgrade to the latest version of an enterprise application, or the integration of data from a merger or acquisition.
Compliance: In light of edicts like Basel II, organizations need to track and audit the movement of sensitive data, especially financial data, from capture through consumption. The metadata accumulated in the design and implementation of an ETL process flow can often provide the source-to-target insights necessary to meet such critical auditing demands. In such scenarios, more E and L is used than T - that is, extracting data from one standards-compliant data source and copying it, often with only simple transformations (or without any transformations at all), into another standards-compliant database.
Master Data Management(MDM): ETL functions as a foundational component of the MDM system.
Operational BI: ETL technology extends out to offer context-rich, process-aware business intelligence reporting and analysis. The combination of ETL, EII, EAI and CDC (Change Data Capture) is increasingly leveraged to meet the needs of targeted operational BI capabilities.

ARCHITECTURES
An ETL architecture defines the process and sets the necessary standards for data integration, consistency and a topology that supports dynamic change in the business environment. The hub-and-spoke concept, decentralized approaches and the pertinent hardware/software components will not be discussed here, since ETL architectures can be assembled in many different forms using an endless variety of technologies. In general, an appropriate architecture should take advantage of the software, skill sets, hardware and standards already in place within an organization. Figure 1 shows a typical architecture in which ETL is used at different stages:²
________________________________________________________________________________

Figure 1. A typical architecture in which ETL is applied at every stage:

Operational Data --+
Operational Data --+--(ETL)--> Stage --(ETL)--> ODS --(ETL)--> DW --(ETL)--> Data Marts --> Data Consumers
External Data -----+
________________________________________________________________________________
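
As a rough sketch of the staged flow in Figure 1, the Python fragment below chains one Extract-Transform-Load hop per stage; plain in-memory lists stand in for the staging area, ODS, DW and data mart, and all record fields are made-up examples.

def etl_step(extract, transform, load):
    # Run one Extract-Transform-Load hop between two stores.
    load(transform(extract()))

# In-memory stand-ins for the stores shown in Figure 1.
staging, ods, dw, data_mart = [], [], [], []

operational = [{"cust": " ann ", "amount": "10.5"}, {"cust": "Bob", "amount": "7"}]
external    = [{"cust": "Cara", "amount": "3.25"}]

# Source systems -> staging area (extract with minimal transformation).
etl_step(lambda: operational + external, lambda rows: rows, staging.extend)

# Staging -> ODS (cleansing and standardization).
etl_step(lambda: staging,
         lambda rows: [{"cust": r["cust"].strip().title(),
                        "amount": float(r["amount"])} for r in rows],
         ods.extend)

# ODS -> DW, then DW -> data mart (aggregation for consumers).
etl_step(lambda: ods, lambda rows: rows, dw.extend)
etl_step(lambda: dw,
         lambda rows: [{"total_amount": sum(r["amount"] for r in rows)}],
         data_mart.extend)

print(data_mart)   # data consumers read from the mart: [{'total_amount': 20.75}]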

Timeliness
The timeliness of data integration is a significant factor for ETL architectures.

Functions and Compatibility

Security and Compliance
ETL architectures must fit in with the existing organization standards for security and legal compliance.

COMMON STEPS
Data Sourcing and Profiling;
Target Schema Design;
Data Mapping;
Data Extraction;
Data Validation and Cleaning;
Data Transformation;
Data Loading.
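
The data mapping and validation/cleaning steps above are often the least obvious ones, so here is a small Python sketch of a source-to-target field map plus a couple of validation rules; the source columns (cust_nm, ord_amt, ord_dt), target names and rules are hypothetical examples.

# Data mapping: source column -> target column (illustrative names).
FIELD_MAP = {
    "cust_nm": "customer_name",
    "ord_amt": "order_amount",
    "ord_dt":  "order_date",
}

def validate(record):
    # Return a list of rule violations for one mapped record.
    errors = []
    if not record["customer_name"]:
        errors.append("missing customer_name")
    if record["order_amount"] < 0:
        errors.append("negative order_amount")
    return errors

def map_and_clean(source_rows):
    good, rejects = [], []
    for row in source_rows:
        mapped = {FIELD_MAP[k]: v for k, v in row.items() if k in FIELD_MAP}
        mapped["customer_name"] = (mapped.get("customer_name") or "").strip()
        mapped["order_amount"] = float(mapped.get("order_amount", 0))
        errors = validate(mapped)
        (good if not errors else rejects).append((mapped, errors))
    return good, rejects

good, rejects = map_and_clean(
    [{"cust_nm": " Ann ", "ord_amt": "10.5", "ord_dt": "2007-09-20"},
     {"cust_nm": "",      "ord_amt": "-1",   "ord_dt": "2007-09-21"}])
print(len(good), len(rejects))   # 1 1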

DEVELOPING APPROACHES
Poor Person's Approach: No ETL processes are created to handle the complexities of data integration from heterogeneous sources. Instead, this task must be performed dynamically at query time, consuming significant server resources. This can reduce data warehouse usage by prolonging query response times. In addition, integration by users leaves the results of analysis open to interpretation.
Low implementation costs;
High utilization costs.
Iteration-Oriented Approach: ETL processes are centered on an “iterative” approach in which the focus is on analytical requirements. With each new analytical application, data sources, ETL processes and data models are rebuilt in a new “iteration,” consuming more resources and lengthening deployment time for each phase.
Repetitive costs;
Latency of phase delivery.
Subject-Area-by-Subject-Area Approach:
Higher upfront costs;
Durable, flexible;
Low-latency application delivery.

TOOLS

Enterprise ETL Tools
With a variety of ETL architectures evolving in organizations, the ETL tools market is transforming from pure-play single-product offerings into data integration suites. Enterprise ETL tools are suites with integrated ETL, metadata, data quality and profiling capabilities, covering new sources such as mining output, federation, services, EAI and semi-structured data.³ The listing of vendors/tools below, as well as the ranking, is based on Forrester's research (i.e. "Enterprise ETL, Q2 2007," see reference 1).
IBM: IBM Information Server (IIS), featuring DataStage from the acquired Ascential;
Informatica: PowerCenter;
Business Objects: Data Integrator;
Oracle: Warehouse Builder, Oracle Data Integrator (ODI, originally Sunopsis' Data Conductor);
Ab Initio: Co>Operating System;
SAS Institute: Data Integration Server, Data Integration Studio, Enterprise Integration Server;
Pervasive Software: Business Integrator Pro, Data Integrator;
Microsoft: SQL Server 2005 Integration Services (SSIS);
iWay Software: DataMigrator;
Sybase: ETL, Data Integration Suite.

Premium ETL Tools
The listing of vendors/tools below is based on a much longer list from Project Management Academy.
Barracuda Software: Barracuda Integrator;
Cognos: DecisionStream;
Corporater: Corporater Transformer
CoSort Company: CoSORT ETL tools;
Crossflo Systems: DataExchange Server;
Embarcadero Technologies: DT Studio;
ETL Solutions: Transformation Manager;
Hummingbird Communications: Hummingbird ETL (formerly Genio Suite);
IKAN Solutions: ETL4ALL, MetaSuite;
Netik: Netik InterView;
Pitney Bowes Group 1 Software: DataFlow;
Syncsort Inc.: DMExpress;
WisdomForce Technologies: FastReader.

Open Source ETL Tools
The listing of vendors/tools below is based on a much longer list from Project Management Academy.
Clover: clover.ETL (supported by OpenSys Company);
Enhydra: Octopus;
Kinetic Networks Inc: KETL tools;
Pentaho Corp.: Kettle Project;
Talend: Talend Open Studio.

ALPHABET SOUP
The following definitions come from Teradata Magazine (September 2004).
EAI (Enterprise Application Integration) - the ability to link transactions across multiple systems through existing applications like SAP, PeopleSoft, I2 and Oracle Financials. EAI is typically a push system that is not utilized for batch transformation.
ETL (Extract-Transform-Load) - generally used to move large sets of data, transform it mid-stream and load it in the target system(s). ETL is usually a pull system; however, some vendors are heading toward push/pull ETL.
ELT (Extract-Load-Transform) - transformation occurs in the database after the loading cycle has been accomplished (see the sketch contrasting ETL and ELT after this list). ELT is generally a pull system with massive data sets.
ETLT (Extract-Transform-Load-Transform) - a combination of in-stream and in-database transformations. ETLT is generally a pull system utilized mostly in a batch-oriented or small-batch (timed) fashion.
ELR (Extract-Load in Real-time) - the ability to connect to any system, extract data and load it into another system in real-time, with some transformation occurring within the RDBMS. ELR could be a push or a pull system for sending data through the pipes.
ELRD (Extract-Load in Real-time with Dynamic restructuring capabilities) - automatically detects changes to structures and reacts to those changes automatically. ELRD will require a bit of engineering "magic" to develop.
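
To make the ETL/ELT contrast above concrete, the sketch below first transforms records mid-stream and loads the finished rows (ETL), then loads the raw rows and runs the transformation as SQL inside the target database (ELT). SQLite stands in for the target RDBMS, and the table names "raw_sales" and "clean_sales" are hypothetical.

import sqlite3

raw = [("  ann ", "10.50"), ("BOB", "7")]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clean_sales (customer TEXT, amount REAL)")
con.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")

# ETL: transform mid-stream, then load the finished rows.
transformed = [(c.strip().title(), float(a)) for c, a in raw]
con.executemany("INSERT INTO clean_sales VALUES (?, ?)", transformed)

# ELT: load the raw rows first, then transform inside the database with SQL.
con.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw)
con.execute("""
    INSERT INTO clean_sales (customer, amount)
    SELECT upper(trim(customer)), CAST(amount AS REAL) FROM raw_sales
""")

print(con.execute("SELECT * FROM clean_sales").fetchall())

In the ELT case the heavy lifting is pushed into the database engine, which is why it is generally paired with massive data sets, as noted above.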

QUOTATIONS

Bill Inmon July 19, 2004
One of the essential components of the data warehouse is the ETL (extract/transform/load) component. The ETL component is the software and procedures that are in place to integrate and convert data from its application format into a truly agency-wide format.
- INTRODUCTION TO THE GOVERNMENT INFORMATION FACTORY (GIF), BI-BESTPRACTICES

Claudia Imhoff April, 2005
ETL is a framework that assimilates data, usually through batch processing, from the operational environment of heterogeneous technologies into integrated, consistent data suitable for consumption by the decision support processes. The target for ETL technology is a database such as a data warehouse, data mart or operational data store.
- UNDERSTANDING THE THREE E'S OF INTEGRATION EAI, EII AND ETL, DMReview

Jill Dyché April, 2006
Extract, transform, load (ETL) is typically dedicated to BI or data warehousing systems and relies on a defined set of rules--which are specified by a human being. Moreover, while ETL can mean record-at-a-time processing, it's more likely to involve bulk data migration. The purpose of ETL is to facilitate the one-way copy of data from one platform to another.
- ETL AND CDI, THE FUNDAMENTAL DIFFERENCES, DMReview

ARTICLES
"Building a Quality BI Framework Solution Starts with a Quality ETL Solution", Brian Swarbrick, dataWarehouse.com, 8/28/2007
"Beyond the Data Warehouse: Architectural Options for Data Integration", Evan Levy, bi-bestpractices.com, 5/30/2007
"The Forrester Wave: Enterprise ETL, Q2 2007", Rob Karel, Forrester Research, 5/2/2007
"Next On The Horizon: Parallel ETL", Sam Sterling, Teradata Magazine, 7/6/2004
"Is Hand-Coded ETL the Way to Go?", Gary Nissen, intelligententerprise.com, 5/31/2003
"Mastering Data Extraction", Ralph Kimball, DBMS, June 1996

REFERENCE
Karel, Rob. 2007. The Forrester Wave: Enterprise ETL, Q2 2007. Forrester Research (May).
Pillai, Ananth, and Sivaram Neelakantan. 2007. ETL Tool Evaluation. iGate Global Solutions Ltd (June).
Madsen, Mark. 2007. Extract-Transform-Load (ETL) Market Overview and Directions. TDWI (June).

SEE ALSO
CDI (Customer Data Integration);
Data Integration Suite;
EAI (Enterprise Application Integration);
EII (Enterprise Information Integration);
ELT (Extract, Load and Transform);
ELTL (Extract, Load, Transform, Load);
ESB (Enterprise Service Bus);
ETL Software;
PDI (Product Data Integration).

BOOK RECOMMENDATIONS
The Data Warehouse ETL Toolkit
Subtitle: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data
Authors: Ralph Kimball, Joe Caserta
Publishers: Wiley
Publish Date: September 13, 2004
Paperback: 526 pages
ISBN-10: 0764567578
ISBN-13: 978-0764567575
Description:
Cowritten by Ralph Kimball, the world's leading data warehousing authority, whose previous books have sold more than 150,000 copies
Delivers real-world solutions for the most time- and labor-intensive portion of data warehousing: data staging, or the extract, transform, load (ETL) process
Delineates best practices for extracting data from scattered sources, removing redundant and inaccurate data, transforming the remaining data into correctly formatted data structures, and then loading the end product into the data warehouse
Offers proven time-saving ETL techniques, comprehensive guidance on building dimensional structures, and crucial advice on ensuring data quality.
From the Back Cover
The single most authoritative guide on the most difficult phase of building a data warehouse.
The extract, transform, and load (ETL) phase of the data warehouse development life cycle is far and away the most difficult, time-consuming, and labor-intensive phase of building a data warehouse. Done right, companies can maximize their use of data storage; if not, they can end up wasting millions of dollars storing obsolete and rarely used data. Bestselling author Ralph Kimball, along with Joe Caserta, shows you how a properly designed ETL system extracts the data from the source systems, enforces data quality and consistency standards, conforms the data so that separate sources can be used together, and finally delivers the data in a presentation-ready format.
Serving as a road map for planning, designing, building, and running the back-room of a data warehouse, this book provides complete coverage of proven, timesaving ETL techniques. Beginning with a quick overview of ETL fundamentals, it then looks at ETL data structures, both relational and dimensional. The authors show how to build useful dimensional structures, providing practical examples of techniques.
Along the way you'll learn how to:
Plan and design your ETL system;
Choose the appropriate architecture from the many possible options;
Build the development/test/production suite of ETL processes;
Build a comprehensive data cleaning subsystem;
Tune the overall ETL process for optimum performance.
Book Review:
Massimiliano Celaschi (Graffignano, Viterbo Italy)
Review March 22, 2005
This book takes on almost all the issues in data warehouse design and presents them from an ETL point of view. In practice, ETL work spans most of the data warehouse (more or less), so the need to describe it all makes this an autonomous work you can read without referring to previous books by Kimball.
Sasha Froyland (Northern California)
Review March 3, 2005
A survival guide and a must have for every data warehouse architect. This book is written for architects - not for ETL developers...
Looking for a comparison of ETL tools and which ones do what best? You will not find this here.
A great resource for DW architects who may have many years of experience working on data warehouse projects but may not have had the opportunity to implement some of the more elaborate metadata-driven cleaning and conforming schemas. It is a truly interesting approach, yet I'm not sure Ralph Kimball's design with the `survivorship support metadata' schema could perform fast enough for the large data warehouse loading needs of larger organizations.

Thursday, August 30, 2007

DWH-BI-COGNOS/Report Studio -- use HTML item to get current user

This is an example of using a JavaScript function in a Cognos Report Studio report to get the current user:
1. Create a new report in Cognos Report Studio.
2. Create a new prompt page and add an HTML item control to the page.
3. Add the following .js code to the HTML item.
When the form loads you should see an alert box with the currently logged-in username.
function init(){ getFormElements(); }

function getFormElements(){
  // Scan the Cognos prompt form for the hidden "ui.username" field
  // and show its value in an alert box.
  for (var i = 0; i < document.forms["formWarpRequest"].elements.length; i++){
    if (document.forms["formWarpRequest"].elements[i].name == "ui.username"){
      alert('THIS IS THE USER: ' + document.forms["formWarpRequest"].elements[i].value);
    }
  }
}

Add your corporate logo to Cognos Connection

You can customize the Cognos Connection interface for Cognos 8 to resemble your corporate branding. In order to do this you must modify the system.xml file and place your logo within your Cognos install path location.
1. Start off by placing the image in …Cognos 8 install path/c8\webcontent\skins\companyXYZ\branding\
2. Back up the system.xml file!
3. Open the file using Notepad or an XML editor.
4. Find the element tag “”. Uncomment the XML element tag and then add the following section:

Auditing your Cognos user activity

When managing a PRODUCTION environment you may be required to log or audit user activity. Cognos 8 comes with a sample model and series of reports that can help you achieve this goal. Ironically enough, the Framework Manager package is called Audit and can be found at …c8_location/webcontent/samples/Models/Audit/Audit.cpf. This sample Framework Manager package will give you a glimpse of some of the items that you can capture during a user session. Accompanying the Audit model is an Audit deployment package which has 10-12 reports. Some examples of the reports are:
Execute reports by user -- Lists the reports that were run, by user and by package. It also includes the timestamp and execution time (in milliseconds) for each report. You can select a date and time range, one or more users, one or more packages, and one or more reports.
Execution history by user -- Lists the reports that were run alphabetically, along with the package and timestamp, by user, since the logging database was created. It includes the total number of times each user ran each report and the total number of reports each user ran altogether. It also includes the total number of reports run by all users. You can select one or more users for the report. After you run the audit report, you can choose to view the statistics for a particular report or for all reports.
Logon operations by time stamp -- Shows logon and logoff timestamps and operations, by user. It also includes the total number of logons for each user and the total number of logons altogether. You can select the time period and one or more users for the report.
Logon operations by user name -- Shows logon and logoff timestamps by user, along with the type of logoff operation that occurred. It also includes the total number of logons for each user and the total number of logons altogether. You can select one or more users for the report.
You can also refer to Chapter 16 of the Administration and Security Guide for more details on how to implement and modify the Audit package and the Audit reports.
Posted by Ravi Sankaru Chinta at 12:00 AM

Hide icon links in the Cognos portal?

1. Stop the Cognos service.
2. Open the installation_location/templates/ps/portal/system.xml file in an XML or text editor.
3. Use the following syntax in the element to hide a user interface element. For example, the following XML code hides the entire Cognos Connection header and the New Job button in the toolbar:
4. Restart the Cognos service.
Posted by Ravi Sankaru Chinta at 12:02 AM

Cognos Server Load Distribution - Tips

Estimating Load Distribution on Cognos Server
In Cognos 8, load is generated by two main factors: 1) user navigation and processing requests, such as requests to run or view reports, and 2) requests made through automated or event-driven processes, including scheduled and burst reports. By determining when users are most likely to be using Cognos 8 and submitting processing requests, you can then decide when to schedule automated processes. This allows you to distribute the processing load evenly over time, so that you make the best use of your system resources to maintain optimal performance. The key to doing this is estimating the number of concurrent users that will be applying load to your Cognos 8 system at any time. Some of the main factors to consider are business hours, business practices, and the geographic distribution of users. These factors can determine how the concurrency rate changes over time, and how you choose to ensure adequate capacity.
A Business Intelligence application in which requests are spread evenly throughout the day has a lower peak concurrency ratio than an application in which the majority of requests are limited to a specific time of day. For example, if users are concentrated in one time zone, there will likely be heavy demand during business hours, followed by a period of low demand after hours. In this situation, you may be able to manage peak and non-peak time periods by sharing hardware resources between interactive and non-interactive processes. You would schedule automated activity to run in non-peak times to produce content for retrieval by interactive users in peak times.
On the other hand, if your user population is distributed across several time zones, user load on the system tends to be spread out over more hours, and there are fewer available non-peak hours for scheduled activities. In this situation, you may choose to dedicate separate hardware resources for interactive and non-interactive use.
Source: Cognos Architecture and Planning Guide 8.1
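
To show how a concurrency estimate like this might be roughed out, here is an illustrative Python sketch (not taken from the Cognos guide): it estimates concurrent interactive users per UTC hour from named users per time zone and an assumed concurrency ratio, to highlight non-peak windows for scheduled or burst activity. Every figure below is a made-up assumption.

CONCURRENCY_RATIO = 0.10                  # assumed share of named users active at once
USERS_BY_TZ = {"UTC-5": 400, "UTC+0": 250, "UTC+8": 150}   # hypothetical user counts
BUSINESS_HOURS = range(8, 18)             # assumed local business hours

def concurrent_users(utc_hour):
    total = 0
    for tz, users in USERS_BY_TZ.items():
        local_hour = (utc_hour + int(tz[3:])) % 24
        if local_hour in BUSINESS_HOURS:
            total += users * CONCURRENCY_RATIO
    return round(total)

load_by_hour = {h: concurrent_users(h) for h in range(24)}
quiet_hours = [h for h, n in load_by_hour.items() if n == 0]   # candidates for scheduled jobs
print(load_by_hour)
print(quiet_hours)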
Posted by Ravi Sankaru Chinta at 12:01 AM