In computing, extract, transform, load (ETL) refers to a process in database usage and especially in data warehousing. The ETL process became a popular concept in the 1970s. Data extraction is where data is extracted from homogeneous or heterogeneous data sources; data transformation is where the data is converted into the proper format or structure for the purposes of querying and analysis; data loading is where the data is loaded into the final target database, more specifically an operational data store, data mart, or data warehouse. A well-designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms the data so that separate sources can be used together, and finally delivers the data in a presentation-ready format so that application developers can build applications and end users can make informed decisions.
Since data extraction takes time, it is common to execute the three phases in a pipeline. While data is being extracted, a transformation process works on the data already received and prepares it for loading, while the loading of data begins without waiting for the previous phases to complete.
ETL systems generally integrate data from multiple applications (systems), usually developed and supported by different vendors or hosted on separate computer hardware. Separate systems that contain original data are often managed and operated by different employees. For example, cost accounting systems can combine data from payroll, sales, and purchases.
Extract
The first part of the ETL process involves extracting the data from the source system(s). In many cases, this is the most important aspect of ETL, since extracting the data correctly sets the stage for the success of the subsequent processes. Most data-warehousing projects combine data from different source systems. Each separate system may also use a different data organization and/or format. Common data-source formats include relational databases, XML, JSON, and flat files, but may also include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even formats fetched from outside sources by means such as web spidering or screen scraping. Streaming the extracted data source and loading it on-the-fly into the destination database is another way of performing ETL when no intermediate data storage is required. In general, the extraction phase aims to convert the data into a single format suitable for transformation processing.
An intrinsic part of the extraction involves data validation to confirm whether the data pulled from the source has the correct/expected values in a given domain (such as a pattern/default or list of values). If the data fails the validation rules, it is rejected entirely or in part. The rejected data is ideally reported back to the source system for further analysis to identify and correct the incorrect records. In some cases, the extraction process itself may have to apply data-validation rules in order to accept the data and pass it to the next phase.
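A minimal sketch of this extract-with-validation step, using only Python's standard library; the input file layout and the customer_id/email fields are hypothetical assumptions, not part of any particular source system.

```python
import csv
import re

# Hypothetical domain rule: e-mail addresses must roughly match this pattern.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def extract(path):
    """Read rows from a CSV export and split them into accepted and
    rejected records based on simple validation rules."""
    accepted, rejected = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Validation: required key present and e-mail matches the expected pattern.
            if row.get("customer_id") and EMAIL_RE.match(row.get("email", "")):
                accepted.append(row)
            else:
                rejected.append(row)  # ideally reported back to the source system
    return accepted, rejected
```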
Transform
In the data transformation stage, a series of rules or functions is applied to the extracted data in order to prepare it for loading into the end target. Some data requires no transformation at all; such data is known as "direct move" or "pass through" data.
An important function of transformation is data cleansing, which aims to pass only "proper" data to the target. The challenge when different systems interact is in the relevant systems' interfacing and communicating. Character sets that are available in one system may not be available in others.
In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the server or data warehouse (a combined sketch follows the list):
- Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called "attributes"), roll_no, age, and salary, then the selection may take only roll_no and salary. Similarly, the selection mechanism may ignore all those records where salary is not present (salary = null).
- Translating coded values (for example, if the source system codes male as "1" and female as "2", but the warehouse codes male as "M" and female as "F")
- Encoding free-form values (for example, mapping "Male" to "M")
- Deriving a new calculated value (for example, sale_amount = qty * unit_price)
- Sorting or ordering the data based on a list of columns to improve search performance
- Joining data from multiple sources (for example, lookup, merge) and deduplicating the data
- Aggregating (for example, rollup - summarizing multiple rows of data - total sales for each store, and for each region, etc.)
- Generating surrogate-key values
- Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
- Splitting a column into multiple columns (for example, converting a comma-separated list, specified as a string in one column, into individual values in different columns)
- Disaggregating repeating columns
- Looking up and validating the relevant data from tables or referential files
- Applying any form of data validation; failed validation may result in a full rejection of the data, partial rejection, or no rejection at all, and thus none, some, or all of the data is handed over to the next step depending on the rule design and exception handling; many of the above transformations may themselves result in exceptions, for example, when a code translation parses an unknown code in the extracted data
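As an illustration only, the sketch below combines a few of the transformation types above (column selection, code-value translation, a derived value, and deduplication) on in-memory rows; the field names (roll_no, salary, gender, qty, unit_price) are the hypothetical ones used in the examples.

```python
def transform(rows):
    """Apply a handful of representative transformations to extracted rows."""
    gender_map = {"1": "M", "2": "F"}            # code-value translation
    seen, out = set(), []
    for r in rows:
        if r.get("salary") in (None, ""):        # selection: drop rows with no salary
            continue
        key = r.get("roll_no")                   # deduplicate on the business key
        if key in seen:
            continue
        seen.add(key)
        out.append({
            "roll_no": key,                                      # keep only needed columns
            "salary": float(r["salary"]),
            "gender": gender_map.get(r.get("gender", ""), "U"),  # unknown codes flagged "U"
            "sale_amount": float(r.get("qty", 0)) * float(r.get("unit_price", 0)),  # derived value
        })
    return out

print(transform([{"roll_no": "1", "salary": "100", "gender": "2", "qty": "3", "unit_price": "4"}]))
```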
Load
The load phase loads the data into the end target, which may be a simple flat file or a data warehouse. Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in a historical form at regular intervals - for example, hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year. This data warehouse overwrites any data older than a year with newer data. However, the entry of data for any one-year window is made in a historical manner. The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded into the data warehouse.
As the load phase interacts with a database, the constraints defined in the database schema - as well as in triggers activated upon data load - apply (e.g., uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process.
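A minimal sketch of how such schema constraints surface at load time, using Python's built-in sqlite3 module as a stand-in for a warehouse target; the dim_customer table is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,   -- uniqueness enforced by the schema
        name        TEXT NOT NULL          -- mandatory field
    )
""")

rows = [(1, "Alice"), (2, "Bob"), (1, "Alice")]  # the duplicate violates the key constraint

for customer_id, name in rows:
    try:
        conn.execute("INSERT INTO dim_customer VALUES (?, ?)", (customer_id, name))
    except sqlite3.IntegrityError as exc:
        # Constraint violations surface during the load and must be handled or reported.
        print(f"rejected row ({customer_id}, {name}): {exc}")
conn.commit()
```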
- For example, a financial institution might have information on a customer in several departments, and each department might have that customer's information listed in a different way. The membership department might list the customer by name, whereas the accounting department might list the customer by number. ETL can bundle all of these data elements and consolidate them into a uniform presentation, such as for storing in a database or data warehouse.
- Another way that companies use ETL is to move information to another application permanently. For instance, the new application might use another database vendor and most likely a very different database schema. ETL can be used to transform the data into a format suitable for the new application to use.
- An example would be an expense and cost recovery system (ECRS) such as used by accountancies, consultancies, and legal firms. The data usually ends up in the time and billing system, although some businesses may also utilize the raw data for employee productivity reports to Human Resources (personnel department) or equipment usage reports to Facilities Management.
Real-life ETL cycle
The typical real-life ETL cycle consists of the following execution steps (a skeleton orchestration follows the list):
- Cycle initiation
- Build reference data
- Extract (from source)
- Validation
- Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
- Stage (loading to the staging table, if used)
- Audit reports (e.g., about compliance with business rules, also, in case of failure, help to diagnose/fix)
- Publish (to the target table)
- Archive
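The cycle above can be read as a simple ordered pipeline. The skeleton below strings hypothetical placeholder steps together in that order to show the control flow; it is not the API of any particular ETL tool.

```python
def run_cycle(steps):
    """Run the ETL cycle steps in order, passing a shared context dict along."""
    context = {"run_id": 1}
    for name, step in steps:
        print(f"running step: {name}")
        step(context)
    return context

# Hypothetical placeholder steps; real implementations would hold the actual
# extract/validate/transform/load logic.
steps = [
    ("build reference data", lambda ctx: ctx.setdefault("reference", {})),
    ("extract",              lambda ctx: ctx.setdefault("raw", [])),
    ("validate",             lambda ctx: ctx.setdefault("valid", ctx["raw"])),
    ("transform",            lambda ctx: ctx.setdefault("transformed", ctx["valid"])),
    ("stage",                lambda ctx: None),   # load into staging tables, if used
    ("audit report",         lambda ctx: None),   # report rule violations and failures
    ("publish",              lambda ctx: None),   # write to the target tables
    ("archive",              lambda ctx: None),
]

run_cycle(steps)
```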
Challenges
The ETL process can involve considerable complexity, and significant operational problems can occur with incorrectly designed ETL systems.
The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis can identify the data conditions that must be managed by transform rule specifications, leading to an amendment of the validation rules explicitly and implicitly implemented in the ETL process.
Data warehouses are typically assembled from a variety of data sources with different formats and purposes. Thus, ETL is a key process for bringing all the data together in a standard, homogeneous environment.
Design analysis should establish the scalability of an ETL system across the lifetime of its usage, including understanding the volumes of data that must be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to multiple-day micro batch to integration with message queues or real-time change-data capture for continuous transformation and update.
Performance
ETL vendors benchmark their record systems at multiple TB (terabytes) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit-network connections, and much memory.
In real life, the slowest part of an ETL process usually occurs in the database load phase. Databases may perform slowly because they have to take care of concurrency, integrity maintenance, and indices. Thus, for better performance, it may make sense to employ:
- Direct path extract method or bulk unload whenever possible (instead of querying the database) to reduce the load on the source system while getting high-speed extracts
- Most of the transformation processing outside of the database
- Bulk loading operations whenever possible
However, even using bulk operations, database access is usually the bottleneck in the ETL process. Some common methods used to increase performance are (a sketch of several of them follows the list):
- Partition tables (and indices): try to keep partitions similar in size (watch for null values that can skew the partitioning)
- Do all validation in the ETL layer before the load: disable integrity checking (disable constraint ...) in the target database tables during the load
- Disable triggers (disable trigger ...) in the target database tables during the load: simulate their effect as a separate step
- Generate the ID in the ETL layer (not in the database)
- Drop the indices (on a table or partition) before the load - and recreate them after the load (SQL: drop index ...; create index ...)
- Use parallel bulk load when possible - works well when the table is partitioned or there are no indices (Note: attempting to do parallel loads into the same table (partition) usually causes locks - if not on the data rows, then on the indices)
- If a requirement exists to do insertions, updates, or deletions, find out which rows should be processed in which way in the ETL layer, and then process these three operations in the database separately; you can often do bulk load for inserts, but updates and deletes commonly go through an API (using SQL)
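A sketch of the drop-index/bulk-load/rebuild pattern from the list above, shown with sqlite3 so it runs anywhere; production warehouses have their own bulk-load utilities and DDL, so treat the statements as illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_sales_id ON sales (id)")

rows = [(i, i * 1.5) for i in range(100_000)]

# Drop the index before loading and rebuild it afterwards,
# so the bulk load is not slowed down by index maintenance.
conn.execute("DROP INDEX idx_sales_id")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)  # bulk load
conn.execute("CREATE INDEX idx_sales_id ON sales (id)")
conn.commit()
```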
Whether to do certain operations in the database or outside may involve a trade-off. For example, removing duplicates using distinct may be slow in the database; thus, it makes sense to do it outside. On the other hand, if using distinct significantly (x100) decreases the number of rows to be extracted, then it makes sense to remove duplicates as early as possible in the database before unloading the data.
A common source of problems in ETL is a large number of dependencies among ETL jobs. For example, job "B" cannot start while job "A" has not finished. One can usually achieve better performance by visualizing all processes on a graph, trying to reduce the graph while making maximum use of parallelism, and making "chains" of consecutive processing as short as possible. Again, partitioning of big tables and their indices can really help.
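One way to exploit such a dependency graph is to compute "waves" of jobs whose dependencies are already satisfied and which can therefore run in parallel; a minimal sketch with a hypothetical job graph follows, using the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each job maps to the set of jobs it depends on.
jobs = {
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}

ts = TopologicalSorter(jobs)
ts.prepare()
while ts.is_active():
    wave = list(ts.get_ready())   # every job in a wave can run in parallel
    print("run in parallel:", wave)
    ts.done(*wave)
```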
Another common issue occurs when the data is spread among several databases, and processing is done in those databases sequentially. Sometimes database replication may be involved as a method of copying data between databases - it can significantly slow down the whole process. A common solution is to reduce the processing graph to only three layers:
- Sources
- Central ETL layer
- Target
This approach allows processing to take maximum advantage of parallelism. For example, if you need to load data into two databases, you can run the loads in parallel (instead of loading into the first and then replicating into the second).
Sometimes processing must take place sequentially. For example, dimensional (reference) data is needed before one can obtain and validate the rows for the main "fact" tables.
Parallel processing
The latest development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve overall ETL performance when dealing with large volumes of data.
ETL applications implement three main types of parallelism:
- Data: splitting a single sequential file into smaller data files to provide parallel access
- Pipeline: allowing the simultaneous running of several components on the same data stream, e.g., looking up a value on record 1 at the same time as adding two fields on record 2
- Component: the simultaneous running of multiple processes on different data streams in the same job, e.g., sorting one input file while removing duplicates on another file
All three types of parallelism usually operate combined in a single job.
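A minimal illustration of the first type (data parallelism): the input is split into chunks and each chunk is transformed in a separate process; the chunk size and the doubling transform are arbitrary assumptions.

```python
from multiprocessing import Pool

def transform_chunk(chunk):
    """Hypothetical per-chunk transformation."""
    return [value * 2 for value in chunk]

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Data parallelism: split one sequential input into smaller chunks.
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    with Pool(processes=4) as pool:
        results = pool.map(transform_chunk, chunks)   # chunks processed in parallel
    flat = [value for chunk in results for value in chunk]
    print(len(flat))
```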
A further difficulty is ensuring that the data being loaded is relatively consistent. Because multiple source databases may have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled with the contents of a source system or with the general ledger, establishing synchronization and reconciliation points becomes necessary.
Rerunnability, recoverability
Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with a "row_id" and tag each piece of the process with a "run_id". In case of a failure, having these IDs helps to roll back and rerun the failed piece.
Best practice also calls for checkpoints, which are states reached when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out temporary files, log the state, and so on.
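A sketch of the run_id/row_id tagging and checkpointing idea; the checkpoint file layout and the step names are assumptions for illustration, not a standard.

```python
import json
import os

CHECKPOINT_FILE = "etl_checkpoint.json"
STEPS = ["extract", "transform", "load"]

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"run_id": 0, "completed_steps": []}

def save_checkpoint(state):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

state = load_checkpoint()
if set(STEPS) <= set(state["completed_steps"]):
    # The previous run finished cleanly; start a fresh run.
    state = {"run_id": state["run_id"] + 1, "completed_steps": []}

for step in STEPS:
    if step in state["completed_steps"]:
        continue                      # resume: skip pieces that already finished
    # ... do the work here, tagging each row with state["run_id"] and a
    #     per-row row_id so a failed piece can be rolled back and rerun ...
    state["completed_steps"].append(step)
    save_checkpoint(state)            # checkpoint after each completed piece
```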
Virtual ETL
By 2010, data virtualization had begun to advance ETL processing. The application of data virtualization to ETL allows solving the most common ETL tasks of data migration and application integration for multiple dispersed data sources. Virtual ETL operates with an abstracted representation of the objects or entities gathered from a variety of relational, semi-structured, and unstructured data sources. ETL tools can leverage object-oriented modeling and work with entity representations persistently stored in a centrally located hub-and-spoke architecture. Such a collection of entity or object representations gathered from the data sources for ETL processing is called a metadata repository, and it can reside in memory or be made persistent. By using a persistent metadata repository, ETL tools can transition from one-time projects to persistent middleware, performing data harmonization and data profiling consistently and in near-real time.
Dealing with keys
Unique keys play an important part in all relational databases, as they tie everything together. A unique key is a column that identifies a given entity, whereas a foreign key is a column in another table that refers to a primary key. Keys can comprise several columns, in which case they are composite keys. In many cases, the primary key is an auto-generated integer that has no meaning for the business entity being represented, but solely exists for the purpose of the relational database - commonly referred to as a surrogate key.
As there is usually more than one data source getting loaded into the warehouse, the keys are an important concern to address. For example: customers might be represented in several data sources, with their Social Security Number as the primary key in one source, their phone number in another, and a surrogate in the third. Yet a data warehouse may require consolidating all the customer information into one dimension.
A recommended way to deal with this concern involves adding a warehouse surrogate key, which is used as a foreign key from the fact table.
Usually, updates occur to a dimension's source data, which obviously must be reflected in the data warehouse.
If the primary key of the source data is required for reporting, the dimension already contains that piece of information for each row. If the source data uses a surrogate key, the warehouse must keep track of it even though it is never used in queries or reports; this is done by creating a lookup table that contains the warehouse surrogate key and the originating key. This way, the dimension is not polluted with surrogates from various source systems, while the ability to update is preserved.
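A minimal sketch of such a lookup table with sqlite3; the table and column names (dim_customer, key_lookup, customer_sk) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- warehouse surrogate key
        name        TEXT
    );
    CREATE TABLE key_lookup (
        source_system TEXT,
        source_key    TEXT,
        customer_sk   INTEGER,
        PRIMARY KEY (source_system, source_key)
    );
""")

def resolve_surrogate(source_system, source_key, name):
    """Return the warehouse surrogate key for a source key, creating it if new."""
    row = conn.execute(
        "SELECT customer_sk FROM key_lookup WHERE source_system = ? AND source_key = ?",
        (source_system, source_key)).fetchone()
    if row:
        return row[0]
    sk = conn.execute("INSERT INTO dim_customer (name) VALUES (?)", (name,)).lastrowid
    conn.execute("INSERT INTO key_lookup VALUES (?, ?, ?)", (source_system, source_key, sk))
    return sk

print(resolve_surrogate("billing", "SSN-123", "Alice"))   # new surrogate key assigned
print(resolve_surrogate("billing", "SSN-123", "Alice"))   # same key returned again
```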
The lookup table is used in different ways depending on the nature of the source data. There are five types to consider; three are included here (a sketch contrasting the first two follows the list):
- Type 1
- The dimension row is simply updated to match the current state of the source system; the warehouse does not capture history; the lookup table is used to identify the dimension row to update or overwrite
- Type 2
- A new dimension row is added with the new state of the source system; a new surrogate key is assigned; the source key is no longer unique in the lookup table
- Fully logged
- A new dimension row is added with the new state of the source system, while the previous dimension row is updated to reflect that it is no longer active and its time of deactivation.
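A sketch contrasting Type 1 (overwrite, no history) and Type 2 (add a new row and deactivate the old one); the in-memory dictionaries stand in for the dimension and lookup tables, and the lookup here only tracks the latest surrogate per source key.

```python
from datetime import date

def apply_type1(dimension, lookup, source_key, new_attrs):
    """Type 1: overwrite the existing dimension row; no history is kept."""
    dimension[lookup[source_key]].update(new_attrs)

def apply_type2(dimension, lookup, source_key, new_attrs, next_sk):
    """Type 2: deactivate the current row and add a new one under a new surrogate key."""
    old_sk = lookup[source_key]
    dimension[old_sk]["valid_to"] = date.today()       # record the deactivation time
    dimension[next_sk] = {**new_attrs, "valid_to": None}
    lookup[source_key] = next_sk                       # source key now maps to the latest row

dimension = {1: {"city": "Oslo", "valid_to": None}}
lookup = {"CUST-42": 1}
apply_type2(dimension, lookup, "CUST-42", {"city": "Bergen"}, next_sk=2)
print(dimension)
```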
Tools
By using an established ETL framework, one may increase one's chances of ending up with better connectivity and scalability. A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data. Many ETL vendors now have data profiling, data quality, and metadata capabilities. A common use case for ETL tools includes converting CSV files to formats readable by relational databases. A typical translation of millions of records is facilitated by ETL tools that enable users to input CSV-like data feeds/files and import them into a database with as little code as possible.
ETL tools are typically used by a broad range of professionals - from students in computer science looking to quickly import large data sets to database architects in charge of company account management; ETL tools have become a convenient tool that can be relied on to get maximum performance. ETL tools in most cases contain a GUI that helps users conveniently transform data, using a visual data mapper, as opposed to writing large programs to parse files and modify data types.
While ETL tools have traditionally been for developers and IT staff, the new trend is to provide these capabilities to business users so they can themselves create connections and data integrations when needed, rather than going to the IT staff. Gartner refers to these non-technical users as citizen integrators.
ETL vs. ELT
There are two broad categories of ETL tools. First-generation ETL tools run on-site and connect to local data warehouses that typically run on purpose-built, high-end hardware in the organization's data center. In that environment, it makes sense to do as much of the preparatory work (i.e., the transformation) as possible before loading the data, to avoid consuming the expensive compute cycles that analysts need on the special hardware.
Since 2013, however, cloud-based data warehouses such as Amazon Redshift, Google BigQuery, and Snowflake Computing have been able to provide nearly unlimited computing power. This lets businesses forgo preload transformations and replicate raw data into their data warehouses, where they can transform it as needed using SQL. With modern tools, ETL, in essence, becomes ELT. Many still use "ETL" to refer to this type of data pipeline.
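A toy sketch of the ELT pattern described above, with sqlite3 standing in for a cloud warehouse: the raw data is loaded as-is, and the transformation is expressed afterwards in SQL inside the database. The table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: replicate the raw data into the warehouse without transforming it first.
conn.execute("CREATE TABLE raw_sales (qty INTEGER, unit_price REAL, region TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)",
                 [(2, 10.0, "north"), (1, 5.0, "north"), (4, 2.5, "south")])

# Transform: done after loading, inside the database, using SQL.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(qty * unit_price) AS total_sales
    FROM raw_sales
    GROUP BY region
""")
print(conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
```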
See also
- Architectural pattern (EA reference architecture)
- Create, read, update and delete (CRUD)
- Data cleansing
- Data integration
- Data mart
- Data mediation
- Data migration
- Electronic Data Interchange (EDI)
- Enterprise architecture
- Expense and cost recovery system (ECRS)
- Hartmann pipeline
- Legal Electronic Data Exchange Standard (LEDES)
- Metadata discovery
- Online analytics processing
- Spatial ETL