1. Extract (E) - Retrieving Data:
- SELECT Queries: These are fundamental for pulling data from source tables.
- Example:
SELECT * FROM sales_data WHERE region = 'Bengaluru';
- JOIN Queries: Used to combine data from multiple tables based on a related column.
- Example:
SELECT s.order_id, c.customer_name FROM sales_data s JOIN customers c ON s.customer_id = c.customer_id;
- WHERE Clause: Filters data based on specific conditions.
- Example:
SELECT * FROM products WHERE category = 'Electronics' AND price > 100;
2. Transform (T) - Modifying Data:
- UPDATE Queries: Modify existing data in a table.
- Example:
UPDATE products SET price = price * 0.9 WHERE category = 'Electronics';
- CASE Statements: Perform conditional transformations based on data values.
- Example:
SELECT customer_id, CASE WHEN age < 18 THEN 'Minor' ELSE 'Adult' END AS age_group FROM customers;
- String Functions: Manipulate text data (e.g., uppercase, lowercase, substring).
- Example:
SELECT UPPER(product_name) AS product_name FROM products;
- Date Functions: Work with date and time data (e.g., extracting year, month, day).
- Example:
SELECT YEAR(order_date) AS order_year FROM sales_data;
- Aggregation Functions: Calculate summary statistics (e.g., SUM, AVG, COUNT).
- Example:
SELECT COUNT(*) AS total_orders FROM sales_data;
3. Load (L) - Writing Data:
- INSERT INTO Queries: Insert transformed data into the target table.
- Example:
INSERT INTO data_warehouse_table (customer_id, customer_name, age_group) SELECT customer_id, customer_name, age_group FROM transformed_data;
- CREATE TABLE Queries: Create the target table if it doesn't exist.
- Example:
CREATE TABLE data_warehouse_table (customer_id INT, customer_name VARCHAR(255), age_group VARCHAR(255));
- TRUNCATE TABLE Queries: Clear the target table before loading new data.
- Example:
TRUNCATE TABLE data_warehouse_table;
- DELETE Queries: Remove unwanted data from the target table.
- Example:
DELETE FROM data_warehouse_table WHERE customer_id = 123;
- MERGE Queries: Perform updates and inserts in a single query.
- Example:
MERGE INTO data_warehouse_table AS target USING transformed_data AS source ON target.customer_id = source.customer_id WHEN MATCHED THEN UPDATE SET target.customer_name = source.customer_name WHEN NOT MATCHED THEN INSERT (customer_id, customer_name) VALUES (source.customer_id, source.customer_name);
To check if all the records from the source system are loaded into the target system, you might use a simple count query on both systems and compare the results:
-- Count records in the source table
SELECT COUNT(*) AS source_count FROM source_table;
-- Count records in the target table
SELECT COUNT(*) AS target_count FROM target_table;
To validate that the transformation rules have been applied correctly, you can write queries that reflect these rules and compare the output against expected results. Suppose a rule states that all emails in the customer data should be converted to lowercase:
-- Check if email addresses are transformed to lowercase
-- (BINARY forces a case-sensitive comparison in MySQL)
SELECT email
FROM target_table
WHERE BINARY email != LOWER(email);
This query selects any email from the target table that does not match its lowercase version, highlighting an issue with the transformation rule.
You can also run data quality checks, for example for null values, duplicates, or correct data ranges. Here’s how you might check for null values in mandatory fields:
-- Check for NULL values in a mandatory column
SELECT COUNT(*)
FROM target_table
WHERE mandatory_column IS NULL;
To check for duplicate records:
-- Check for duplicate records based on a unique identifier
SELECT unique_identifier, COUNT(*)
FROM target_table
GROUP BY unique_identifier
HAVING COUNT(*) > 1;
To check that values fall within acceptable ranges:
-- Check that data falls within acceptable ranges
SELECT *
FROM table_name
WHERE numeric_column < lower_bound OR numeric_column > upper_bound;
To verify that data from different sources has been integrated correctly, you might need to compare fields that should be consistent across the tables:
-- Compare matching rows across two tables and flag mismatched values
SELECT a.key_field, a.other_field, b.other_field
FROM table_a a
JOIN table_b b ON a.key_field = b.key_field
WHERE a.other_field != b.other_field;
Aggregating
Aggregating summarizes detailed rows into higher-level values such as sums, averages, and counts:
SELECT product_category, SUM(sale_price) AS total_sales
FROM sales_data
GROUP BY product_category;
Cleaning
Data cleaning refers to identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset. It involves a range of tasks (sketched after this list), including:
- Removing duplicates and correcting errors.
- Standardizing data and converting it to a specific format, such as converting all dates to a consistent format or converting currency values to a standard currency.
- Handling missing data by imputing missing values or removing records with missing data.
- Identifying and removing outliers to get rid of records with extreme values that may skew the analysis.
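A minimal sketch combining several of these tasks in one query; the raw_customers table and its email and signup_date columns are hypothetical names for illustration:
SELECT DISTINCT                              -- remove exact duplicate rows
       LOWER(TRIM(email)) AS email,          -- standardize email formatting
       CAST(signup_date AS DATE) AS signup_date  -- convert to a consistent date type
FROM raw_customers
WHERE email IS NOT NULL;                     -- drop records missing a mandatory field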
Deduplication
Use case
We store our JIRA changelog to run analysis on how tickets and issues are processed. For each support ticket, we can see the change and the date associated with it, so there are as many rows as times the ticket was updated. In this situation, however, we want to see only the latest status of the ticket: we need to deduplicate the data and keep only the latest row per ticket.
To do so, we make a first query to select the ID and last update date using a MAX function, then we use a join to get the data needed, as in the sketch below.
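A minimal sketch of that two-step pattern; the jira_changelog table and its ticket_id, status, and updated_at columns are assumed names:
-- Step 1 (inner query): find the last update date per ticket with MAX.
-- Step 2 (join): keep only the row matching that latest date.
SELECT c.ticket_id, c.status, c.updated_at
FROM jira_changelog c
JOIN (
    SELECT ticket_id, MAX(updated_at) AS last_update
    FROM jira_changelog
    GROUP BY ticket_id
) latest
  ON c.ticket_id = latest.ticket_id
 AND c.updated_at = latest.last_update;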
Derivation
SELECT order_id, order_date, order_amount, DATE_TRUNC('month', order_date) AS order_month
FROM orders;
Use case
Let’s say we have a table with prices in Euro, and a table with the exchange rate between Euro and British Pound Sterling. We can use data derivation to compute the price value in British Pounds using the latest exchange rate, as in the sketch below.
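A minimal sketch, assuming hypothetical prices and exchange_rates tables with the columns shown:
-- Derive a GBP price from the EUR price and the most recent exchange rate.
SELECT p.product_id,
       p.price_eur,
       ROUND(p.price_eur * r.eur_to_gbp, 2) AS price_gbp
FROM prices p
CROSS JOIN (
    SELECT eur_to_gbp
    FROM exchange_rates
    ORDER BY rate_date DESC   -- latest rate first
    LIMIT 1
) r;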
Filtering
Use case
Let’s take our first survey example. If we want to filter the data to exclude answers provided by children, we can easily do so by using a WHERE clause and adding the condition that the age must be over 18, as in the sketch below.
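A minimal sketch, assuming a hypothetical survey_responses table with an age column:
-- Keep only responses from adults.
SELECT *
FROM survey_responses
WHERE age > 18;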
Integration
This technique maps data elements from different sources into one dataset, providing a holistic view of the data. The manufacturing industry, for example, uses it to integrate data from different customer service channels to provide a unified customer experience.
Use case
We have 3 different suppliers for car engines. Each provides their pricing, but they don't use the same name for the same product category "Diesel Engine". With the data integration transformation in ETL, we can use a mapping table to standardize the product category name and make sure we can use a single dataset to analyze data from multiple sources, as in the sketch below.
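A minimal sketch, assuming hypothetical supplier_prices and category_mapping tables:
-- Standardize supplier-specific category names through a mapping table.
SELECT sp.supplier_id,
       cm.standard_category,   -- e.g. 'Diesel Engine'
       sp.raw_category,        -- the supplier's own label
       sp.price
FROM supplier_prices sp
JOIN category_mapping cm
  ON sp.raw_category = cm.raw_category;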
Joining
Use case
Let’s say we have one table with the list of open support tickets and a customer ID associated with each, and another table with the accounts data. We can join the two tables using the customer_id to retrieve the number of tickets per account and the account's main contact, in order to contact them and prioritize which account needs more support. A sketch follows.
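A minimal sketch, assuming hypothetical support_tickets and accounts tables:
-- Count open tickets per account, alongside the main contact to reach out to.
SELECT a.customer_id,
       a.main_contact,
       COUNT(t.ticket_id) AS open_tickets
FROM accounts a
JOIN support_tickets t ON t.customer_id = a.customer_id
GROUP BY a.customer_id, a.main_contact
ORDER BY open_tickets DESC;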
Mapping (see Integration above)
Sorting and ordering
Sorting and ordering are techniques used in data management and analysis to arrange data in a specific sequence based on certain criteria: sorting refers to the process of arranging data by some key, while ordering refers to the direction of that arrangement, ascending or descending.
Suppose you have a database of customer orders, and you want to retrieve the top 10 orders by order value. To do this, you would use a SQL query that includes sorting and ordering clauses, such as:
SELECT customer_name, order_date, order_value
FROM orders
ORDER BY order_value DESC
LIMIT 10;
Splitting
Data splitting is a technique used in data management and analytics to divide a large data set into smaller, more manageable subsets. One common method is to randomly split the data set into two or more subsets, with each subset containing a proportionate sample of the original data. This can be useful for data scientists who need to perform data modeling or train machine learning algorithms, where it is necessary to have a representative sample of the data for testing and validation.
Use case
When a data vendor provides the entirety of the dataset on a regular basis, we may want to split out a subset containing only the last month of data. This last month of data can be stored separately to allow faster queries and improve data processing efficiency.
SELECT * FROM DEMO_DB.UK_DEMO.RAW_DATA WHERE date > DATEADD(MONTH, -1, CURRENT_DATE());
Fact Table
- Purpose: Stores the core, measurable data or "facts" of a business process, such as sales transactions, website visits, or product inventory.
- Characteristics: Typically contains numerical data (e.g., sales amount, quantity, cost); often includes foreign keys that link to dimension tables; serves as the central table in a star or snowflake schema, with dimension tables surrounding it; designed to support aggregation and numerical analysis.
- Example: A table storing sales data might have columns for sale_id, product_id, customer_id, date_id, quantity, and sales_amount.
Dimension Table
- Purpose: Provides descriptive information and context for the facts stored in the fact table.
- Characteristics: Contains attributes that help categorize or classify the data (e.g., product name, customer demographics, time period); typically contains text or categorical data; often includes a primary key that uniquely identifies each dimension member (e.g., product_id, customer_id, date_id); designed for efficient querying and data organization.
- Example: A product dimension table might have columns for product_id, product_name, category, and description.
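A minimal DDL sketch of the two example tables just described; the table names dim_product and fact_sales and the column types are assumptions:
-- Dimension table: descriptive attributes keyed by product_id.
CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(255),
    category     VARCHAR(255),
    description  VARCHAR(255)
);
-- Fact table: numeric measures plus foreign keys into the dimensions.
CREATE TABLE fact_sales (
    sale_id      INT PRIMARY KEY,
    product_id   INT REFERENCES dim_product (product_id),
    customer_id  INT,
    date_id      INT,
    quantity     INT,
    sales_amount DECIMAL(10, 2)
);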
SCD (Slowly Changing Dimension)
- Type 0: Retain Original – Attributes that do not change and always retain the original value without tracking history, such as zip codes.
- Type 1: Overwrite – Old data is replaced with new data without retaining historical records, commonly seen in customer address updates.
- Type 2: Add New Row – A new record is added for each change, preserving history in the table, like customer address changes with timestamps (sketched after this list).
- Type 3: Add New Attribute – Stores the current and one previous value in columns, useful for product pricing tracking current and prior prices.
- Type 4: Add History Table – Current data stays in the main table, while a separate history table holds past data, such as product price changes.
- Type 6: Combined Approach – It is called Type 6 because it is a hybrid of Types 1, 2, and 3, where columns hold current and historical data, and a version column tracks changes, like employee role and department updates.
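A minimal sketch of the Type 2 pattern, assuming a hypothetical dim_customer table that tracks history with effective_from, effective_to, and is_current columns:
-- Close out the current version of the changed row...
UPDATE dim_customer
SET effective_to = CURRENT_DATE,
    is_current   = FALSE
WHERE customer_id = 123
  AND is_current = TRUE;
-- ...and add a new row carrying the new address.
INSERT INTO dim_customer
    (customer_id, address, effective_from, effective_to, is_current)
VALUES
    (123, '42 New Street', CURRENT_DATE, NULL, TRUE);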
Schemas in ETL
Star Schema
Each dimension in a star schema is represented with only one dimension table.
This dimension table contains the set of attributes.
Consider, for example, the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the measures, namely dollars sold and units sold.
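A minimal sketch of a typical star-schema query over such tables; the table and column names (fact_sales, dim_time, dim_location) are assumptions:
-- Aggregate the fact table's measures by attributes from two dimensions.
SELECT t.year,
       l.city,
       SUM(f.dollars_sold) AS total_dollars_sold,
       SUM(f.units_sold)   AS total_units_sold
FROM fact_sales f
JOIN dim_time     t ON f.time_key     = t.time_key
JOIN dim_location l ON f.location_key = l.location_key
GROUP BY t.year, l.city;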
Snowflake Schema
Some dimension tables in the snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is split into two dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
The supplier_key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
Due to normalization in the snowflake schema, redundancy is reduced; the schema therefore becomes easier to maintain and saves storage space.
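A minimal DDL sketch of that normalized split; the table names and column types are assumptions:
-- Supplier attributes move into their own table...
CREATE TABLE dim_supplier (
    supplier_key  INT PRIMARY KEY,
    supplier_type VARCHAR(255)
);
-- ...and the item dimension keeps only a reference to it.
CREATE TABLE dim_item (
    item_key     INT PRIMARY KEY,
    item_name    VARCHAR(255),
    type         VARCHAR(255),
    brand        VARCHAR(255),
    supplier_key INT REFERENCES dim_supplier (supplier_key)
);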
Fact Constellation Schema (Galaxy Schema)
A fact constellation has multiple fact tables. It is also known as a galaxy schema.
Consider, for example, two fact tables, namely sales and shipping.
The sales fact table is the same as that in the star schema.
The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
The shipping fact table also contains two measures, namely dollars sold and units sold.
It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
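A minimal sketch of querying the shipping fact table through the shared time and item dimensions; the table and column names (fact_shipping, dim_time, dim_item) are assumptions:
-- The shipping fact table reuses the same time and item dimensions as sales.
SELECT t.year,
       i.item_name,
       SUM(sh.units_sold) AS total_units_sold
FROM fact_shipping sh
JOIN dim_time t ON sh.time_key = t.time_key
JOIN dim_item i ON sh.item_key = i.item_key
GROUP BY t.year, i.item_name;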