1. Extract (E) - Retrieving Data:
- SELECT Queries: These are fundamental for pulling data from source tables.
- Example:
SELECT * FROM sales_data WHERE region = 'Bengaluru';
- JOIN Queries: Used to combine data from multiple tables based on a related column.
- Example:
SELECT s.order_id, c.customer_name FROM sales_data s JOIN customers c ON s.customer_id = c.customer_id;
- WHERE Clause: Filters data based on specific conditions.
- Example:
SELECT * FROM products WHERE category = 'Electronics' AND price > 100;
2. Transform (T) - Modifying Data:
- UPDATE Queries: Modify existing data in a table.
- Example:
UPDATE products SET price = price * 0.9 WHERE category = 'Electronics';
- CASE Statements: Perform conditional transformations based on data values.
- Example:
SELECT customer_id, CASE WHEN age < 18 THEN 'Minor' ELSE 'Adult' END AS age_group FROM customers;
- String Functions: Manipulate text data (e.g., uppercase, lowercase, substring).
- Example:
SELECT UPPER(product_name) AS product_name FROM products;
- Date Functions: Work with date and time data (e.g., extracting year, month, day).
- Example:
SELECT YEAR(order_date) AS order_year FROM sales_data;
- Aggregation Functions: Calculate summary statistics (e.g., SUM, AVG, COUNT).
- Example:
SELECT COUNT(*) AS total_orders FROM sales_data;
3. Load (L) - Writing Data:
- INSERT INTO Queries: Insert transformed data into the target table.
- Example:
INSERT INTO data_warehouse_table (customer_id, customer_name, age_group) SELECT customer_id, customer_name, age_group FROM transformed_data;
- CREATE TABLE Queries: Create the target table if it doesn't exist.
- Example:
CREATE TABLE data_warehouse_table (customer_id INT, customer_name VARCHAR(255), age_group VARCHAR(255));
- TRUNCATE TABLE Queries: Clear the target table before loading new data.
- Example:
TRUNCATE TABLE data_warehouse_table;
- DELETE Queries: Remove unwanted data from the target table.
- Example:
DELETE FROM data_warehouse_table WHERE customer_id = 123;
- MERGE Queries: Perform updates and inserts in a single query.
- Example:
MERGE INTO data_warehouse_table AS target USING transformed_data AS source ON target.customer_id = source.customer_id WHEN MATCHED THEN UPDATE SET target.customer_name = source.customer_name WHEN NOT MATCHED THEN INSERT (customer_id, customer_name) VALUES (source.customer_id, source.customer_name);
To check if all the records from the source system are loaded into the target system, you might use a simple count query on both systems and compare the results:
-- Count records in the source table
SELECT COUNT(*) AS source_count FROM source_table;
-- Count records in the target table
SELECT COUNT(*) AS target_count FROM target_table;
To validate that the transformation rules have been applied correctly, you can write queries that reflect these rules and compare the output against expected results. Suppose a rule states that all emails in the customer data should be converted to lowercase:
-- Check if email addresses are transformed to lowercase
-- (BINARY forces a case-sensitive comparison in MySQL)
SELECT email
FROM target_table
WHERE BINARY email != LOWER(email);
This query selects any email from the target table that does not match its lowercase version, highlighting an issue with the transformation rule.
You can also run data quality checks, for example for null values, duplicates, or correct data ranges. Here’s how you might check for null values in mandatory fields:
-- Check for NULL values in a mandatory column
SELECT COUNT(*)
FROM target_table
WHERE mandatory_column IS NULL;
To check for duplicate records:
-- Check for duplicate records based on a unique identifier
SELECT unique_identifier, COUNT(*)
FROM target_table
GROUP BY unique_identifier
HAVING COUNT(*) > 1;
To check that values fall within acceptable ranges:
-- Check that data falls within acceptable ranges
SELECT *
FROM table_name
WHERE numeric_column < lower_bound OR numeric_column > upper_bound;
To verify that data from different sources has been integrated correctly, you might need to compare fields that should be consistent across the tables:
-- Compare matching rows across two tables and flag mismatched values
SELECT a.key_field, a.other_field, b.other_field
FROM table_a a
JOIN table_b b ON a.key_field = b.key_field
WHERE a.other_field != b.other_field;
Aggregating
Aggregating summarizes detailed rows into higher-level values such as sums, averages, and counts:
SELECT product_category, SUM(sale_price) AS total_sales
FROM sales_data
GROUP BY product_category;
Cleaning
Data cleaning refers to identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset. It involves a range of tasks (sketched after this list), including:
- Removing duplicates and correcting errors.
- Standardizing data and converting it to a specific format, such as converting all dates to a consistent format or converting currency values to a standard currency.
- Handling missing data by imputing missing values or removing records with missing data.
- Identifying and removing outliers to get rid of records with extreme values that may skew the analysis.
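A minimal sketch combining several of these tasks in one query; the raw_customers table and its email and signup_date columns are hypothetical names for illustration:
SELECT DISTINCT                              -- remove exact duplicate rows
       LOWER(TRIM(email)) AS email,          -- standardize email formatting
       CAST(signup_date AS DATE) AS signup_date  -- convert to a consistent date type
FROM raw_customers
WHERE email IS NOT NULL;                     -- drop records missing a mandatory field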
Deduplication
Use case
We store our JIRA changelog to run analysis on how tickets and issues are processed. For each support ticket, we can see the change and the date associated with it, so there are as many rows as times the ticket was updated. In this situation, however, we want to see only the latest status of the ticket: we need to deduplicate the data and keep only the latest row per ticket.
To do so, we make a first query to select the ID and last update date using a MAX function, then we use a join to get the data needed, as in the sketch below.
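A minimal sketch of that two-step pattern; the jira_changelog table and its ticket_id, status, and updated_at columns are assumed names:
-- Step 1 (inner query): find the last update date per ticket with MAX.
-- Step 2 (join): keep only the row matching that latest date.
SELECT c.ticket_id, c.status, c.updated_at
FROM jira_changelog c
JOIN (
    SELECT ticket_id, MAX(updated_at) AS last_update
    FROM jira_changelog
    GROUP BY ticket_id
) latest
  ON c.ticket_id = latest.ticket_id
 AND c.updated_at = latest.last_update;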
Derivation
SELECT order_id, order_date, order_amount, DATE_TRUNC('month', order_date) AS order_month
FROM orders;
Use case
Let’s say we have a table with prices in Euro, and a table with the exchange rate between Euro and British Pound Sterling. We can use data derivation to compute the price value in British Pounds using the latest exchange rate, as in the sketch below.
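A minimal sketch, assuming hypothetical prices and exchange_rates tables with the columns shown:
-- Derive a GBP price from the EUR price and the most recent exchange rate.
SELECT p.product_id,
       p.price_eur,
       ROUND(p.price_eur * r.eur_to_gbp, 2) AS price_gbp
FROM prices p
CROSS JOIN (
    SELECT eur_to_gbp
    FROM exchange_rates
    ORDER BY rate_date DESC   -- latest rate first
    LIMIT 1
) r;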
Filtering
Use case
Let’s take our first survey example. If we want to filter the data to exclude answers provided by children, we can easily do so by using a WHERE clause and adding the condition that the age must be over 18, as in the sketch below.
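A minimal sketch, assuming a hypothetical survey_responses table with an age column:
-- Keep only responses from adults.
SELECT *
FROM survey_responses
WHERE age > 18;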
Integration
This technique maps data elements from different sources into one dataset, providing a holistic view of the data. The manufacturing industry, for example, uses it to integrate data from different customer service channels to provide a unified customer experience.
Use case
We have 3 different suppliers for car engines. Each provides their pricing, but they don't use the same name for the same product category "Diesel Engine". With the data integration transformation in ETL, we can use a mapping table to standardize the product category name and make sure we can use a single dataset to analyze data from multiple sources, as in the sketch below.
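A minimal sketch, assuming hypothetical supplier_prices and category_mapping tables:
-- Standardize supplier-specific category names through a mapping table.
SELECT sp.supplier_id,
       cm.standard_category,   -- e.g. 'Diesel Engine'
       sp.raw_category,        -- the supplier's own label
       sp.price
FROM supplier_prices sp
JOIN category_mapping cm
  ON sp.raw_category = cm.raw_category;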
Joining
Use case
Let’s say we have one table with the list of open support tickets and a customer ID associated with each, and another table with the accounts data. We can join the two tables using the customer_id to retrieve the number of tickets per account and the account's main contact, in order to contact them and prioritize which account needs more support. A sketch follows.
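A minimal sketch, assuming hypothetical support_tickets and accounts tables:
-- Count open tickets per account, alongside the main contact to reach out to.
SELECT a.customer_id,
       a.main_contact,
       COUNT(t.ticket_id) AS open_tickets
FROM accounts a
JOIN support_tickets t ON t.customer_id = a.customer_id
GROUP BY a.customer_id, a.main_contact
ORDER BY open_tickets DESC;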
Mapping (see Integration above)
Sorting and ordering
Sorting and ordering are techniques used in data management and analysis to arrange data in a specific sequence based on certain criteria: sorting refers to the process of arranging data by some key, while ordering refers to the direction of that arrangement, ascending or descending.
Suppose you have a database of customer orders, and you want to retrieve the top 10 orders by order value. To do this, you would use a SQL query that includes sorting and ordering clauses, such as:
SELECT customer_name, order_date, order_value
FROM orders
ORDER BY order_value DESC
LIMIT 10;
Splitting
Data splitting is a technique used in data management and analytics to divide a large data set into smaller, more manageable subsets. One common method is to randomly split the data set into two or more subsets, with each subset containing a proportionate sample of the original data. This can be useful for data scientists who need to perform data modeling or train machine learning algorithms, where it is necessary to have a representative sample of the data for testing and validation.
Use case
When a data vendor provides the entirety of the dataset on a regular basis, we may want to split out a subset containing only the last month of data. This last month of data can be stored separately to allow faster queries and improve data processing efficiency.
SELECT * FROM DEMO_DB.UK_DEMO.RAW_DATA WHERE date > DATEADD(MONTH, -1, CURRENT_DATE());
Fact Table
- Purpose: Stores the core, measurable data or "facts" of a business process, such as sales transactions, website visits, or product inventory.
- Characteristics: Typically contains numerical data (e.g., sales amount, quantity, cost); often includes foreign keys that link to dimension tables; serves as the central table in a star or snowflake schema, with dimension tables surrounding it; designed to support aggregation and numerical analysis.
- Example: A table storing sales data might have columns for sale_id, product_id, customer_id, date_id, quantity, and sales_amount.
Dimension Table
- Purpose: Provides descriptive information and context for the facts stored in the fact table.
- Characteristics: Contains attributes that help categorize or classify the data (e.g., product name, customer demographics, time period); typically contains text or categorical data; often includes a primary key that uniquely identifies each dimension member (e.g., product_id, customer_id, date_id); designed for efficient querying and data organization.
- Example: A product dimension table might have columns for product_id, product_name, category, and description.
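A minimal DDL sketch of the two example tables just described; the table names dim_product and fact_sales and the column types are assumptions:
-- Dimension table: descriptive attributes keyed by product_id.
CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(255),
    category     VARCHAR(255),
    description  VARCHAR(255)
);
-- Fact table: numeric measures plus foreign keys into the dimensions.
CREATE TABLE fact_sales (
    sale_id      INT PRIMARY KEY,
    product_id   INT REFERENCES dim_product (product_id),
    customer_id  INT,
    date_id      INT,
    quantity     INT,
    sales_amount DECIMAL(10, 2)
);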
SCD (Slowly Changing Dimension)
- Type 0: Retain Original – Attributes that do not change and always retain the original value without tracking history, such as zip codes.
- Type 1: Overwrite – Old data is replaced with new data without retaining historical records, commonly seen in customer address updates.
- Type 2: Add New Row – A new record is added for each change, preserving history in the table, like customer address changes with timestamps (sketched after this list).
- Type 3: Add New Attribute – Stores the current and one previous value in columns, useful for product pricing tracking current and prior prices.
- Type 4: Add History Table – Current data stays in the main table, while a separate history table holds past data, such as product price changes.
- Type 6: Combined Approach – It is called Type 6 because it is a hybrid of Types 1, 2, and 3, where columns hold current and historical data, and a version column tracks changes, like employee role and department updates.
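A minimal sketch of the Type 2 pattern, assuming a hypothetical dim_customer table that tracks history with effective_from, effective_to, and is_current columns:
-- Close out the current version of the changed row...
UPDATE dim_customer
SET effective_to = CURRENT_DATE,
    is_current   = FALSE
WHERE customer_id = 123
  AND is_current = TRUE;
-- ...and add a new row carrying the new address.
INSERT INTO dim_customer
    (customer_id, address, effective_from, effective_to, is_current)
VALUES
    (123, '42 New Street', CURRENT_DATE, NULL, TRUE);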
Schemas in ETL
Star Schema
Each dimension in a star schema is represented with only one dimension table.
This dimension table contains the set of attributes.
Consider, for example, the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the measures, namely dollars sold and units sold.
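A minimal sketch of a typical star-schema query over such tables; the table and column names (fact_sales, dim_time, dim_location) are assumptions:
-- Aggregate the fact table's measures by attributes from two dimensions.
SELECT t.year,
       l.city,
       SUM(f.dollars_sold) AS total_dollars_sold,
       SUM(f.units_sold)   AS total_units_sold
FROM fact_sales f
JOIN dim_time     t ON f.time_key     = t.time_key
JOIN dim_location l ON f.location_key = l.location_key
GROUP BY t.year, l.city;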
Snowflake Schema
Some dimension tables in the snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is split into two dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
The supplier_key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
Due to normalization in the snowflake schema, redundancy is reduced; the schema therefore becomes easier to maintain and saves storage space.
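A minimal DDL sketch of that normalized split; the table names and column types are assumptions:
-- Supplier attributes move into their own table...
CREATE TABLE dim_supplier (
    supplier_key  INT PRIMARY KEY,
    supplier_type VARCHAR(255)
);
-- ...and the item dimension keeps only a reference to it.
CREATE TABLE dim_item (
    item_key     INT PRIMARY KEY,
    item_name    VARCHAR(255),
    type         VARCHAR(255),
    brand        VARCHAR(255),
    supplier_key INT REFERENCES dim_supplier (supplier_key)
);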
Fact Constellation Schema (Galaxy Schema)
A fact constellation has multiple fact tables. It is also known as a galaxy schema.
Consider, for example, two fact tables, namely sales and shipping.
The sales fact table is the same as that in the star schema.
The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
The shipping fact table also contains two measures, namely dollars sold and units sold.
It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
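A minimal sketch of querying the shipping fact table through the shared time and item dimensions; the table and column names (fact_shipping, dim_time, dim_item) are assumptions:
-- The shipping fact table reuses the same time and item dimensions as sales.
SELECT t.year,
       i.item_name,
       SUM(sh.units_sold) AS total_units_sold
FROM fact_shipping sh
JOIN dim_time t ON sh.time_key = t.time_key
JOIN dim_item i ON sh.item_key = i.item_key
GROUP BY t.year, i.item_name;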