How to remove duplicates in SQL?

Question

Design Gurus · Accepted Answer

Removing duplicates in SQL is a common task that helps maintain data integrity and keeps query results accurate. Duplicates typically arise from data entry errors, import issues, or missing constraints. This guide walks through identifying and removing duplicate records using different methods in SQL, with an example for each approach.

1. Understanding Duplicates in SQL

Duplicates in a database refer to records where certain columns have identical values across multiple rows. The definition of a duplicate depends on which columns you consider for comparison. For example, two rows might have the same Email but different EmployeeIDs, or they might have identical values across all columns except for a unique identifier.

2. Identifying Duplicates

Before removing duplicates, it's essential to identify them. Here's how you can find duplicate records based on specific columns.

Example Scenario:

Consider a table Employees with the following structure:

| EmployeeID | FirstName | LastName | Email | Department |
| --- | --- | --- | --- | --- |
| 1 | John | Doe | john.doe@example.com | Sales |
| 2 | Jane | Smith | jane.smith@example.com | Marketing |
| 3 | John | Doe | john.doe@example.com | Sales |
| 4 | Alice | Johnson | alice.j@example.com | IT |
| 5 | John | Doe | john.doe@example.com | Sales |
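
To follow along, the sample table can be recreated with a script like the one below. This is a minimal sketch: the column types and the primary key on EmployeeID are assumptions, since the original only lists the column names and values.

```sql
-- Sketch: recreate the sample Employees table.
-- Column types and the PRIMARY KEY are assumed, not given in the original.
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName  VARCHAR(50),
    LastName   VARCHAR(50),
    Email      VARCHAR(100),
    Department VARCHAR(50)
);

-- Load the five sample rows, including the three that share an email.
INSERT INTO Employees (EmployeeID, FirstName, LastName, Email, Department)
VALUES
    (1, 'John',  'Doe',     'john.doe@example.com',   'Sales'),
    (2, 'Jane',  'Smith',   'jane.smith@example.com', 'Marketing'),
    (3, 'John',  'Doe',     'john.doe@example.com',   'Sales'),
    (4, 'Alice', 'Johnson', 'alice.j@example.com',    'IT'),
    (5, 'John',  'Doe',     'john.doe@example.com',   'Sales');
```

Running the delete examples against a scratch copy like this one, rather than a production table, makes it easy to reload the rows and compare methods.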

Identifying Duplicate Emails:

SELECT Email, COUNT(*)
FROM Employees
GROUP BY Email
HAVING COUNT(*) > 1;

Result:

| Email | COUNT(*) |
| --- | --- |
| john.doe@example.com | 3 |

This query shows that the email john.doe@example.com appears three times in the Employees table, indicating duplicates.
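
The GROUP BY query reports which emails repeat, but not which rows carry them. One way to list the full duplicate rows is a window count, sketched below. It assumes an engine with window function support; the alias names counted and EmailCount are illustrative.

```sql
-- Sketch: list every row whose Email occurs more than once.
-- COUNT(*) OVER (...) attaches the per-email row count to each row
-- without collapsing the rows the way GROUP BY does.
SELECT EmployeeID, FirstName, LastName, Email, Department
FROM (
    SELECT
        EmployeeID, FirstName, LastName, Email, Department,
        COUNT(*) OVER (PARTITION BY Email) AS EmailCount
    FROM Employees
) counted
WHERE EmailCount > 1
ORDER BY Email, EmployeeID;
```

Against the sample data this returns the rows with EmployeeID 1, 3, and 5.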

3. Methods to Remove Duplicates

There are several methods to remove duplicates in SQL. Below are the most common and effective approaches:

a. Using Common Table Expressions (CTEs) with ROW_NUMBER()

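This method uses ROW_NUMBER() to number the rows within each partition, where a partition is a group of rows that share the same values in the columns that define a duplicate. Numbering restarts at 1 in every partition, so rows with a row number greater than 1 are duplicates and can be deleted.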

Steps:

Use a CTE to select all records and assign row numbers partitioned by the columns that define duplicates.
Delete records where the row number is greater than 1.

Example:

WITH CTE_Duplicates AS (
    SELECT 
        EmployeeID,
        FirstName,
        LastName,
        Email,
        Department,
        ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
DELETE FROM Employees
WHERE EmployeeID IN (
    SELECT EmployeeID
    FROM CTE_Duplicates
    WHERE rn > 1
);

Explanation:

The CTE CTE_Duplicates partitions the Employees table by Email and assigns a row number (rn) to each record within its partition, ordered by EmployeeID.
The DELETE statement removes every record whose EmployeeID appears in the CTE with rn > 1, so only the record with the lowest EmployeeID is kept for each Email.
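
Before running the DELETE, it can help to preview what it would remove. The sketch below reuses the same CTE but swaps the DELETE for a SELECT, so nothing is modified.

```sql
-- Sketch: dry run of the CTE method.
-- Same partitioning and ordering as the DELETE, but read-only.
WITH CTE_Duplicates AS (
    SELECT
        EmployeeID,
        Email,
        ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
SELECT EmployeeID, Email, rn
FROM CTE_Duplicates
WHERE rn > 1          -- these are the rows the DELETE would remove
ORDER BY Email, rn;
```

With the sample data the preview lists EmployeeID 3 (rn = 2) and EmployeeID 5 (rn = 3), leaving EmployeeID 1 as the row that is kept.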

b. Using a Subquery with ROW_NUMBER()

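This approach applies the same ROW_NUMBER() logic as the CTE method, but places it in a derived table (a subquery in the FROM clause) that is joined back to Employees. The DELETE ... FROM ... JOIN form shown here is SQL Server and MySQL syntax; PostgreSQL, for example, uses DELETE ... USING instead.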

Example:

DELETE e
FROM Employees e
INNER JOIN (
    SELECT 
        EmployeeID,
        ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
) dup ON e.EmployeeID = dup.EmployeeID
WHERE dup.rn > 1;

Explanation:

The subquery assigns a row number to each record, partitioned by Email and ordered by EmployeeID.
The INNER JOIN matches each row of Employees to its row number through EmployeeID.
The WHERE clause restricts the delete to rows where rn > 1, so the row with the lowest EmployeeID is kept for each Email.
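
On engines that do not accept a join inside DELETE, the same subquery can be moved into an IN list. The following is a sketch of that variant, not a form given in the original; it assumes window function support, and the alias dup is illustrative.

```sql
-- Sketch: subquery method without DELETE ... JOIN.
-- The inner query numbers rows per Email; the outer filter picks
-- every row after the first, and DELETE removes them by key.
DELETE FROM Employees
WHERE EmployeeID IN (
    SELECT EmployeeID
    FROM (
        SELECT
            EmployeeID,
            ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
        FROM Employees
    ) dup
    WHERE rn > 1
);
```

The result is the same as the join version: for each Email, only the row with the lowest EmployeeID remains.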

c. Using Self-Joins

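This method joins the table to itself so that each duplicate row is paired with an earlier row that has the same value. It needs no window functions, but the DELETE ... FROM ... JOIN form shown here is SQL Server and MySQL syntax.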

Example:

DELETE e1
FROM Employees e1
INNER JOIN Employees e2 
    ON e1.Email = e2.Email 
    AND e1.EmployeeID > e2.EmployeeID;

Explanation:

The table Employees is joined to itself (as e1 and e2) on the Email column.
The condition e1.EmployeeID > e2.EmployeeID matches a row e1 whenever another row with the same Email and a lower EmployeeID exists, and every matched e1 row is deleted.
Only the record with the lowest EmployeeID survives for each Email; in the sample data, EmployeeIDs 3 and 5 are removed and EmployeeID 1 is kept.
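
The same self-comparison can also be written as a correlated EXISTS, which avoids the join syntax. This is a sketch under the assumption that the engine allows a subquery on the table being deleted from (PostgreSQL and SQL Server do; MySQL typically rejects it). The follow-up query re-runs the duplicate check, with Occurrences as an illustrative alias.

```sql
-- Sketch: self-comparison written with EXISTS instead of a join.
-- A row is deleted when an earlier row (lower EmployeeID) has the same Email.
DELETE FROM Employees
WHERE EXISTS (
    SELECT 1
    FROM Employees e2
    WHERE e2.Email = Employees.Email
      AND e2.EmployeeID < Employees.EmployeeID
);

-- Verification: this should return no rows once duplicates are gone.
SELECT Email, COUNT(*) AS Occurrences
FROM Employees
GROUP BY Email
HAVING COUNT(*) > 1;
```

In both the join and EXISTS forms, rows whose Email is NULL never match each other, because NULL = NULL does not evaluate to true, so they are left untouched.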

How to remove duplicates in SQL?

1. Understanding Duplicates in SQL

2. Identifying Duplicates

3. Methods to Remove Duplicates

a. Using Common Table Expressions (CTEs) with `ROW_NUMBER()`

b. Using a Subquery with `ROW_NUMBER()`

c. Using Self-Joins

d. Creating a Temporary Table with Distinct Records

e. Using `GROUP BY` and Aggregate Functions to Identify Duplicates

4. Choosing the Right Method

5. Best Practices

6. Preventing Duplicates

7. Conclusion
