Handling Duplicate Rows in MySQL: 3 Techniques from Quick Fixes to Heavy-Duty Solutions

MySQL tutorial - IT technology blog
MySQL tutorial - IT technology blog

Duplicate Data: When Your Database Suddenly Starts “Breeding” Junk

Work with databases long enough, and you’ll inevitably face a frustrating situation: a data table where rows should be unique, yet identical records appear. This usually stems from application-side logic bugs, missing UNIQUE constraints, or errors when importing millions of rows from CSV/Excel files.

Early in my career, I had a heart-stopping experience with a payment system. Due to a missing duplicate check at the database layer, a network glitch caused insert commands to repeat, charging 5,200 customers twice. That morning, I was sweating as I sifted through the redundant data to process refunds. This hard-learned lesson left me obsessed with data integrity and always prepared with periodic “cleanup” scripts.

Here are the 3 most practical ways to wipe out redundant data without hanging your system.

Which Method Should You Choose to Delete Duplicates?

In practice, there is no single “best” way for every scenario. Your choice depends on the MySQL version you are running and the size of the data table (from thousands to tens of millions of records).

  • Method 1: CTE (Common Table Expressions) & ROW_NUMBER(): Modern, readable, and very transparent. Requires MySQL 8.0+.
  • Method 2: Self-Join: The traditional but powerful way. Runs smoothly on all versions from 5.6 onwards.
  • Method 3: Temporary Table: The “slow and steady” approach. Best for massive tables over 10GB to avoid long table locks.

Technical Pros and Cons Analysis

1. Common Table Expressions (CTE)

Pros: Clean code. You can run a SELECT first to preview rows marked for deletion before hitting the actual execute button.

Cons: Only supported from MySQL 8.0+. If your project is still running on 5.7, this method won’t work.

2. Self-Join

Pros: Version-independent. High performance if you have indexed the comparison columns.

Cons: A single wrong comparison operator (using > instead of <) could lead to deleting the original record instead of the duplicate.

3. Temporary Table

Pros: Absolutely safe. You copy clean data to a separate location before overwriting, minimizing the risk of system deadlocks (database contention).

Cons: Consumes extra disk space and takes time to copy if the table has tens of millions of rows.

Practical Implementation Advice

Based on my experience:

  • If the table is under 500,000 rows and you’re using MySQL 8.0+: Use CTE for simplicity.
  • Use Self-Join if you are maintaining legacy systems.
  • Prioritize Temporary Tables if the table is dozens of GBs or handling thousands of requests per second.

Detailed Implementation Guide

Suppose your customers table has duplicate emails. We need to keep the record with the smallest id and delete the later arrivals.

CREATE TABLE customers (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100),
    email VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 1: Identify the “Enemy”

First, let’s count how many groups of data are duplicated:

SELECT email, COUNT(email) 
FROM customers 
GROUP BY email 
HAVING COUNT(email) > 1;

Step 2: Handle with CTE (MySQL 8.0+)

We assign a sequential number to each email group. The first row gets 1, and subsequent duplicates get 2, 3…

WITH cte AS (
    SELECT id, 
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) as row_num
    FROM customers
)
DELETE FROM customers 
WHERE id IN (SELECT id FROM cte WHERE row_num > 1);

Step 3: Handle with Self-Join (Most Common)

The logic here is to compare the customers table with itself. We find rows with matching emails but an ID larger than the original row.

DELETE c1 FROM customers c1
INNER JOIN customers c2 
ON c1.email = c2.email 
WHERE c1.id > c2.id;

Pro tip: If you want to keep the newest customer instead, just change it to c1.id < c2.id.

Step 4: The “Bulletproof” Method using Temporary Tables

Often used during late-night maintenance to ensure zero errors:

  1. Filter unique IDs into a temporary table:
CREATE TEMPORARY TABLE unique_ids AS
SELECT MIN(id) as id
FROM customers
GROUP BY email;
  1. Wipe out all records not in this “safe” list:
DELETE FROM customers 
WHERE id NOT IN (SELECT id FROM unique_ids);

“Hard-learned” Notes Before Execution

Never be overconfident in your SQL skills. Before pressing Enter on a DELETE command, remember:

  1. Always Backup: Use mysqldump to back up the table. It takes only 2 minutes but can save your career if something goes wrong.
  2. Run a SELECT first: Replace DELETE with SELECT * to see if the list of rows to be deleted is correct.
  3. Check Indexes: Ensure the email column is indexed. Otherwise, a large table deletion will hang the server, causing service interruption.

I hope these tips help you clean up your database with confidence. If you encounter any strange errors or have better optimization methods, feel free to leave a comment below so we can discuss!

Share: