Duplicate records in the database are old age problem that almost every organization has to deal with that.Duplicate may appear due to reckless data input, merge records from old systems into new systems, upload lists purchased and many other reasons.Identifying these duplicate records can also be difficult. You can have many people with the first and last names.You can have a person with many addresses, emails or other identifying characteristics.Here we will follow some basic steps to find and remove duplicate records in the database.
- The below script creates Employee table and inserts sample data into the Employee table for the demonstration.
CREATE TABLE Employee ( id INT PRIMARY KEY identity, first_name VARCHAR(50) NOT NULL, last_name VARCHAR(50) NOT NULL, email VARCHAR(255) NOT NULL ); INSERT INTO Employee (first_name,last_name,email) VALUES ('john ','Schmitt','john.schmitt@JCPenney.com'), ('Joney','dey','joney.dey@walmart.com'), ('Park','decosta','Park.decosta@gmail.com'), ('Mat ','Prior','Mat.Prior@amazon.com'), ('amazon','Jonas','nick.Jonas@mac.com'), ('Tom ','Cruise','Tom.Cruise@iphone11.com'), ('microsoft','Kelly','susan.Kelly@comcast.net'), ('Albert ','Pintos','Albert.Pintos@apple.net'), ('Holland','liferway','Holland.liferway@yahoo.com'), ('Alex','Durby','Alex.Durby@gmail.com.com'), ('Silvester','Stolen','Silvester.Stolen@rediff.com'), ('scarlett','Johnson','scarlett.Johnson@microsoft.com'), ('ebay','Kelly','susan.Kelly@comcast.net'), ('Mat ','Prior','Mat.Prior@apple.com'), ('JCPenney','liferway','Holland.liferway@ebay.com');
After execution of above script, we are getting the records in a table like below:-
- Going to check the duplicate in Employee table ,The following query returns the duplicate records in the Employee table:-
SELECT email, COUNT(email) AS Duplicates FROM Employee GROUP BY email HAVING COUNT(email) > 1;
Here we can see, we have three rows with duplicate records.
- Delete duplicate rows using DELETE JOIN statement
DELETE t1 FROM Employee t1 INNER JOIN Employee t2 on t1.id < t2.id AND t1.email = t2.email;
It indicated that three rows have been deleted but we can execute the query again to find duplicate records:-
SELECT email, COUNT(email) FROM Employee GROUP BY email HAVING COUNT(email) > 1;
The above query returns an empty set, which means that the duplicate rows have been deleted from employee table.
- Now verify data from the Employee table where we are able to find all unique records
SELECT * FROM Employee
The duplicate rows with id 4, 7, and 9 have been deleted from employee table