During teaching a SQL class I was asked how to delete duplicated records from a table without having a primary key. Since they used Advantage Database Server, the answer was simple: Use the virtual column ‚ROWID‘ as primary key.
For simplicity, let’s start with a table containing only two columns.
CREATE TABLE customer( PreName CHAR(30), LastName CHAR(30));
Then add a couple of records. As you can see, Peter Parker and Homer Simpson are duplicated, and Bruce Wayne even tripled.
INSERT INTO customer VALUES ('Peter','Parker'), ('Peter','Parker'), ('Robin','Hood'), ('Homer', 'Simpson'), ('Homer', 'Simpson'), ('Bruce','Wayne'), ('Bruce','Wayne'), ('Bruce','Wayne'); SELECT PreName, LastName FROM customer ORDER BY PreName, LastName;
You can get the number of duplicates using a COUNT aggregation on the two fields.
SELECT PreName, LastName, COUNT(*) FROM customer GROUP BY PreName, LastName
Most tutorials now use a primary key column to identify the duplicates for deletion. In our case, we use the virtual ROWID. Let’s take a look at this column. ROWID returns an 18 characters string containing a Base64 encoded unique identifier of a row within a table.
SELECT ROWID, PreName, LastName FROM customer
The first 8 characters of the string represent the database identifier, the middle 8 characters the table identifier, and the last 8 characters the encoded physical record number.
Let’s select only DISTINCT records having the lowest ROWID.
SELECT MIN(ROWID), PreName, LastName FROM customer GROUP BY PreName, LastName
And then delete all records, that ar not part of our selection.
DELETE FROM customer WHERE ROWID NOT IN (SELECT MIN(ROWID) FROM customer GROUP BY PreName, LastName)
As you can see, this is quite simple. If you need more columns to identify your duplicates, just add them to the GROUP BY clause.
BTW: you can even GROUP BY a function like SOUNDEX or a self written KOELNER_PHONETIC (for German names, see here).
DELETE FROM customer WHERE ROWID NOT IN (SELECT MIN(ROWID) FROM customer GROUP BY SOUNDEX(TRIM(COALESCE(PreName,''))+TRIM(COALESCE(LastName,''))))