Discussion:
finding duplicate records
(too old to reply)
r***@citidc.com
2007-03-19 03:49:23 UTC
Permalink
I'm trying to setup a way to find possible duplicate or near-duplicate
records of contact people. I have a table of 50k records doing an
inner join with itself, something like this:

SELECT p1.fields, p2.fields
FROM table AS p1 INNER JOIN table AS p2 ON p1.lastname = p2.lastname
WHERE p1.id <> p2.id AND
(other criteria)

The id and name and other criteria fields are indexed as well as they
can be. It works fine as long as I add a LIMIT statement to keep it
down to just 10 or 20 records, but if I take this off, it runs on
forever -- so I can't even tell how many potentail records there will
be. If the table grows to 100k or 200k, it'll only get worse.

So my question is -- are there any better ways to go about this?
strawberry
2007-03-19 08:29:35 UTC
Permalink
Post by r***@citidc.com
I'm trying to setup a way to find possible duplicate or near-duplicate
records of contact people. I have a table of 50k records doing an
SELECT p1.fields, p2.fields
FROM table AS p1 INNER JOIN table AS p2 ON p1.lastname = p2.lastname
WHERE p1.id <> p2.id AND
(other criteria)
The id and name and other criteria fields are indexed as well as they
can be. It works fine as long as I add a LIMIT statement to keep it
down to just 10 or 20 records, but if I take this off, it runs on
forever -- so I can't even tell how many potentail records there will
be. If the table grows to 100k or 200k, it'll only get worse.
So my question is -- are there any better ways to go about this?
I think '<' would be better than '<>'.

Loading...