finding duplicate records

Discussion:

finding duplicate records

(too old to reply)

r***@citidc.com

2007-03-19 03:49:23 UTC

I'm trying to setup a way to find possible duplicate or near-duplicate
records of contact people. I have a table of 50k records doing an
inner join with itself, something like this:

SELECT p1.fields, p2.fields
FROM table AS p1 INNER JOIN table AS p2 ON p1.lastname = p2.lastname
WHERE p1.id <> p2.id AND
(other criteria)

The id and name and other criteria fields are indexed as well as they
can be. It works fine as long as I add a LIMIT statement to keep it
down to just 10 or 20 records, but if I take this off, it runs on
forever -- so I can't even tell how many potentail records there will
be. If the table grows to 100k or 200k, it'll only get worse.

So my question is -- are there any better ways to go about this?

strawberry

2007-03-19 08:29:35 UTC

Post by r***@citidc.com
I'm trying to setup a way to find possible duplicate or near-duplicate
records of contact people. I have a table of 50k records doing an
SELECT p1.fields, p2.fields
FROM table AS p1 INNER JOIN table AS p2 ON p1.lastname = p2.lastname
WHERE p1.id <> p2.id AND
(other criteria)
The id and name and other criteria fields are indexed as well as they
can be. It works fine as long as I add a LIMIT statement to keep it
down to just 10 or 20 records, but if I take this off, it runs on
forever -- so I can't even tell how many potentail records there will
be. If the table grows to 100k or 200k, it'll only get worse.
So my question is -- are there any better ways to go about this?

I think '<' would be better than '<>'.

1 Reply
1 View
Permalink to this page
Disable enhanced parsing

Thread Navigation

r***@citidc.com 2007-03-19 03:49:23 UTC

strawberry 2007-03-19 08:29:35 UTC

about - legalese

Loading...