Duplicate record detection: A survey

Ahmed Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios

Research output: Contribution to journal › Article

1200 Citations (Scopus)

Abstract

Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
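As an illustration of the kind of field-level similarity metric the abstract refers to, the following is a minimal sketch of character-level edit distance (Levenshtein), normalized into a similarity score and averaged over shared fields. The function names, the normalization by the longer string, the averaging rule, and the 0.8 threshold are all assumptions made for this example; they are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer string is an assumption."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty fields are treated as identical
    return 1.0 - edit_distance(a, b) / longest


def likely_duplicates(rec1: dict, rec2: dict, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: mean similarity over shared fields."""
    scores = [field_similarity(rec1[k], rec2[k]) for k in rec1 if k in rec2]
    return bool(scores) and sum(scores) / len(scores) >= threshold


r1 = {"name": "Jon Smith", "city": "Boston"}
r2 = {"name": "John Smith", "city": "Boston"}
print(likely_duplicates(r1, r2))
```

A plain average treats every field as equally informative, which is rarely true in practice; weighting fields or learning the decision rule from labeled pairs are common refinements.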

Original language: English
Pages (from-to): 1-16
Number of pages: 16
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 19
Issue number: 1
DOI: 10.1109/TKDE.2007.250581
Publication status: Published - 1 Jan 2007
Externally published: Yes

Fingerprint

Transcription
Scalability

Keywords

  • Data cleaning
  • Data deduplication
  • Data integration
  • Database hardening
  • Duplicate detection
  • Entity matching
  • Entity resolution
  • Fuzzy duplicate detection
  • Identity uncertainty
  • Instance identification
  • Name matching
  • Record linkage

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Electrical and Electronic Engineering
  • Artificial Intelligence
  • Information Systems

Cite this

Duplicate record detection: A survey. / Elmagarmid, Ahmed; Ipeirotis, Panagiotis G.; Verykios, Vassilios S.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 1, 01.01.2007, p. 1-16.

@article{770d5ae3b51e41639e21ed0f5ee4d820,
title = "Duplicate record detection: A survey",
abstract = "Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.",
keywords = "Data cleaning, Data deduplication, Data integration, Database hardening, Duplicate detection, Entity matching, Entity resolution, Fuzzy duplicate detection, Identity uncertainty, Instance identification, Name matching, Record linkage",
author = "Ahmed Elmagarmid and Ipeirotis, {Panagiotis G.} and Verykios, {Vassilios S.}",
year = "2007",
month = "1",
day = "1",
doi = "10.1109/TKDE.2007.250581",
language = "English",
volume = "19",
pages = "1--16",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "1",

}

TY  - JOUR
T1  - Duplicate record detection
T2  - A survey
AU  - Elmagarmid, Ahmed
AU  - Ipeirotis, Panagiotis G.
AU  - Verykios, Vassilios S.
PY  - 2007/1/1
Y1  - 2007/1/1
AB  - Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
KW  - Data cleaning
KW  - Data deduplication
KW  - Data integration
KW  - Database hardening
KW  - Duplicate detection
KW  - Entity matching
KW  - Entity resolution
KW  - Fuzzy duplicate detection
KW  - Identity uncertainty
KW  - Instance identification
KW  - Name matching
KW  - Record linkage
UR  - http://www.scopus.com/inward/record.url?scp=33845667955&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=33845667955&partnerID=8YFLogxK
U2  - 10.1109/TKDE.2007.250581
DO  - 10.1109/TKDE.2007.250581
M3  - Article
AN  - SCOPUS:33845667955
VL  - 19
SP  - 1
EP  - 16
JO  - IEEE Transactions on Knowledge and Data Engineering
JF  - IEEE Transactions on Knowledge and Data Engineering
SN  - 1041-4347
IS  - 1
ER  -