Automating the approximate record-matching process

Vassilios S. Verykios, Ahmed Elmagarmid, Elias N. Houstis

Research output: Contribution to journal (article)

55 Citations (Scopus)

Abstract

Data quality has many dimensions, one of which is accuracy. Accuracy is usually compromised by errors accidentally or intentionally introduced into a database system. These errors result in inconsistent, incomplete, or erroneous data elements. For example, a small variation in the representation of a data object produces a unique instantiation of the object being represented. To improve the accuracy of the data stored in a database system, we need to compare them either with their real-world counterparts or with other data stored in the same or a different system. In this paper, we address the problem of matching records that refer to the same entity by computing their similarity. Exact record matching has limited applicability in this context, since even simple errors such as character transpositions cannot be captured in the record-linking process. Our methodology deploys advanced data-mining techniques to deal with the high computational and inferential complexity of approximate record matching.
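
The abstract's point about character transpositions can be illustrated with a minimal Python sketch. It is not the authors' methodology, which the abstract describes only as using data-mining techniques. It assumes a restricted Damerau-Levenshtein (optimal string alignment) distance as the similarity measure; the field names, the sample records, and the 0.85 threshold are hypothetical choices made for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Allowed edits: insertion, deletion, substitution, and transposition
    of two adjacent characters, each with cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "ki" versus "ik".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    """Map the edit distance to a score in [0, 1]; 1.0 means identical."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest


def records_match(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    """Approximate match: mean field similarity at or above a threshold.

    The threshold is an assumed value, not one taken from the paper.
    """
    fields = r1.keys() & r2.keys()
    if not fields:
        return False
    mean = sum(similarity(str(r1[f]), str(r2[f])) for f in fields) / len(fields)
    return mean >= threshold


# Hypothetical records that differ by one transposition in the name field.
r1 = {"name": "Verykios", "city": "Lafayette"}
r2 = {"name": "Veryikos", "city": "Lafayette"}

exact = r1 == r2                     # exact comparison misses the pair
approximate = records_match(r1, r2)  # similarity-based comparison links it
```

In this sketch the exact comparison treats the two records as different entities, while the similarity-based comparison links them because the name fields are one transposition apart. A fixed threshold is the simplest possible decision rule; the inferential complexity mentioned in the abstract arises precisely because choosing such rules by hand does not scale.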

Original language: English
Pages (from-to): 83-98
Number of pages: 16
Journal: Information Sciences
Volume: 126
Issue number: 1
DOIs: 10.1016/S0020-0255(00)00013-X
Publication status: Published - 1 Jul 2000
Externally published: Yes

Fingerprint

Database Systems
Data mining
Transposition
Data Quality
Inconsistent
One Dimension
Linking
Methodology
Computing
Data base
Object
Context
Similarity
Character

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Information Systems
  • Information Systems and Management
  • Statistics, Probability and Uncertainty
  • Electrical and Electronic Engineering
  • Statistics and Probability

Cite this

Automating the approximate record-matching process. / Verykios, Vassilios S.; Elmagarmid, Ahmed; Houstis, Elias N.

In: Information Sciences, Vol. 126, No. 1, 01.07.2000, p. 83-98.

Verykios, Vassilios S. ; Elmagarmid, Ahmed ; Houstis, Elias N. / Automating the approximate record-matching process. In: Information Sciences. 2000 ; Vol. 126, No. 1. pp. 83-98.
@article{efe18d51e7804a9caf834f033d8cf2d7,
title = "Automating the approximate record-matching process",
abstract = "Data quality has many dimensions, one of which is accuracy. Accuracy is usually compromised by errors accidentally or intentionally introduced into a database system. These errors result in inconsistent, incomplete, or erroneous data elements. For example, a small variation in the representation of a data object produces a unique instantiation of the object being represented. To improve the accuracy of the data stored in a database system, we need to compare them either with their real-world counterparts or with other data stored in the same or a different system. In this paper, we address the problem of matching records that refer to the same entity by computing their similarity. Exact record matching has limited applicability in this context, since even simple errors such as character transpositions cannot be captured in the record-linking process. Our methodology deploys advanced data-mining techniques to deal with the high computational and inferential complexity of approximate record matching.",
author = "Verykios, {Vassilios S.} and Ahmed Elmagarmid and Houstis, {Elias N.}",
year = "2000",
month = "7",
day = "1",
doi = "10.1016/S0020-0255(00)00013-X",
language = "English",
volume = "126",
pages = "83--98",
journal = "Information Sciences",
issn = "0020-0255",
publisher = "Elsevier Inc.",
number = "1",

}

TY - JOUR
T1 - Automating the approximate record-matching process
AU - Verykios, Vassilios S.
AU - Elmagarmid, Ahmed
AU - Houstis, Elias N.
PY - 2000/7/1
Y1 - 2000/7/1
AB - Data quality has many dimensions, one of which is accuracy. Accuracy is usually compromised by errors accidentally or intentionally introduced into a database system. These errors result in inconsistent, incomplete, or erroneous data elements. For example, a small variation in the representation of a data object produces a unique instantiation of the object being represented. To improve the accuracy of the data stored in a database system, we need to compare them either with their real-world counterparts or with other data stored in the same or a different system. In this paper, we address the problem of matching records that refer to the same entity by computing their similarity. Exact record matching has limited applicability in this context, since even simple errors such as character transpositions cannot be captured in the record-linking process. Our methodology deploys advanced data-mining techniques to deal with the high computational and inferential complexity of approximate record matching.
UR - http://www.scopus.com/inward/record.url?scp=0034228352&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0034228352&partnerID=8YFLogxK
U2 - 10.1016/S0020-0255(00)00013-X
DO - 10.1016/S0020-0255(00)00013-X
M3 - Article
VL - 126
SP - 83
EP - 98
JO - Information Sciences
JF - Information Sciences
SN - 0020-0255
IS - 1
ER -