TAILOR: A record linkage toolbox

Mohamed G. Elfeky, Vassilios S. Verykios, Ahmed Elmagarmid

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

155 Citations (Scopus)

Abstract

Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive Record Linkage Toolbox named TAILOR. Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house developed and public domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. Results show that the proposed machine learning record linkage models outperform the existing ones both in accuracy and in performance.

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Editors: R Agrawal, K Dittrich, A Ngu
Pages: 17-28
Number of pages: 12
Publication status: Published - 1 Jan 2002
Externally published: Yes
Event: 18th International Conference on Data Engineering - San Jose, CA, United States
Duration: 26 Feb 2002 → 1 Mar 2002

Fingerprint

Cleaning
Learning systems
Data mining
Tuning

ASJC Scopus subject areas

  • Software
  • Engineering(all)
  • Engineering (miscellaneous)

Cite this

Elfeky, M. G., Verykios, V. S., & Elmagarmid, A. (2002). TAILOR: A record linkage toolbox. In R. Agrawal, K. Dittrich, & A. Ngu (Eds.), Proceedings - International Conference on Data Engineering (pp. 17-28)

@inproceedings{db39a717928e49f8899da7c5c994dcea,
title = "TAILOR: A record linkage toolbox",
abstract = "Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive Record Linkage Toolbox named TAILOR. Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house developed and public domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. Results show that the proposed machine learning record linkage models outperform the existing ones both in accuracy and in performance.",
author = "Elfeky, {Mohamed G.} and Verykios, {Vassilios S.} and Ahmed Elmagarmid",
year = "2002",
month = "1",
day = "1",
language = "English",
pages = "17--28",
editor = "R Agrawal and K Dittrich and A Ngu",
booktitle = "Proceedings - International Conference on Data Engineering",

}

TY - GEN

T1 - TAILOR

T2 - A record linkage toolbox

AU - Elfeky, Mohamed G.

AU - Verykios, Vassilios S.

AU - Elmagarmid, Ahmed

PY - 2002/1/1

Y1 - 2002/1/1

AB - Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive Record Linkage Toolbox named TAILOR. Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house developed and public domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. Results show that the proposed machine learning record linkage models outperform the existing ones both in accuracy and in performance.

UR - http://www.scopus.com/inward/record.url?scp=0036203458&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036203458&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:0036203458

SP - 17

EP - 28

BT - Proceedings - International Conference on Data Engineering

A2 - Agrawal, R

A2 - Dittrich, K

A2 - Ngu, A

ER -