Robust record linkage blocking using suffix arrays

Timothy De Vries, Hui Ke, Sanjay Chawla, Peter Christen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

32 Citations (Scopus)

Abstract

Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, the quadratic cost of the brute-force approach, which compares every pair of records, necessitates the design of appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in much higher accuracy while retaining the high scalability of the base suffix array method. Similar suffixes are grouped efficiently using a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlight the importance of using efficient indexing and blocking in real-world applications where data sets contain millions of records.

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 305-314
Number of pages: 10
DOI: https://doi.org/10.1145/1645953.1645994
Publication status: Published - 2009
Externally published: Yes
Event: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009 - Hong Kong, China
Duration: 2 Nov 2009 to 6 Nov 2009

Other

Other: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Country: China
City: Hong Kong
Period: 2/11/09 to 6/11/09

Fingerprint

Record linkage
Indexing
Scalability
Grouping
Sliding window
Costs
Data integration
Experiment
Merging
Data base

Keywords

  • Blocking
  • Record linkage
  • Suffix arrays

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

De Vries, T., Ke, H., Chawla, S., & Christen, P. (2009). Robust record linkage blocking using suffix arrays. In International Conference on Information and Knowledge Management, Proceedings (pp. 305-314) https://doi.org/10.1145/1645953.1645994

Robust record linkage blocking using suffix arrays. / De Vries, Timothy; Ke, Hui; Chawla, Sanjay; Christen, Peter.

International Conference on Information and Knowledge Management, Proceedings. 2009. p. 305-314.

De Vries, T, Ke, H, Chawla, S & Christen, P 2009, Robust record linkage blocking using suffix arrays. in International Conference on Information and Knowledge Management, Proceedings. pp. 305-314, ACM 18th International Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, 2/11/09. https://doi.org/10.1145/1645953.1645994
De Vries T, Ke H, Chawla S, Christen P. Robust record linkage blocking using suffix arrays. In International Conference on Information and Knowledge Management, Proceedings. 2009. p. 305-314 https://doi.org/10.1145/1645953.1645994
De Vries, Timothy ; Ke, Hui ; Chawla, Sanjay ; Christen, Peter. / Robust record linkage blocking using suffix arrays. International Conference on Information and Knowledge Management, Proceedings. 2009. pp. 305-314
@inproceedings{0b0f70ed3ad442a5b82694481b590067,
title = "Robust record linkage blocking using suffix arrays",
abstract = "Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, a quadratic scalability for the brute force approach necessitates the design of appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlights the importance of using efficient indexing and blocking in real world applications where data sets contain millions of records.",
keywords = "Blocking, Record linkage, Suffix arrays",
author = "{De Vries}, Timothy and Hui Ke and Sanjay Chawla and Peter Christen",
year = "2009",
doi = "10.1145/1645953.1645994",
language = "English",
isbn = "9781605585123",
pages = "305--314",
booktitle = "International Conference on Information and Knowledge Management, Proceedings",
}

TY - GEN
T1 - Robust record linkage blocking using suffix arrays
AU - De Vries, Timothy
AU - Ke, Hui
AU - Chawla, Sanjay
AU - Christen, Peter
PY - 2009
Y1 - 2009
N2 - Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, a quadratic scalability for the brute force approach necessitates the design of appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlights the importance of using efficient indexing and blocking in real world applications where data sets contain millions of records.
KW - Blocking
KW - Record linkage
KW - Suffix arrays
UR - http://www.scopus.com/inward/record.url?scp=74549152150&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=74549152150&partnerID=8YFLogxK
U2 - 10.1145/1645953.1645994
DO - 10.1145/1645953.1645994
M3 - Conference contribution
AN - SCOPUS:74549152150
SN - 9781605585123
SP - 305
EP - 314
BT - International Conference on Information and Knowledge Management, Proceedings
ER -