Robust record linkage blocking using suffix arrays and bloom filters

Timothy De Vries, Hui Ke, Sanjay Chawla, Peter Christen

Research output: Contribution to journal › Article

17 Citations (Scopus)

Abstract

Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, quadratic scalability for the brute-force approach of comparing all possible pairs of records necessitates the design of appropriate indexing or blocking techniques. The aim of these techniques is to cheaply remove candidate record pairs that are unlikely to match. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in much higher accuracy while retaining the high scalability of the base suffix array method. Similar suffixes are grouped efficiently using a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlight the importance of using efficient indexing and blocking in real-world applications where datasets contain millions of records. We extend our disk-based methods with the capability to utilise main-memory-based storage to construct Bloom filters, which we have found to cause significant speedup by reducing the number of costly database queries by up to 70% in real data. We give practical implementation details and show how Bloom filters can be easily applied to suffix-array-based indexing.
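The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of `difflib.SequenceMatcher` as the similarity measure, the 0.7 threshold, the window of two neighbouring suffixes, the in-memory dictionary standing in for the disk-based index, and the Bloom filter parameters are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher


def suffixes(key, min_len=3):
    """Return every suffix of `key` that has at least `min_len` characters."""
    return [key[i:] for i in range(len(key) - min_len + 1)]


def build_index(records, min_len=3, max_block=5):
    """Map each suffix to the ids of the records whose blocking key ends with it."""
    index = defaultdict(set)
    for rec_id, key in records.items():
        for suffix in suffixes(key, min_len):
            index[suffix].add(rec_id)
    # Very common suffixes would give huge blocks, so they are dropped.
    return {s: ids for s, ids in index.items() if len(ids) <= max_block}


def group_similar(index, threshold=0.7):
    """Slide a window of two over the sorted suffixes and merge similar neighbours."""
    merged = {s: set(ids) for s, ids in index.items()}
    ordered = sorted(merged)
    for left, right in zip(ordered, ordered[1:]):
        # Assumed similarity measure; any string similarity could be swapped in.
        if SequenceMatcher(None, left, right).ratio() >= threshold:
            merged[left] = merged[right] = merged[left] | merged[right]
    return merged


class BloomFilter:
    """Small in-memory Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def candidates(key, index, bloom, min_len=3):
    """Collect candidate record ids, skipping index lookups the filter rules out."""
    found, lookups = set(), 0
    for suffix in suffixes(key, min_len):
        if suffix in bloom:  # only now pay for the (simulated) database query
            lookups += 1
            found |= index.get(suffix, set())
    return found, lookups


# Hypothetical blocking keys (for example, surnames) keyed by record id.
records = {1: "smith", 2: "smyth", 3: "jones", 4: "johnes"}
index = build_index(records)
grouped = group_similar(index)

bloom = BloomFilter()
for suffix in grouped:
    bloom.add(suffix)

print(candidates("smith", grouped, bloom))
print(candidates("xyzzy", grouped, bloom))
```

In this sketch "smith" and "smyth" share no suffix of three or more characters, so the plain index keeps them in separate blocks; merging neighbouring similar suffixes in sorted order places them in the same block. The Bloom filter is consulted before each lookup, so suffixes that were never indexed rarely reach the index at all, which is the kind of query saving the abstract describes.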

Original language: English
Article number: 9
Journal: ACM Transactions on Knowledge Discovery from Data
Volume: 5
Issue number: 2
DOI: 10.1145/1921632.1921635
Publication status: Published - Feb 2011
Externally published: Yes

Fingerprint

Scalability
Data integration
Merging
Data storage equipment
Costs
Experiments

Keywords

  • Blocking
  • Record linkage
  • Suffix arrays

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Robust record linkage blocking using suffix arrays and bloom filters. / De Vries, Timothy; Ke, Hui; Chawla, Sanjay; Christen, Peter.

In: ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 2, 9, 02.2011.


@article{6c677ae5e27e4d90ba2b015360f52e68,
title = "Robust record linkage blocking using suffix arrays and bloom filters",
abstract = "Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, quadratic scalability for the brute-force approach of comparing all possible pairs of records necessitates the design of appropriate indexing or blocking techniques. The aim of these techniques is to cheaply remove candidate record pairs that are unlikely to match. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in much higher accuracy while retaining the high scalability of the base suffix array method. Similar suffixes are grouped efficiently using a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlight the importance of using efficient indexing and blocking in real-world applications where datasets contain millions of records. We extend our disk-based methods with the capability to utilise main-memory-based storage to construct Bloom filters, which we have found to cause significant speedup by reducing the number of costly database queries by up to 70{\%} in real data. We give practical implementation details and show how Bloom filters can be easily applied to suffix-array-based indexing.",
keywords = "Blocking, Record linkage, Suffix arrays",
author = "{De Vries}, Timothy and Hui Ke and Sanjay Chawla and Peter Christen",
year = "2011",
month = "2",
doi = "10.1145/1921632.1921635",
language = "English",
volume = "5",
journal = "ACM Transactions on Knowledge Discovery from Data",
issn = "1556-4681",
publisher = "Association for Computing Machinery (ACM)",
number = "2",

}

TY - JOUR

T1 - Robust record linkage blocking using suffix arrays and bloom filters

AU - De Vries, Timothy

AU - Ke, Hui

AU - Chawla, Sanjay

AU - Christen, Peter

PY - 2011/2

Y1 - 2011/2

AB - Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, quadratic scalability for the brute-force approach of comparing all possible pairs of records necessitates the design of appropriate indexing or blocking techniques. The aim of these techniques is to cheaply remove candidate record pairs that are unlikely to match. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in much higher accuracy while retaining the high scalability of the base suffix array method. Similar suffixes are grouped efficiently using a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlight the importance of using efficient indexing and blocking in real-world applications where datasets contain millions of records. We extend our disk-based methods with the capability to utilise main-memory-based storage to construct Bloom filters, which we have found to cause significant speedup by reducing the number of costly database queries by up to 70% in real data. We give practical implementation details and show how Bloom filters can be easily applied to suffix-array-based indexing.

KW - Blocking

KW - Record linkage

KW - Suffix arrays

UR - http://www.scopus.com/inward/record.url?scp=79952543891&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79952543891&partnerID=8YFLogxK

U2 - 10.1145/1921632.1921635

DO - 10.1145/1921632.1921635

M3 - Article

VL - 5

JO - ACM Transactions on Knowledge Discovery from Data

JF - ACM Transactions on Knowledge Discovery from Data

SN - 1556-4681

IS - 2

M1 - 9

ER -