Query-time record linkage and fusion over Web databases

El Kindi Rezig, Eduard C. Dragut, Mourad Ouzzani, Ahmed Elmagarmid

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

9 Citations (Scopus)

Abstract

Data-intensive Web applications usually require integrating data from Web sources at query time. The sources may refer to the same real-world entity in different ways and some may even provide outdated or erroneous data. An important task is to recognize and merge the records that refer to the same real-world entity at query time. Most existing duplicate detection and fusion techniques work in the off-line setting and do not meet the online constraint. There are at least two aspects that differentiate online duplicate detection and fusion from its off-line counterpart. (i) The latter assumes that the entire data set is available, while the former cannot make such an assumption. (ii) Several query submissions may be required to compute the 'ideal' representation of an entity in the online setting. This paper presents a general framework for the online setting based on an iterative record-based caching technique. A set of frequently requested records is deduplicated off-line and cached for future reference. Newly arriving records in response to a query are deduplicated jointly with the records in the cache, presented to the user, and appended to the cache. Experiments with real and synthetic data show the benefit of our solution over traditional record linkage techniques applied to an online setting.
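
The caching loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `similarity` matcher, the `fuse` rule (most frequent value per field), the `RecordCache` class, and the 0.85 threshold are all hypothetical choices made only to show the flow of deduplicating new records jointly with the cache, returning fused entities, and appending the new records to the cache.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: dict, b: dict) -> float:
    """Hypothetical matcher: mean string similarity over the fields both records share."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    scores = [
        SequenceMatcher(None, str(a[k]).lower(), str(b[k]).lower()).ratio()
        for k in shared
    ]
    return sum(scores) / len(scores)


def fuse(cluster: list) -> dict:
    """Hypothetical fusion rule: per field, keep the most frequent value.

    Ties go to the value seen first, i.e. the one cached earliest.
    """
    fused = {}
    for key in {k for rec in cluster for k in rec}:
        counts = Counter(rec[key] for rec in cluster if key in rec)
        fused[key] = max(counts, key=counts.get)
    return fused


class RecordCache:
    """Cache of record clusters; each cluster holds records judged to be one entity."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold  # assumed match threshold, not from the paper
        self.clusters = []

    def _assign(self, record: dict) -> list:
        # Compare the new record with every cached cluster and keep the best match.
        best, best_score = None, 0.0
        for cluster in self.clusters:
            score = max(similarity(record, member) for member in cluster)
            if score > best_score:
                best, best_score = cluster, score
        # No sufficiently similar cluster: the record starts a new entity.
        if best is None or best_score < self.threshold:
            best = []
            self.clusters.append(best)
        best.append(record)  # append the new record to the cache
        return best

    def answer(self, query_results: list) -> list:
        """Deduplicate new records jointly with the cache and return fused entities."""
        touched = []
        for record in query_results:
            cluster = self._assign(record)
            if not any(cluster is seen for seen in touched):
                touched.append(cluster)
        return [fuse(cluster) for cluster in touched]
```

In this sketch, calling `answer` once with a seed batch stands in for the off-line step that pre-populates the cache; each later call plays the role of one query submission, so the fused representation of an entity can change as more records for it arrive.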

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Publisher: IEEE Computer Society
Pages: 42-53
Number of pages: 12
Volume: 2015-May
ISBN (Print): 9781479979639
DOI: https://doi.org/10.1109/ICDE.2015.7113271
Publication status: Published - 26 May 2015
Event: 2015 31st IEEE International Conference on Data Engineering, ICDE 2015 - Seoul, Korea, Republic of
Duration: 13 Apr 2015 → 17 Apr 2015

Other

Other: 2015 31st IEEE International Conference on Data Engineering, ICDE 2015
Country: Korea, Republic of
City: Seoul
Period: 13/4/15 → 17/4/15

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Rezig, E. K., Dragut, E. C., Ouzzani, M., & Elmagarmid, A. (2015). Query-time record linkage and fusion over Web databases. In Proceedings - International Conference on Data Engineering (Vol. 2015-May, pp. 42-53). [7113271] IEEE Computer Society. https://doi.org/10.1109/ICDE.2015.7113271

@inproceedings{0b6357dd7e98446fb68290e378dc85ab,
title = "Query-time record linkage and fusion over Web databases",
abstract = "Data-intensive Web applications usually require integrating data from Web sources at query time. The sources may refer to the same real-world entity in different ways and some may even provide outdated or erroneous data. An important task is to recognize and merge the records that refer to the same real world entity at query time. Most existing duplicate detection and fusion techniques work in the off-line setting and do not meet the online constraint. There are at least two aspects that differentiate online duplicate detection and fusion from its off-line counterpart. (i) The latter assumes that the entire data is available, while the former cannot make such an assumption. (ii) Several query submissions may be required to compute the 'ideal' representation of an entity in the online setting. This paper presents a general framework for the online setting based on an iterative record-based caching technique. A set of frequently requested records is deduplicated off-line and cached for future reference. Newly arriving records in response to a query are deduplicated jointly with the records in the cache, presented to the user and appended to the cache. Experiments with real and synthetic data show the benefit of our solution over traditional record linkage techniques applied to an online setting.",
author = "Rezig, {El Kindi} and Dragut, {Eduard C.} and Ouzzani, Mourad and Elmagarmid, Ahmed",
year = "2015",
month = "5",
day = "26",
doi = "10.1109/ICDE.2015.7113271",
language = "English",
isbn = "9781479979639",
volume = "2015-May",
pages = "42--53",
booktitle = "Proceedings - International Conference on Data Engineering",
publisher = "IEEE Computer Society",

}

TY - GEN

T1 - Query-time record linkage and fusion over Web databases

AU - Rezig, El Kindi

AU - Dragut, Eduard C.

AU - Ouzzani, Mourad

AU - Elmagarmid, Ahmed

PY - 2015/5/26

Y1 - 2015/5/26

AB - Data-intensive Web applications usually require integrating data from Web sources at query time. The sources may refer to the same real-world entity in different ways and some may even provide outdated or erroneous data. An important task is to recognize and merge the records that refer to the same real world entity at query time. Most existing duplicate detection and fusion techniques work in the off-line setting and do not meet the online constraint. There are at least two aspects that differentiate online duplicate detection and fusion from its off-line counterpart. (i) The latter assumes that the entire data is available, while the former cannot make such an assumption. (ii) Several query submissions may be required to compute the 'ideal' representation of an entity in the online setting. This paper presents a general framework for the online setting based on an iterative record-based caching technique. A set of frequently requested records is deduplicated off-line and cached for future reference. Newly arriving records in response to a query are deduplicated jointly with the records in the cache, presented to the user and appended to the cache. Experiments with real and synthetic data show the benefit of our solution over traditional record linkage techniques applied to an online setting.

UR - http://www.scopus.com/inward/record.url?scp=84940834271&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84940834271&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2015.7113271

DO - 10.1109/ICDE.2015.7113271

M3 - Conference contribution

AN - SCOPUS:84940834271

SN - 9781479979639

VL - 2015-May

SP - 42

EP - 53

BT - Proceedings - International Conference on Data Engineering

PB - IEEE Computer Society

ER -