Identifying value mappings for data integration

An unsupervised approach

Jaewoo Kang, Dongwon Lee, Prasenjit Mitra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

The Web is a distributed network of information sources in which individual sources are created and maintained autonomously. Consequently, syntactic and semantic heterogeneity of data among sources abounds. Most current data cleaning solutions assume that data values referencing the same object bear some textual similarity. However, this assumption is often violated in practice: "Two-door front wheel drive" can be represented as "2DR-FWD" or "R2FD", or even as "CAR TYPE 3", in different data sources. To address this problem, we propose a novel two-step automated technique that exploits statistical dependency structures among objects, which are invariant to the tokens representing the objects. The algorithm achieved high accuracy in our empirical study, suggesting that it can be a useful addition to existing information integration techniques.
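
The abstract does not spell out the two steps, so the following Python sketch only illustrates the general idea of matching values through their statistical dependencies rather than their spelling. It is not the authors' algorithm. The function names, the two-column table layout, the brute-force search, and all tokens in the sample data are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def joint_distribution(rows):
    """Relative frequency of each (x, y) value pair in a two-column table."""
    counts = Counter(rows)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def best_value_mapping(rows_a, rows_b):
    """Brute-force search for the value mapping whose joint distributions agree best.

    Only co-occurrence frequencies are compared. The tokens themselves are
    never inspected, so opaque codes can still be matched.
    """
    p_a, p_b = joint_distribution(rows_a), joint_distribution(rows_b)
    xs_a = sorted({x for x, _ in rows_a})
    ys_a = sorted({y for _, y in rows_a})
    xs_b = sorted({x for x, _ in rows_b})
    ys_b = sorted({y for _, y in rows_b})
    if len(xs_a) != len(xs_b) or len(ys_a) != len(ys_b):
        raise ValueError("this sketch assumes equally sized value domains")

    best = None
    for perm_x in permutations(xs_b):
        map_x = dict(zip(xs_a, perm_x))
        for perm_y in permutations(ys_b):
            map_y = dict(zip(ys_a, perm_y))
            # Squared distance between the joint distributions under this mapping.
            cost = sum(
                (p_a.get((x, y), 0.0) - p_b.get((map_x[x], map_y[y]), 0.0)) ** 2
                for x in xs_a
                for y in ys_a
            )
            if best is None or cost < best[0]:
                best = (cost, map_x, map_y)
    return best[1], best[2]


# Hypothetical sample data: (body style, drive type) pairs from two sources
# that encode the same concepts with unrelated tokens.
rows_a = ([("2DR", "FWD")] * 5 + [("2DR", "RWD")] * 1
          + [("4DR", "FWD")] * 2 + [("4DR", "RWD")] * 4)
rows_b = ([("T1", "D9")] * 10 + [("T1", "D7")] * 2
          + [("T2", "D9")] * 4 + [("T2", "D7")] * 8)

map_x, map_y = best_value_mapping(rows_a, rows_b)
print(map_x, map_y)
```

The exhaustive search over permutations is only practical for very small value domains; it is used here because it keeps the token-invariance of the comparison easy to see.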

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 544-551
Number of pages: 8
Volume: 3806 LNCS
DOIs: https://doi.org/10.1007/11581062_46
Publication status: Published - 2005
Externally published: Yes
Event: 6th International Conference on Web Information Systems Engineering, WISE 2005 - New York, NY
Duration: 20 Nov 2005 to 22 Nov 2005

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3806 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 6th International Conference on Web Information Systems Engineering, WISE 2005
City: New York, NY
Period: 20/11/05 to 22/11/05

Fingerprint

Data integration
Information Storage and Retrieval
Syntactics
Cleaning
Wheels
Semantics
Information Services
Information Integration
Distributed Networks
Empirical Study
High Accuracy
Invariant
Object

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

Kang, J., Lee, D., & Mitra, P. (2005). Identifying value mappings for data integration: An unsupervised approach. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3806 LNCS, pp. 544-551). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 3806 LNCS). https://doi.org/10.1007/11581062_46

Kang, J, Lee, D & Mitra, P 2005, Identifying value mappings for data integration: An unsupervised approach. in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 3806 LNCS, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3806 LNCS, pp. 544-551, 6th International Conference on Web Information Systems Engineering, WISE 2005, New York, NY, 20/11/05. https://doi.org/10.1007/11581062_46
Kang J, Lee D, Mitra P. Identifying value mappings for data integration: An unsupervised approach. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 3806 LNCS. 2005. p. 544-551. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/11581062_46
Kang, Jaewoo ; Lee, Dongwon ; Mitra, Prasenjit. / Identifying value mappings for data integration : An unsupervised approach. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 3806 LNCS 2005. pp. 544-551 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{903f26967bd04df39cc4b3f5a8dc4d22,
title = "Identifying value mappings for data integration: An unsupervised approach",
abstract = "The Web is a distributed network of information sources where the individual sources are autonomously created and maintained. Consequently, syntactic and semantic heterogeneity of data among sources abound. Most of the current data cleaning solutions assume that the data values referencing the same object bear some textual similarity. However, this assumption is often violated in practice. {"}Two-door front wheel drive{"} can be represented as {"}2DR-FWD{"} or {"}R2FD{"}, or even as {"}CAR TYPE 3{"} in different data sources. To address this problem, we propose a novel two-step automated technique that exploits statistical dependency structures among objects which is invariant to the tokens representing the objects. The algorithm achieved a high accuracy in our empirical study, suggesting that it can be a useful addition to the existing information integration techniques.",
author = "Jaewoo Kang and Dongwon Lee and Prasenjit Mitra",
year = "2005",
doi = "10.1007/11581062_46",
language = "English",
isbn = "3540300171",
volume = "3806 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "544--551",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Identifying value mappings for data integration

T2 - An unsupervised approach

AU - Kang, Jaewoo

AU - Lee, Dongwon

AU - Mitra, Prasenjit

PY - 2005

Y1 - 2005

AB - The Web is a distributed network of information sources where the individual sources are autonomously created and maintained. Consequently, syntactic and semantic heterogeneity of data among sources abound. Most of the current data cleaning solutions assume that the data values referencing the same object bear some textual similarity. However, this assumption is often violated in practice. "Two-door front wheel drive" can be represented as "2DR-FWD" or "R2FD", or even as "CAR TYPE 3" in different data sources. To address this problem, we propose a novel two-step automated technique that exploits statistical dependency structures among objects which is invariant to the tokens representing the objects. The algorithm achieved a high accuracy in our empirical study, suggesting that it can be a useful addition to the existing information integration techniques.

UR - http://www.scopus.com/inward/record.url?scp=33744788355&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33744788355&partnerID=8YFLogxK

U2 - 10.1007/11581062_46

DO - 10.1007/11581062_46

M3 - Conference contribution

SN - 3540300171

SN - 9783540300175

VL - 3806 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 544

EP - 551

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -