Unsupervised string transformation learning for entity consolidation

Dong Deng, Wenbo Tao, Ziawasch Abedjan, Ahmed Elmagarmid, Ihab F. Ilyas, Guoliang Li, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

1 Citation (Scopus)

Abstract

Data integration has been a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation. It takes a collection of clusters of duplicate records as input and produces a single 'golden record' for each cluster, which contains the canonical value for each attribute. Truth discovery and data fusion methods as well as Master Data Management (MDM) systems can be used for entity consolidation. However, to achieve better results, the variant values (i.e., values that are logically the same but formatted differently) in the clusters need to be consolidated before applying these methods. For this purpose, we propose a data-driven method to standardize the variant values based on two observations: (1) the variant values can usually be transformed to the same representation (e.g., 'Mary Lee' and 'Lee, Mary') and (2) the same transformation often appears repeatedly across different clusters (e.g., transpose the first and last name). Our approach first uses an unsupervised method to generate groups of value pairs that can be transformed in the same way. Then the groups are presented to a human for verification and the approved ones are used to standardize the data. On a real-world dataset with 17,497 records, our method achieved 75% recall and 99.5% precision in standardizing variant values by asking a human 100 yes/no questions, which completely outperformed a state-of-the-art data wrangling tool.
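The two observations in the abstract can be illustrated with a toy Python sketch. It is not the authors' algorithm: it only handles transformations that reorder words and change separators, and the names `shape`, `signature`, `group_pairs`, and `apply_signature` are hypothetical helpers introduced here for illustration. Each pair of values inside a cluster is reduced to a signature describing how one value is built from the words of the other, and pairs from all clusters that share a signature are grouped so that a human could approve or reject each group with a single yes/no answer.

```python
import re
from collections import defaultdict
from itertools import permutations

# Split a string into alternating word and non-word pieces.
PIECES = re.compile(r"\w+|\W+")


def shape(value):
    """Format of a value: words become 'w', separators are kept verbatim."""
    return tuple("w" if re.match(r"\w", p) else p for p in PIECES.findall(value))


def signature(src, dst):
    """Describe how dst is built from the words of src, or None if it is not."""
    words = re.findall(r"\w+", src)
    if len(set(words)) != len(words):
        return None  # repeated words make positions ambiguous; the toy skips them
    program = []
    for piece in PIECES.findall(dst):
        if piece in words:
            program.append(("word", words.index(piece)))
        elif re.match(r"\w", piece):
            return None  # dst contains a word that src lacks
        else:
            program.append(("lit", piece))
    return (shape(src), tuple(program))


def group_pairs(clusters):
    """Group value pairs from all clusters by shared signature, largest first."""
    groups = defaultdict(list)
    for cluster in clusters:
        for src, dst in permutations(sorted(set(cluster)), 2):
            sig = signature(src, dst)
            if sig is not None:
                groups[sig].append((src, dst))
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)


def apply_signature(sig, value):
    """Rewrite value with an approved signature; None if the format differs."""
    src_shape, program = sig
    if shape(value) != src_shape:
        return None
    words = re.findall(r"\w+", value)
    return "".join(words[arg] if op == "word" else arg for op, arg in program)


# Made-up example clusters of duplicate values (not data from the paper).
clusters = [
    ["Mary Lee", "Lee, Mary"],
    ["Smith, James", "James Smith"],
    ["Wang, Li", "Li Wang", "L. Wang"],
]

for sig, pairs in group_pairs(clusters):
    # Each group would be shown to a human as one yes/no question.
    print(len(pairs), "pairs, e.g.", pairs[0])
```

In this sketch, approving the group that contains ('Mary Lee', 'Lee, Mary') would let `apply_signature` rewrite any other two-word value with the same format. A variant such as 'L. Wang' falls into no group, because abbreviation is outside what this simplified signature can express.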

Original language: English
Title of host publication: Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019
Publisher: IEEE Computer Society
Pages: 196-207
Number of pages: 12
ISBN (Electronic): 9781538674741
DOI: https://doi.org/10.1109/ICDE.2019.00026
Publication status: Published - 1 Apr 2019
Event: 35th IEEE International Conference on Data Engineering, ICDE 2019 - Macau, China
Duration: 8 Apr 2019 to 11 Apr 2019

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2019-April
ISSN (Print): 1084-4627

Conference

Conference: 35th IEEE International Conference on Data Engineering, ICDE 2019
Country: China
City: Macau
Period: 8/4/19 to 11/4/19

Fingerprint

Data integration
Consolidation
Information management
Data fusion

Keywords

  • Data editing
  • Data integration
  • Entity consolidation
  • Program synthesis
  • String transformation

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this

Deng, D., Tao, W., Abedjan, Z., Elmagarmid, A., Ilyas, I. F., Li, G., ... Tang, N. (2019). Unsupervised string transformation learning for entity consolidation. In Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019 (pp. 196-207). [8731550] (Proceedings - International Conference on Data Engineering; Vol. 2019-April). IEEE Computer Society. https://doi.org/10.1109/ICDE.2019.00026

@inproceedings{7b6480b19af043c0ab05667894595f63,
title = "Unsupervised string transformation learning for entity consolidation",
abstract = "Data integration has been a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation. It takes a collection of clusters of duplicate records as input and produces a single 'golden record' for each cluster, which contains the canonical value for each attribute. Truth discovery and data fusion methods as well as Master Data Management (MDM) systems can be used for entity consolidation. However, to achieve better results, the variant values (i.e., values that are logically the same with different formats) in the clusters need to be consolidated before applying these methods. For this purpose, we propose a data-driven method to standardize the variant values based on two observations: (1) the variant values usually can be transformed to the same representation (e.g., 'Mary Lee' and 'Lee, Mary') and (2) the same transformation often appears repeatedly across different clusters (e.g., transpose the first and last name). Our approach first uses an unsupervised method to generate groups of value pairs that can be transformed in the same way. Then the groups are presented to a human for verification and the approved ones are used to standardize the data. In a real-world dataset with 17,497 records, our method achieved 75{\%} recall and 99.5{\%} precision in standardizing variant values by asking a human 100 yes/no questions, which completely outperformed a state of the art data wrangling tool.",
keywords = "Data editing, Data integration, Entity consolidation, Program synthesis, String transformation",
author = "Dong Deng and Wenbo Tao and Ziawasch Abedjan and Ahmed Elmagarmid and Ilyas, {Ihab F.} and Guoliang Li and Samuel Madden and Mourad Ouzzani and Michael Stonebraker and Nan Tang",
year = "2019",
month = "4",
day = "1",
doi = "10.1109/ICDE.2019.00026",
language = "English",
series = "Proceedings - International Conference on Data Engineering",
publisher = "IEEE Computer Society",
pages = "196--207",
booktitle = "Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019",

}

TY - GEN
T1 - Unsupervised string transformation learning for entity consolidation
AU - Deng, Dong
AU - Tao, Wenbo
AU - Abedjan, Ziawasch
AU - Elmagarmid, Ahmed
AU - Ilyas, Ihab F.
AU - Li, Guoliang
AU - Madden, Samuel
AU - Ouzzani, Mourad
AU - Stonebraker, Michael
AU - Tang, Nan
PY - 2019/4/1
Y1 - 2019/4/1
N2 - Data integration has been a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation. It takes a collection of clusters of duplicate records as input and produces a single 'golden record' for each cluster, which contains the canonical value for each attribute. Truth discovery and data fusion methods as well as Master Data Management (MDM) systems can be used for entity consolidation. However, to achieve better results, the variant values (i.e., values that are logically the same with different formats) in the clusters need to be consolidated before applying these methods. For this purpose, we propose a data-driven method to standardize the variant values based on two observations: (1) the variant values usually can be transformed to the same representation (e.g., 'Mary Lee' and 'Lee, Mary') and (2) the same transformation often appears repeatedly across different clusters (e.g., transpose the first and last name). Our approach first uses an unsupervised method to generate groups of value pairs that can be transformed in the same way. Then the groups are presented to a human for verification and the approved ones are used to standardize the data. In a real-world dataset with 17,497 records, our method achieved 75% recall and 99.5% precision in standardizing variant values by asking a human 100 yes/no questions, which completely outperformed a state of the art data wrangling tool.

KW - Data editing
KW - Data integration
KW - Entity consolidation
KW - Program synthesis
KW - String transformation
UR - http://www.scopus.com/inward/record.url?scp=85066076601&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85066076601&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2019.00026
DO - 10.1109/ICDE.2019.00026
M3 - Conference contribution
AN - SCOPUS:85066076601
T3 - Proceedings - International Conference on Data Engineering
SP - 196
EP - 207
BT - Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019
PB - IEEE Computer Society
ER -