Generating concise entity matching rules

Rohit Singh, Vamsi Meduri, Ahmed Elmagarmid, Samuel Madden, Paolo Papotti, Jorge Arnulfo Quiane Ruiz, Armando Solar-Lezama, Nan Tang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

Entity matching (EM) is a critical part of data integration and cleaning. In many applications, users need to understand why two entities are considered a match, which calls for interpretable and concise EM rules. We model EM rules as General Boolean Formulas (GBFs), which allow arbitrary attribute-matching predicates combined by conjunctions (∧), disjunctions (∨), and negations (¬). GBFs can express rules more concisely than traditional EM rules represented in disjunctive normal form (DNF). We use program synthesis, a powerful tool for automatically generating rules (or programs) that provably satisfy a high-level specification, to synthesize EM rules in GBF format given only positive and negative matching examples. In this demo, attendees will experience the following features: (1) Interpretability: they can see and measure the conciseness of EM rules defined using GBFs; (2) Easy customization: they can provide custom experiment parameters for various datasets and easily modify a rich predefined (default) synthesis grammar through a Web interface; and (3) High performance: they will be able to compare the generated concise rules, in terms of accuracy, with probabilistic models (e.g., machine learning methods) and hand-written EM rules provided by experts. Moreover, the system, which will be released as an open-source tool on GitHub, will serve as a general platform for evaluating different methods that discover EM rules.
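To make the conciseness claim concrete, the following is a minimal Python sketch, not the authors' system. It encodes rules as nested tuples, checks that a small GBF and its DNF expansion are logically equivalent, and counts atom occurrences as a simple conciseness measure. The attribute names (`name`, `email`, `phone`, `address`), the tuple encoding, and the helper functions are all assumptions made for illustration. The `synthesize` function is a brute-force enumeration over rule sizes, standing in for the program-synthesis engine the abstract refers to; it only illustrates the idea of finding a small rule consistent with positive and negative examples.

```python
from itertools import product

# A rule is a nested tuple (hypothetical encoding, not from the paper):
#   ("atom", name) | ("not", rule) | ("and", rule, rule) | ("or", rule, rule)
# An atom stands for "the two records match on this attribute".

def atom(name):
    return ("atom", name)

def evaluate(rule, facts):
    """Evaluate a rule against a dict mapping atom name -> bool."""
    op = rule[0]
    if op == "atom":
        return facts[rule[1]]
    if op == "not":
        return not evaluate(rule[1], facts)
    if op == "and":
        return evaluate(rule[1], facts) and evaluate(rule[2], facts)
    if op == "or":
        return evaluate(rule[1], facts) or evaluate(rule[2], facts)
    raise ValueError(f"unknown operator: {op}")

def size(rule):
    """Conciseness measure: number of atom occurrences in the rule."""
    if rule[0] == "atom":
        return 1
    return sum(size(sub) for sub in rule[1:])

# GBF: (name OR email) AND (phone OR address)
gbf = ("and",
       ("or", atom("name"), atom("email")),
       ("or", atom("phone"), atom("address")))

# The same rule expanded into DNF: four conjunctions joined by OR.
dnf = ("or",
       ("or", ("and", atom("name"), atom("phone")),
              ("and", atom("name"), atom("address"))),
       ("or", ("and", atom("email"), atom("phone")),
              ("and", atom("email"), atom("address"))))

# Check equivalence over all 16 truth assignments.
names = ["name", "email", "phone", "address"]
equivalent = all(
    evaluate(gbf, dict(zip(names, vals))) == evaluate(dnf, dict(zip(names, vals)))
    for vals in product([False, True], repeat=len(names))
)

def formulas(atom_names, n):
    """Yield every rule with exactly n atom occurrences (negation on atoms only)."""
    if n == 1:
        for name in atom_names:
            yield atom(name)
            yield ("not", atom(name))
        return
    for k in range(1, n):
        for left in formulas(atom_names, k):
            for right in formulas(atom_names, n - k):
                yield ("and", left, right)
                yield ("or", left, right)

def synthesize(atom_names, positives, negatives, max_size=4):
    """Return the first smallest rule that accepts all positives and rejects all negatives."""
    for n in range(1, max_size + 1):
        for rule in formulas(atom_names, n):
            if (all(evaluate(rule, p) for p in positives)
                    and not any(evaluate(rule, q) for q in negatives)):
                return rule
    return None  # no consistent rule within the size bound

# Toy examples: each dict records which attributes of a record pair match.
positives = [{"name": True, "phone": True}]
negatives = [{"name": True, "phone": False}, {"name": False, "phone": True}]
learned = synthesize(["name", "phone"], positives, negatives)

print(size(gbf), size(dnf), equivalent)
print(learned)
```

In this sketch the GBF uses four atom occurrences where the equivalent DNF needs eight, and the gap grows as more disjunctions are conjoined. Enumerating by increasing size is only practical for tiny grammars; it is used here because it makes the "smallest consistent rule" objective easy to see.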

Original language: English
Title of host publication: SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1635-1638
Number of pages: 4
Volume: Part F127746
ISBN (Electronic): 9781450341974
DOI: https://doi.org/10.1145/3035918.3058739
Publication status: Published - 9 May 2017
Event: 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017 - Chicago, United States
Duration: 14 May 2017 to 19 May 2017

Other

Other: 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017
Country: United States
City: Chicago
Period: 14/5/17 to 19/5/17

Fingerprint

Data integration
Learning systems
Cleaning
Specifications
Experiments
Statistical Models

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Singh, R., Meduri, V., Elmagarmid, A., Madden, S., Papotti, P., Quiane Ruiz, J. A., ... Tang, N. (2017). Generating concise entity matching rules. In SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data (Vol. Part F127746, pp. 1635-1638). Association for Computing Machinery. https://doi.org/10.1145/3035918.3058739

Generating concise entity matching rules. / Singh, Rohit; Meduri, Vamsi; Elmagarmid, Ahmed; Madden, Samuel; Papotti, Paolo; Quiane Ruiz, Jorge Arnulfo; Solar-Lezama, Armando; Tang, Nan.

SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data. Vol. Part F127746 Association for Computing Machinery, 2017. p. 1635-1638.

Singh, R, Meduri, V, Elmagarmid, A, Madden, S, Papotti, P, Quiane Ruiz, JA, Solar-Lezama, A & Tang, N 2017, Generating concise entity matching rules. in SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data. vol. Part F127746, Association for Computing Machinery, pp. 1635-1638, 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017, Chicago, United States, 14/5/17. https://doi.org/10.1145/3035918.3058739
Singh R, Meduri V, Elmagarmid A, Madden S, Papotti P, Quiane Ruiz JA et al. Generating concise entity matching rules. In SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data. Vol. Part F127746. Association for Computing Machinery. 2017. p. 1635-1638 https://doi.org/10.1145/3035918.3058739
Singh, Rohit ; Meduri, Vamsi ; Elmagarmid, Ahmed ; Madden, Samuel ; Papotti, Paolo ; Quiane Ruiz, Jorge Arnulfo ; Solar-Lezama, Armando ; Tang, Nan. / Generating concise entity matching rules. SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data. Vol. Part F127746 Association for Computing Machinery, 2017. pp. 1635-1638
@inproceedings{c857ddc0fd3d499f964f13d0e20430fe,
title = "Generating concise entity matching rules",
abstract = "Entity matching (EM) is a critical part of data integration and cleaning. In many applications, users need to understand why two entities are considered a match, which calls for interpretable and concise EM rules. We model EM rules as General Boolean Formulas (GBFs), which allow arbitrary attribute-matching predicates combined by conjunctions ($\wedge$), disjunctions ($\vee$), and negations ($\neg$). GBFs can express rules more concisely than traditional EM rules represented in disjunctive normal form (DNF). We use program synthesis, a powerful tool for automatically generating rules (or programs) that provably satisfy a high-level specification, to synthesize EM rules in GBF format given only positive and negative matching examples. In this demo, attendees will experience the following features: (1) Interpretability: they can see and measure the conciseness of EM rules defined using GBFs; (2) Easy customization: they can provide custom experiment parameters for various datasets and easily modify a rich predefined (default) synthesis grammar through a Web interface; and (3) High performance: they will be able to compare the generated concise rules, in terms of accuracy, with probabilistic models (e.g., machine learning methods) and hand-written EM rules provided by experts. Moreover, the system, which will be released as an open-source tool on GitHub, will serve as a general platform for evaluating different methods that discover EM rules.",
author = "Rohit Singh and Vamsi Meduri and Ahmed Elmagarmid and Samuel Madden and Paolo Papotti and {Quiane Ruiz}, {Jorge Arnulfo} and Armando Solar-Lezama and Nan Tang",
year = "2017",
month = "5",
day = "9",
doi = "10.1145/3035918.3058739",
language = "English",
volume = "Part F127746",
pages = "1635--1638",
booktitle = "SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - Generating concise entity matching rules

AU - Singh, Rohit

AU - Meduri, Vamsi

AU - Elmagarmid, Ahmed

AU - Madden, Samuel

AU - Papotti, Paolo

AU - Quiane Ruiz, Jorge Arnulfo

AU - Solar-Lezama, Armando

AU - Tang, Nan

PY - 2017/5/9

Y1 - 2017/5/9

AB - Entity matching (EM) is a critical part of data integration and cleaning. In many applications, users need to understand why two entities are considered a match, which calls for interpretable and concise EM rules. We model EM rules as General Boolean Formulas (GBFs), which allow arbitrary attribute-matching predicates combined by conjunctions (∧), disjunctions (∨), and negations (¬). GBFs can express rules more concisely than traditional EM rules represented in disjunctive normal form (DNF). We use program synthesis, a powerful tool for automatically generating rules (or programs) that provably satisfy a high-level specification, to synthesize EM rules in GBF format given only positive and negative matching examples. In this demo, attendees will experience the following features: (1) Interpretability: they can see and measure the conciseness of EM rules defined using GBFs; (2) Easy customization: they can provide custom experiment parameters for various datasets and easily modify a rich predefined (default) synthesis grammar through a Web interface; and (3) High performance: they will be able to compare the generated concise rules, in terms of accuracy, with probabilistic models (e.g., machine learning methods) and hand-written EM rules provided by experts. Moreover, the system, which will be released as an open-source tool on GitHub, will serve as a general platform for evaluating different methods that discover EM rules.

UR - http://www.scopus.com/inward/record.url?scp=85021214599&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85021214599&partnerID=8YFLogxK

U2 - 10.1145/3035918.3058739

DO - 10.1145/3035918.3058739

M3 - Conference contribution

VL - Part F127746

SP - 1635

EP - 1638

BT - SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data

PB - Association for Computing Machinery

ER -