Synthesizing entity matching rules by examples

Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed Elmagarmid, Samuel Madden, Paolo Papotti, Jorge Arnulfo Quiané-Ruiz, Armando Solar-Lezama, Nan Tang

Research output: Contribution to journal > Conference article

12 Citations (Scopus)

Abstract

Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given high-level specification, via a predefined grammar. This grammar describes a General Boolean Formula (GBF) that can include arbitrary attribute matching predicates combined by conjunctions (∧), disjunctions (∨) and negations (¬), and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. The rules in the form of GBF are more concise than traditional EM rules represented in Disjunctive Normal Form (DNF). Consequently, they are more interpretable than decision trees and other machine learning algorithms that output deep trees with many branches. We present a new synthesis algorithm that, given only positive-negative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that we outperform other interpretable rules (e.g., decision trees with low depth) in effectiveness, and are comparable with non-interpretable tools (e.g., decision trees with high depth, gradient-boosting trees, random forests and SVM).
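
To make the rule shape described in the abstract concrete, the following is a minimal sketch of a GBF as a small expression tree evaluated over a pair of records. It is not the paper's synthesis algorithm or its grammar; the class names (`Pred`, `IsNull`, `Not`, `And`, `Or`), the `evaluate` function, the `ratio` similarity, the thresholds, and the sample records are all assumptions made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional, Union

# A record maps attribute names to values; None models a missing value.
Record = dict[str, Optional[str]]


def ratio(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (illustrative choice only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


@dataclass(frozen=True)
class Pred:
    """Attribute-matching predicate: sim(r[attr], s[attr]) >= threshold."""
    attr: str
    threshold: float
    sim: Callable[[str, str], float] = ratio


@dataclass(frozen=True)
class IsNull:
    """True when the attribute is missing in at least one of the records."""
    attr: str


@dataclass(frozen=True)
class Not:
    arg: "Formula"


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Or:
    left: "Formula"
    right: "Formula"


Formula = Union[Pred, IsNull, Not, And, Or]


def evaluate(f: Formula, r: Record, s: Record) -> bool:
    """Decide whether the record pair (r, s) satisfies formula f."""
    if isinstance(f, Pred):
        a, b = r.get(f.attr), s.get(f.attr)
        # A similarity predicate never holds on a missing value.
        return a is not None and b is not None and f.sim(a, b) >= f.threshold
    if isinstance(f, IsNull):
        return r.get(f.attr) is None or s.get(f.attr) is None
    if isinstance(f, Not):
        return not evaluate(f.arg, r, s)
    if isinstance(f, And):
        return evaluate(f.left, r, s) and evaluate(f.right, r, s)
    if isinstance(f, Or):
        return evaluate(f.left, r, s) or evaluate(f.right, r, s)
    raise TypeError(f"unknown formula node: {f!r}")


# Hypothetical rule: name ∧ (phone ∨ null(phone))
rule = And(Pred("name", 0.8), Or(Pred("phone", 1.0), IsNull("phone")))

r = {"name": "Jon Smith", "phone": "555-0101"}
s = {"name": "John Smith", "phone": None}
t = {"name": "John Smith", "phone": "555-0199"}

print(evaluate(rule, r, s))  # True: names are similar, phone is missing
print(evaluate(rule, r, t))  # False: both phones are present but differ
```

Written in DNF, the same hypothetical rule would be (name ∧ phone) ∨ (name ∧ null(phone)), which repeats the name predicate in every clause; the nested form states it once, which is the kind of conciseness the abstract attributes to GBF.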

Original language: English
Pages (from-to): 189-202
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 11
DOIs: 10.14778/3149193.3149199
Publication status: Published - 1 Jan 2018
Event: 44th International Conference on Very Large Data Bases, VLDB 2018 - Rio de Janeiro, Brazil
Duration: 27 Aug 2017 - 31 Aug 2017

Fingerprint

Decision trees
Data integration
Learning algorithms
Learning systems
Specifications
Experiments

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Synthesizing entity matching rules by examples. / Singh, Rohit; Meduri, Venkata Vamsikrishna; Elmagarmid, Ahmed; Madden, Samuel; Papotti, Paolo; Quiané-Ruiz, Jorge Arnulfo; Solar-Lezama, Armando; Tang, Nan.

In: Proceedings of the VLDB Endowment, Vol. 11, 01.01.2018, p. 189-202.

Singh, Rohit ; Meduri, Venkata Vamsikrishna ; Elmagarmid, Ahmed ; Madden, Samuel ; Papotti, Paolo ; Quiané-Ruiz, Jorge Arnulfo ; Solar-Lezama, Armando ; Tang, Nan. / Synthesizing entity matching rules by examples. In: Proceedings of the VLDB Endowment. 2018 ; Vol. 11. pp. 189-202.
@article{985d84dfba7b4890a75f60b709cb5f2a,
title = "Synthesizing entity matching rules by examples",
abstract = "Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given high-level specification, via a predefined grammar. This grammar describes a General Boolean Formula (GBF) that can include arbitrary attribute matching predicates combined by conjunctions (∧), disjunctions (∨) and negations (¬), and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. The rules in the form of GBF are more concise than traditional EM rules represented in Disjunctive Normal Form (DNF). Consequently, they are more interpretable than decision trees and other machine learning algorithms that output deep trees with many branches. We present a new synthesis algorithm that, given only positive-negative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that we outperform other interpretable rules (e.g., decision trees with low depth) in effectiveness, and are comparable with non-interpretable tools (e.g., decision trees with high depth, gradient-boosting trees, random forests and SVM).",
author = "Rohit Singh and Meduri, {Venkata Vamsikrishna} and Ahmed Elmagarmid and Samuel Madden and Paolo Papotti and Quian{\'e}-Ruiz, {Jorge Arnulfo} and Armando Solar-Lezama and Nan Tang",
year = "2018",
month = "1",
day = "1",
doi = "10.14778/3149193.3149199",
language = "English",
volume = "11",
pages = "189--202",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "Very Large Data Base Endowment Inc.",

}

TY - JOUR

T1 - Synthesizing entity matching rules by examples

AU - Singh, Rohit

AU - Meduri, Venkata Vamsikrishna

AU - Elmagarmid, Ahmed

AU - Madden, Samuel

AU - Papotti, Paolo

AU - Quiané-Ruiz, Jorge Arnulfo

AU - Solar-Lezama, Armando

AU - Tang, Nan

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given high-level specification, via a predefined grammar. This grammar describes a General Boolean Formula (GBF) that can include arbitrary attribute matching predicates combined by conjunctions (∧), disjunctions (∨) and negations (¬), and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. The rules in the form of GBF are more concise than traditional EM rules represented in Disjunctive Normal Form (DNF). Consequently, they are more interpretable than decision trees and other machine learning algorithms that output deep trees with many branches. We present a new synthesis algorithm that, given only positive-negative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that we outperform other interpretable rules (e.g., decision trees with low depth) in effectiveness, and are comparable with non-interpretable tools (e.g., decision trees with high depth, gradient-boosting trees, random forests and SVM).

AB - Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given high-level specification, via a predefined grammar. This grammar describes a General Boolean Formula (GBF) that can include arbitrary attribute matching predicates combined by conjunctions (∧), disjunctions (∨) and negations (¬), and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. The rules in the form of GBF are more concise than traditional EM rules represented in Disjunctive Normal Form (DNF). Consequently, they are more interpretable than decision trees and other machine learning algorithms that output deep trees with many branches. We present a new synthesis algorithm that, given only positive-negative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that we outperform other interpretable rules (e.g., decision trees with low depth) in effectiveness, and are comparable with non-interpretable tools (e.g., decision trees with high depth, gradient-boosting trees, random forests and SVM).

UR - http://www.scopus.com/inward/record.url?scp=85074643202&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85074643202&partnerID=8YFLogxK

U2 - 10.14778/3149193.3149199

DO - 10.14778/3149193.3149199

M3 - Conference article

AN - SCOPUS:85074643202

VL - 11

SP - 189

EP - 202

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

ER -