Distributed representations of tuples for entity resolution

Research output: Contribution to journal › Conference article

8 Citations (Scopus)

Abstract

Despite 70+ years of effort on all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy and high efficiency, as well as ease of use (i.e., much less human effort). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available and the case where they are not; we present ways to learn and tune distributed representations customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks than traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, and multilingual data), and the extensive experimental results show that DeepER outperforms existing solutions.
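To make the approach concrete, here is a minimal, self-contained sketch of the two steps the abstract describes: composing a tuple into a single distributed representation from word embeddings, and deriving an LSH block key from that vector. This is an illustration under stated assumptions, not DeepER's implementation: deterministic hashed random vectors stand in for pre-trained embeddings such as GloVe, simple averaging stands in for the learned LSTM composition, and the names DIM, K, word_vec, tuple_vec, and block_key are hypothetical.

    import numpy as np
    import zlib

    DIM = 50  # embedding dimensionality (GloVe-style vectors are typically 50-300d)
    K = 8     # hash bits per block key; fewer bits -> larger, coarser blocks

    def word_vec(token: str) -> np.ndarray:
        """Stand-in for a pre-trained word-embedding lookup: a deterministic
        random vector per token (replace with real GloVe/word2vec vectors)."""
        seed = zlib.crc32(token.encode("utf-8"))
        return np.random.default_rng(seed).standard_normal(DIM)

    def tuple_vec(tup) -> np.ndarray:
        """Compose a tuple into one vector by averaging token vectors per
        attribute, then averaging across attributes. (The paper's stronger
        variant replaces this averaging with uni-/bi-directional LSTM
        composition learned for the specific ER task.)"""
        attr_vecs = [
            np.mean([word_vec(t) for t in str(value).lower().split()], axis=0)
            for value in tup
            if str(value).strip()
        ]
        return np.mean(attr_vecs, axis=0)

    def cosine(u: np.ndarray, v: np.ndarray) -> float:
        """Similarity between two tuple vectors, usable as a matching feature."""
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Random-hyperplane (SimHash-style) LSH over the *whole* tuple vector,
    # so every attribute influences the block key.
    _hyperplanes = np.random.default_rng(1).standard_normal((K, DIM))

    def block_key(vec: np.ndarray) -> str:
        """Tuples whose vectors point in similar directions agree on most
        hyperplane signs, so they tend to land in the same block."""
        return "".join("1" if h @ vec > 0 else "0" for h in _hyperplanes)

    if __name__ == "__main__":
        t1 = ("apple iphone 7 32gb", "smartphone", "apple")
        t2 = ("iphone 7 (32 gb)", "mobile phone", "apple inc")
        v1, v2 = tuple_vec(t1), tuple_vec(t2)
        print("cosine similarity:", round(cosine(v1, v2), 3))
        print("same block:", block_key(v1) == block_key(v2))

Because the key is computed over the composed vector, all attributes of a tuple influence blocking; the number of hash bits K (and, in practice, the number of hash tables) controls the trade-off between block size and recall.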

Original language: English
Pages (from-to): 1454-1467
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 11
Issue number: 11
DOIs: 10.14778/3236187.3236198
Publication status: Published - 2018
Event: 44th International Conference on Very Large Data Bases, VLDB 2018 - Rio de Janeiro, Brazil
Duration: 27 Aug 2018 - 31 Aug 2018

Fingerprint

Recurrent neural networks
Labeling
Tuning
Long short-term memory
Deep learning

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Distributed representations of tuples for entity resolution. / Ebraheem, Muhammad; Thirumuruganathan, Saravanan; Rayhan Joty, Shafiq; Ouzzani, Mourad; Tang, Nan.

In: Proceedings of the VLDB Endowment, Vol. 11, No. 11, 2018, p. 1454-1467.

Research output: Contribution to journal › Conference article

@article{d14a564f47e842089905b05bb84d7749,
  title = "Distributed representations of tuples for entity resolution",
  abstract = "Despite 70+ years of effort on all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy and high efficiency, as well as ease of use (i.e., much less human effort). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available and the case where they are not; we present ways to learn and tune distributed representations customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks than traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, and multilingual data), and the extensive experimental results show that DeepER outperforms existing solutions.",
  author = "Muhammad Ebraheem and Saravanan Thirumuruganathan and {Rayhan Joty}, Shafiq and Mourad Ouzzani and Nan Tang",
  year = "2018",
  doi = "10.14778/3236187.3236198",
  language = "English",
  volume = "11",
  pages = "1454--1467",
  journal = "Proceedings of the VLDB Endowment",
  issn = "2150-8097",
  publisher = "Very Large Data Base Endowment Inc.",
  number = "11",
}

TY - JOUR
T1 - Distributed representations of tuples for entity resolution
AU - Ebraheem, Muhammad
AU - Thirumuruganathan, Saravanan
AU - Rayhan Joty, Shafiq
AU - Ouzzani, Mourad
AU - Tang, Nan
PY - 2018
Y1 - 2018
N2 - Despite 70+ years of effort on all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy and high efficiency, as well as ease of use (i.e., much less human effort). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available and the case where they are not; we present ways to learn and tune distributed representations customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks than traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, and multilingual data), and the extensive experimental results show that DeepER outperforms existing solutions.
AB - Despite 70+ years of effort on all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy and high efficiency, as well as ease of use (i.e., much less human effort). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available and the case where they are not; we present ways to learn and tune distributed representations customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks than traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, and multilingual data), and the extensive experimental results show that DeepER outperforms existing solutions.
UR - http://www.scopus.com/inward/record.url?scp=85058879102&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85058879102&partnerID=8YFLogxK
U2 - 10.14778/3236187.3236198
DO - 10.14778/3236187.3236198
M3 - Conference article
VL - 11
SP - 1454
EP - 1467
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
SN - 2150-8097
IS - 11
ER -