Seeping semantics: linking datasets using word embeddings for data discovery

Raul Castro Fernandez, Essam Mansour, Abdulhakim Qahtan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Employees who spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises and, in some cases, a lack of knowledge of the schemas aggravate this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the meaning of the schema. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher, which leverages word embeddings to find objects that are semantically related. We introduce coherent groups, a technique for combining word embeddings that works better than other state-of-the-art combination alternatives. We implement SEMPROP as part of Aurum, a data discovery system we are building, and conduct user studies, real deployments, and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups for finding those links.
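
The abstract does not define how a coherent group is computed. As a rough illustration only, the sketch below assumes one plausible reading: a set of words forms a coherent group when the average pairwise cosine similarity of their embeddings reaches a threshold. The toy vectors, the function names, and the 0.5 threshold are all hypothetical and are not taken from the paper or from Aurum.

```python
import math
from itertools import combinations

# Hypothetical 3-dimensional toy embeddings, for illustration only.
# A real system would load pretrained word vectors instead.
TOY_EMBEDDINGS = {
    "salary": [0.9, 0.1, 0.0],
    "wage": [0.8, 0.2, 0.1],
    "income": [0.85, 0.15, 0.05],
    "protein": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_coherent_group(words, embeddings, threshold=0.5):
    """Assumed criterion: mean pairwise cosine similarity >= threshold.

    Words missing from the embedding table are skipped; fewer than two
    known words cannot form a group under this definition.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if len(vectors) < 2:
        return False
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return sum(sims) / len(sims) >= threshold


if __name__ == "__main__":
    print(is_coherent_group(["salary", "wage", "income"], TOY_EMBEDDINGS))
    print(is_coherent_group(["salary", "wage", "protein"], TOY_EMBEDDINGS))
```

Under this assumed criterion, a single unrelated word lowers the group's average similarity, so a column name or value set that mixes unrelated terms would not be treated as one semantic unit when looking for links.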

Original language: English
Title of host publication: Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 989-1000
Number of pages: 12
ISBN (Electronic): 9781538655207
DOIs: https://doi.org/10.1109/ICDE.2018.00093
Publication status: Published - 24 Oct 2018
Event: 34th IEEE International Conference on Data Engineering, ICDE 2018 - Paris, France
Duration: 16 Apr 2018 to 19 Apr 2018

Other

Other: 34th IEEE International Conference on Data Engineering, ICDE 2018
Country: France
City: Paris
Period: 16/4/18 to 19/4/18

Keywords

  • Data discovery
  • Word embeddings

ASJC Scopus subject areas

  • Information Systems
  • Information Systems and Management
  • Hardware and Architecture

Cite this

Castro Fernandez, R., Mansour, E., Qahtan, A., Elmagarmid, A., Ilyas, I., Madden, S., ... Tang, N. (2018). Seeping semantics: linking datasets using word embeddings for data discovery. In Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018 (pp. 989-1000). [8509314] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICDE.2018.00093

@inproceedings{8d1a5733c5d14de0a67bc51ef5779f41,
title = "Seeping semantics: linking datasets using word embeddings for data discovery",
abstract = "Employees that spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas aggravates this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other, to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher which leverages word embeddings to find objects that are semantically related. We introduce coherent group, a technique to combine word embeddings that works better than other state of the art combination alternatives. We implement SEMPROP as part of Aurum, a data discovery system we are building, and conduct user studies, real deployments and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups to find those links.",
keywords = "data discovery, Word embeddings",
author = "{Castro Fernandez}, Raul and Essam Mansour and Abdulhakim Qahtan and Ahmed Elmagarmid and Ihab Ilyas and Samuel Madden and Mourad Ouzzani and Michael Stonebraker and Nan Tang",
year = "2018",
month = "10",
day = "24",
doi = "10.1109/ICDE.2018.00093",
language = "English",
pages = "989--1000",
booktitle = "Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - Seeping semantics

T2 - linking datasets using word embeddings for data discovery

AU - Castro Fernandez, Raul

AU - Mansour, Essam

AU - Qahtan, Abdulhakim

AU - Elmagarmid, Ahmed

AU - Ilyas, Ihab

AU - Madden, Samuel

AU - Ouzzani, Mourad

AU - Stonebraker, Michael

AU - Tang, Nan

PY - 2018/10/24

Y1 - 2018/10/24

AB - Employees that spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas aggravates this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other, to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher which leverages word embeddings to find objects that are semantically related. We introduce coherent group, a technique to combine word embeddings that works better than other state of the art combination alternatives. We implement SEMPROP as part of Aurum, a data discovery system we are building, and conduct user studies, real deployments and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups to find those links.

KW - data discovery

KW - Word embeddings

UR - http://www.scopus.com/inward/record.url?scp=85057111346&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85057111346&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2018.00093

DO - 10.1109/ICDE.2018.00093

M3 - Conference contribution

AN - SCOPUS:85057111346

SP - 989

EP - 1000

BT - Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018

PB - Institute of Electrical and Electronics Engineers Inc.

ER -