Discovering mis-categorized entities

Shuang Hao, Nan Tang, Guoliang Li, Jianhua Feng

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

1 Citation (Scopus)

Abstract

Entity categorization, the process of grouping entities into categories for some specific purpose, is an important problem with many applications, such as Google Scholar and Amazon products. Unfortunately, in practice, many entities are mis-categorized. In this paper, we study the problem of discovering mis-categorized entities from a given group of entities. This problem is inherently hard: all entities within the same group have already been 'well' categorized by state-of-the-art solutions, so it is nontrivial to differentiate them. We propose a novel rule-based framework to solve this problem. It first uses positive rules to compute disjoint partitions of entities, where the largest partition is taken as the correctly categorized partition, namely the pivot partition. It then uses negative rules to identify mis-categorized entities in the other partitions, that is, entities that are dissimilar to those in the pivot partition. We describe optimizations for applying these rules and discuss how to generate positive and negative rules. Extensive experimental results on two real-world datasets show the effectiveness of our solution.
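
The two-phase idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the rule definitions, the toy records, and the names positive_rule, negative_rule, partition and find_miscategorized are all assumptions made for illustration. In particular, the abstract does not say how negative rules are evaluated against the pivot partition; here an entity is flagged only if the negative rule holds against every pivot entity.

```python
from itertools import combinations

# Toy records (hypothetical data): publications listed on one author's page.
ENTITIES = [
    {"authors": frozenset({"alice", "bob", "carol"}), "venue": "DB Conf"},
    {"authors": frozenset({"alice", "bob"}), "venue": "DB Conf"},
    {"authors": frozenset({"alice", "bob", "carol"}), "venue": "Data Journal"},
    {"authors": frozenset({"alice", "dave"}), "venue": "Chem Letters"},
    {"authors": frozenset({"erin", "frank"}), "venue": "Chem Letters"},
]

def positive_rule(a, b):
    # Assumed rule: two entities belong together if they share >= 2 authors.
    return len(a["authors"] & b["authors"]) >= 2

def negative_rule(a, b):
    # Assumed rule: two entities are dissimilar if they share no author
    # and appear in different venues.
    return not (a["authors"] & b["authors"]) and a["venue"] != b["venue"]

def partition(entities, pos):
    """Group entity indices into disjoint partitions via union-find."""
    parent = list(range(len(entities)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(entities)), 2):
        if pos(entities[i], entities[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(entities)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def find_miscategorized(entities, pos, neg):
    """Return (pivot partition, flagged entity indices)."""
    parts = partition(entities, pos)
    pivot = max(parts, key=len)  # largest partition is the pivot
    flagged = []
    for part in parts:
        if part is pivot:
            continue
        for i in part:
            # Assumption: flag only if dissimilar to every pivot entity.
            if all(neg(entities[i], entities[p]) for p in pivot):
                flagged.append(i)
    return pivot, flagged

pivot, flagged = find_miscategorized(ENTITIES, positive_rule, negative_rule)
print(pivot, flagged)  # prints [0, 1, 2] [4]
```

In this toy run, entity 3 shares one author with the pivot partition, so the assumed negative rule does not fire and it is left undecided rather than flagged. The pairwise comparison is quadratic in the number of entities; the optimizations mentioned in the abstract are not reproduced here.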

Original language: English
Title of host publication: Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 413-424
Number of pages: 12
ISBN (Electronic): 9781538655207
DOI: https://doi.org/10.1109/ICDE.2018.00045
Publication status: Published - 24 Oct 2018
Event: 34th IEEE International Conference on Data Engineering, ICDE 2018 - Paris, France
Duration: 16 Apr 2018 to 19 Apr 2018

Other

Other: 34th IEEE International Conference on Data Engineering, ICDE 2018
Country: France
City: Paris
Period: 16/4/18 to 19/4/18

Fingerprint

Grouping
Amazon
Google Scholar
Rule-based

Keywords

  • mis categorized entity
  • Rule based framework
  • rule generation
  • Signature

ASJC Scopus subject areas

  • Information Systems
  • Information Systems and Management
  • Hardware and Architecture

Cite this

Hao, S., Tang, N., Li, G., & Feng, J. (2018). Discovering mis-categorized entities. In Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018 (pp. 413-424). [8509266] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICDE.2018.00045

@inproceedings{852de1f5446c4447915f9769a414b21e,
title = "Discovering mis-categorized entities",
abstract = "Entity categorization, the process of grouping entities into categories for some specific purpose, is an important problem with many applications, such as Google Scholar and Amazon products. Unfortunately, in practice, many entities are mis-categorized. In this paper, we study the problem of discovering mis-categorized entities from a given group of entities. This problem is inherently hard: all entities within the same group have already been 'well' categorized by state-of-the-art solutions, so it is nontrivial to differentiate them. We propose a novel rule-based framework to solve this problem. It first uses positive rules to compute disjoint partitions of entities, where the largest partition is taken as the correctly categorized partition, namely the pivot partition. It then uses negative rules to identify mis-categorized entities in the other partitions, that is, entities that are dissimilar to those in the pivot partition. We describe optimizations for applying these rules and discuss how to generate positive and negative rules. Extensive experimental results on two real-world datasets show the effectiveness of our solution.",
keywords = "mis categorized entity, Rule based framework, rule generation, Signature",
author = "Shuang Hao and Nan Tang and Guoliang Li and Jianhua Feng",
year = "2018",
month = "10",
day = "24",
doi = "10.1109/ICDE.2018.00045",
language = "English",
pages = "413--424",
booktitle = "Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}
