Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion

Lei Cen, Eduard C. Dragut, Luo Si, Mourad Ouzzani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

14 Citations (Scopus)

Abstract

Entity disambiguation is an important step in many information retrieval applications. This paper proposes new research on entity disambiguation, with a focus on name disambiguation in digital libraries. In particular, pairwise similarity is first learned for publications that share the same author name string (ANS), and then a novel Hierarchical Agglomerative Clustering approach with Adaptive Stopping Criterion (HACASC) is proposed to cluster a set of publications that share the same ANS into individual clusters of publications with different author identities. The HACASC approach uses a mixture of kernel ridge regressions to determine the clustering threshold. This yields more appropriate clustering granularity than a non-adaptive stopping criterion. We conduct a large-scale empirical study with a dataset of more than 2 million publication record pairs to demonstrate the advantage of the proposed HACASC approach.
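
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes average linkage, a precomputed pairwise similarity matrix, and a threshold passed in by the caller, whereas the paper predicts the threshold with a mixture of kernel ridge regressions (not shown here). The names `average_similarity` and `hac_with_threshold` and the toy similarity values are hypothetical.

```python
from itertools import combinations


def average_similarity(sim, a, b):
    """Average-linkage similarity between clusters a and b (lists of indices)."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def hac_with_threshold(sim, threshold):
    """Merge the most similar pair of clusters until no pair reaches threshold.

    Assumption: threshold is supplied by the caller. In the paper it is
    chosen adaptively; that part is not implemented in this sketch.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        # Find the pair of current clusters with the highest similarity.
        best_score, best_pair = max(
            (average_similarity(sim, clusters[x], clusters[y]), (x, y))
            for x, y in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break  # stopping criterion: no pair is similar enough to merge
        x, y = best_pair
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy pairwise similarities for four publications sharing one name string.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(hac_with_threshold(sim, threshold=0.5))
print(hac_with_threshold(sim, threshold=0.1))
```

With the toy matrix, a threshold of 0.5 leaves two clusters of publications while a threshold of 0.1 merges everything into one, which shows why the choice of stopping threshold controls clustering granularity.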

Original language: English
Title of host publication: SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval
Pages: 741-744
Number of pages: 4
DOIs: https://doi.org/10.1145/2484028.2484157
Publication status: Published - 2 Sep 2013
Event: 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2013 - Dublin, Ireland
Duration: 28 Jul 2013 to 1 Aug 2013

Fingerprint

Digital libraries
Information retrieval

Keywords

  • Author Disambiguation
  • Clustering

ASJC Scopus subject areas

  • Computer Graphics and Computer-Aided Design
  • Information Systems

Cite this

Cen, L., Dragut, E. C., Si, L., & Ouzzani, M. (2013). Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion. In SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 741-744). https://doi.org/10.1145/2484028.2484157

@inproceedings{9a9c231b7f5b4459b05a0107c49b4719,
title = "Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion",
abstract = "Entity disambiguation is an important step in many information retrieval applications. This paper proposes new research for entity disambiguation with the focus of name disambiguation in digital libraries. In particular, pairwise similarity is first learned for publications that share the same author name string (ANS) and then a novel Hierarchical Agglomerative Clustering approach with Adaptive Stopping Criterion (HACASC) is proposed to adaptively cluster a set of publications that share a same ANS to individual clusters of publications with different author identities. The HACASC approach utilizes a mixture of kernel ridge regressions to intelligently determine the threshold in clustering. This obtains more appropriate clustering granularity than non-adaptive stopping criterion. We conduct a large scale empirical study with a dataset of more than 2 million publication record pairs to demonstrate the advantage of the proposed HACASC approach.",
keywords = "Author Disambiguation, Clustering",
author = "Lei Cen and Dragut, {Eduard C.} and Luo Si and Mourad Ouzzani",
year = "2013",
month = "9",
day = "2",
doi = "10.1145/2484028.2484157",
language = "English",
isbn = "9781450320344",
pages = "741--744",
booktitle = "SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval",

}

TY - GEN

T1 - Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion

AU - Cen, Lei

AU - Dragut, Eduard C.

AU - Si, Luo

AU - Ouzzani, Mourad

PY - 2013/9/2

Y1 - 2013/9/2

AB - Entity disambiguation is an important step in many information retrieval applications. This paper proposes new research for entity disambiguation with the focus of name disambiguation in digital libraries. In particular, pairwise similarity is first learned for publications that share the same author name string (ANS) and then a novel Hierarchical Agglomerative Clustering approach with Adaptive Stopping Criterion (HACASC) is proposed to adaptively cluster a set of publications that share a same ANS to individual clusters of publications with different author identities. The HACASC approach utilizes a mixture of kernel ridge regressions to intelligently determine the threshold in clustering. This obtains more appropriate clustering granularity than non-adaptive stopping criterion. We conduct a large scale empirical study with a dataset of more than 2 million publication record pairs to demonstrate the advantage of the proposed HACASC approach.

KW - Author Disambiguation

KW - Clustering

UR - http://www.scopus.com/inward/record.url?scp=84883099211&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883099211&partnerID=8YFLogxK

U2 - 10.1145/2484028.2484157

DO - 10.1145/2484028.2484157

M3 - Conference contribution

AN - SCOPUS:84883099211

SN - 9781450320344

SP - 741

EP - 744

BT - SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval

ER -