A real-time heuristic-based unsupervised method for name disambiguation in digital libraries

Muhammad Imran, Syed Zeeshan Haider Gillani, Maurizio Marchese

Research output: Contribution to journal (Article)

4 Citations (Scopus)

Abstract

This paper addresses the problem of name disambiguation in the context of digital libraries that administer bibliographic citations. The problem occurs when multiple authors share a common name or when multiple name variations for an author appear in citation records. Name disambiguation is not a trivial task, and most digital libraries do not provide an efficient way to accurately identify the citation records for an author. Furthermore, lack of complete meta-data information in digital libraries hinders the development of a generic algorithm that can be applicable to any dataset. We propose a heuristic-based, unsupervised and adaptive method that also examines users' interactions in order to include users' feedback in the disambiguation process. Moreover, the method exploits important features associated with author and citation records, such as co-authors, affiliation, publication title, venue, etc., creating a multilayered hierarchical clustering algorithm which transforms itself according to the available information, and forms clusters of unambiguous records. Our experiments on a set of researchers' names considered to be highly ambiguous produced high precision and recall results, and decisively affirmed the viability of our algorithm.
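The layered clustering idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify in detail; it is a minimal illustration under stated assumptions. The record fields (`coauthors`, `affiliation`, `venue`), the matching rules, the layer order, and the function name `disambiguate` are all hypothetical. Each layer merges records that agree on one feature, and a layer is skipped when no record carries that feature, which is one simple way a method could adapt to the available metadata.

```python
from itertools import combinations


def shares_coauthor(a, b):
    """True if two records have at least one co-author name in common."""
    return bool(set(a.get("coauthors", ())) & set(b.get("coauthors", ())))


def same_field(field):
    """Build a matcher that compares one string field case-insensitively."""
    def match(a, b):
        x, y = a.get(field), b.get(field)
        return bool(x) and bool(y) and x.strip().lower() == y.strip().lower()
    return match


# Hypothetical layer order: features assumed more reliable come first.
LAYERS = [
    ("coauthors", shares_coauthor),
    ("affiliation", same_field("affiliation")),
    ("venue", same_field("venue")),
]


def disambiguate(records, layers=LAYERS):
    """Group record indices into clusters, one layer (feature) at a time."""
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for field, match in layers:
        # Skip a layer when no record carries the feature at all.
        if not any(r.get(field) for r in records):
            continue
        for i, j in combinations(range(len(records)), 2):
            if find(i) != find(j) and match(records[i], records[j]):
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up citation records for one ambiguous author name.
records = [
    {"coauthors": ["A. Example"], "affiliation": "Institute X"},
    {"coauthors": ["A. Example", "B. Sample"]},
    {"coauthors": ["C. Placeholder"], "affiliation": "institute x"},
    {"coauthors": ["D. Other"], "affiliation": "Institute Y"},
]
print(disambiguate(records))
```

Merging on a single shared feature is deliberately crude: two different people with the same name can publish in the same venue, so a practical system would need stricter rules per layer and, as the abstract notes, a way to fold in user feedback.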

Original language: English
Journal: D-Lib Magazine
Volume: 19
Issue number: 9-10
DOI: 10.1045/september2013-imran
Publication status: Published - 1 Sep 2013

Keywords

  • Bibliographic data
  • Digital libraries
  • Name disambiguation

ASJC Scopus subject areas

  • Library and Information Sciences

Cite this

A real-time heuristic-based unsupervised method for name disambiguation in digital libraries. / Imran, Muhammad; Gillani, Syed Zeeshan Haider; Marchese, Maurizio.

In: D-Lib Magazine, Vol. 19, No. 9-10, 01.09.2013.

@article{c65ef26f2d9f4955b5d79f18b37cfadf,
title = "A real-time heuristic-based unsupervised method for name disambiguation in digital libraries",
abstract = "This paper addresses the problem of name disambiguation in the context of digital libraries that administer bibliographic citations. The problem occurs when multiple authors share a common name or when multiple name variations for an author appear in citation records. Name disambiguation is not a trivial task, and most digital libraries do not provide an efficient way to accurately identify the citation records for an author. Furthermore, lack of complete meta-data information in digital libraries hinders the development of a generic algorithm that can be applicable to any dataset. We propose a heuristic-based, unsupervised and adaptive method that also examines users' interactions in order to include users' feedback in the disambiguation process. Moreover, the method exploits important features associated with author and citation records, such as co-authors, affiliation, publication title, venue, etc., creating a multilayered hierarchical clustering algorithm which transforms itself according to the available information, and forms clusters of unambiguous records. Our experiments on a set of researchers' names considered to be highly ambiguous produced high precision and recall results, and decisively affirmed the viability of our algorithm.",
keywords = "Bibliographic data, Digital libraries, Name disambiguation",
author = "Muhammad Imran and Gillani, {Syed Zeeshan Haider} and Maurizio Marchese",
year = "2013",
month = "9",
day = "1",
doi = "10.1045/september2013-imran",
language = "English",
volume = "19",
journal = "D-Lib Magazine",
issn = "1082-9873",
publisher = "Corporation for National Research Initiatives",
number = "9-10",

}

TY - JOUR

T1 - A real-time heuristic-based unsupervised method for name disambiguation in digital libraries

AU - Imran, Muhammad

AU - Gillani, Syed Zeeshan Haider

AU - Marchese, Maurizio

PY - 2013/9/1

Y1 - 2013/9/1

AB - This paper addresses the problem of name disambiguation in the context of digital libraries that administer bibliographic citations. The problem occurs when multiple authors share a common name or when multiple name variations for an author appear in citation records. Name disambiguation is not a trivial task, and most digital libraries do not provide an efficient way to accurately identify the citation records for an author. Furthermore, lack of complete meta-data information in digital libraries hinders the development of a generic algorithm that can be applicable to any dataset. We propose a heuristic-based, unsupervised and adaptive method that also examines users' interactions in order to include users' feedback in the disambiguation process. Moreover, the method exploits important features associated with author and citation records, such as co-authors, affiliation, publication title, venue, etc., creating a multilayered hierarchical clustering algorithm which transforms itself according to the available information, and forms clusters of unambiguous records. Our experiments on a set of researchers' names considered to be highly ambiguous produced high precision and recall results, and decisively affirmed the viability of our algorithm.

KW - Bibliographic data

KW - Digital libraries

KW - Name disambiguation

UR - http://www.scopus.com/inward/record.url?scp=84886261928&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84886261928&partnerID=8YFLogxK

U2 - 10.1045/september2013-imran

DO - 10.1045/september2013-imran

M3 - Article

VL - 19

JO - D-Lib Magazine

JF - D-Lib Magazine

SN - 1082-9873

IS - 9-10

ER -