Extracting researcher metadata with labeled features

Sujatha Das Gollapalli, Yanjun Qi, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage and expertise search. Due to the inherent diversity in values for certain metadata fields (e.g., affiliation), supervised algorithms require a large number of labeled examples to accurately identify values for these fields. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with a few fully labeled examples. We study two types of labeled features: (1) dictionary features, which provide unigram hints related to specific metadata fields, and (2) proximity features, which capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach along with labeled features provides significant improvements in the tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improves by 9%.
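As a rough illustration of what a labeled dictionary feature might look like, the Python sketch below maps unigrams to expert-provided label distributions and tags a line of homepage text by summing those distributions. This is not the method from the paper, which the keywords indicate is built on conditional random fields; it is a deliberately simple voting heuristic. The label set, the unigrams, the probabilities, and the function names `tag_line` and `relative_increase` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical label set; the paper's actual metadata fields may differ.
LABELS = ["affiliation", "email", "other"]

# Hypothetical expert-provided labeled features: each unigram maps to a
# distribution over labels. The values are illustrative, not from the paper.
DICTIONARY_FEATURES = {
    "university": {"affiliation": 0.9, "email": 0.0, "other": 0.1},
    "department": {"affiliation": 0.8, "email": 0.0, "other": 0.2},
    "email": {"affiliation": 0.0, "email": 0.7, "other": 0.3},
}


def tag_line(line, features=DICTIONARY_FEATURES, default="other"):
    """Return the label with the highest summed expert probability.

    A simple voting stand-in for illustration only; it does not train
    or use a sequence model.
    """
    scores = defaultdict(float)
    for token in line.lower().split():
        unigram = token.strip(".,:;()")
        for label, prob in features.get(unigram, {}).items():
            scores[label] += prob
    if not scores:
        # No dictionary feature fired on this line.
        return default
    return max(LABELS, key=lambda label: scores[label])


def relative_increase(baseline, improved):
    """Relative change of a metric, e.g. 0.45 for a 45% relative increase."""
    return (improved - baseline) / baseline
```

For example, `tag_line("Department of Computer Science, Pennsylvania State University")` selects `affiliation` under these made-up distributions. With hypothetical F1 values, a baseline of 0.40 and an improved score of 0.58 correspond to a 45% relative increase.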

Original language: English
Title of host publication: SIAM International Conference on Data Mining 2014, SDM 2014
Publisher: Society for Industrial and Applied Mathematics Publications
Pages: 740-748
Number of pages: 9
Volume: 2
ISBN (Print): 9781510811515
DOIs: https://doi.org/10.1137/1.9781611973440.85
Publication status: Published - 2014
Externally published: Yes
Event: 14th SIAM International Conference on Data Mining, SDM 2014 - Philadelphia, United States
Duration: 24 Apr 2014 to 26 Apr 2014

Other

Other: 14th SIAM International Conference on Data Mining, SDM 2014
Country: United States
City: Philadelphia
Period: 24/4/14 to 26/4/14

Keywords

  • Conditional random fields
  • Feature labeling
  • Metadata extraction

ASJC Scopus subject areas

  • Computer Science Applications
  • Software

Cite this

Das Gollapalli, S., Qi, Y., Mitra, P., & Giles, C. L. (2014). Extracting researcher metadata with labeled features. In SIAM International Conference on Data Mining 2014, SDM 2014 (Vol. 2, pp. 740-748). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611973440.85