Extracting researcher metadata with labeled features

Sujatha Das Gollapalli, Yanjun Qi, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage and expertise search. Due to inherent diversity in values for certain metadata fields (e.g., affiliation), supervised algorithms require a large number of labeled examples for accurately identifying values for these fields. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with a few fully labeled examples. We study two types of labeled features: (1) dictionary features, which provide unigram hints related to specific metadata fields, and (2) proximity features, which capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach along with labeled features provides significant improvements in the tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improves by 9%.
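
As a rough illustration of what a dictionary-style labeled feature might look like, the sketch below maps a few unigrams to expert-provided distributions over metadata labels and looks them up for the tokens of a homepage line. This is a minimal sketch, not the paper's implementation: the unigrams, label names, probabilities, and the function names `dictionary_hints` and `expected_label_mass` are all assumptions made for illustration, and the training procedure that would consume such distributions is not shown.

```python
from collections import Counter

# Hypothetical labeled features: each unigram maps to an expert-provided
# distribution over metadata labels. All values are illustrative only.
LABELED_FEATURES = {
    "university": {"affiliation": 0.9, "other": 0.1},
    "professor": {"position": 0.8, "other": 0.2},
    "email": {"email": 0.7, "other": 0.3},
}


def dictionary_hints(tokens):
    """Return, for each token, its expert label distribution or None."""
    return [LABELED_FEATURES.get(tok.lower()) for tok in tokens]


def expected_label_mass(tokens):
    """Sum the expert distributions over all tokens that match a labeled unigram."""
    totals = Counter()
    for hint in dictionary_hints(tokens):
        if hint is not None:
            totals.update(hint)  # adds the probabilities label by label
    return dict(totals)


if __name__ == "__main__":
    line = "Associate Professor , Pennsylvania State University".split()
    print(dictionary_hints(line))
    print(expected_label_mass(line))
```

In a semi-supervised setting, distributions like these would typically act as soft hints alongside a few fully labeled homepages rather than as hard rules; how the hints are weighted against the labeled examples is left open here.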

Original language: English
Title of host publication: SIAM International Conference on Data Mining 2014, SDM 2014
Publisher: Society for Industrial and Applied Mathematics Publications
Pages: 740-748
Number of pages: 9
Volume: 2
ISBN (Print): 9781510811515
DOIs: https://doi.org/10.1137/1.9781611973440.85
Publication status: Published - 2014
Externally published: Yes
Event: 14th SIAM International Conference on Data Mining, SDM 2014 - Philadelphia, United States
Duration: 24 Apr 2014 to 26 Apr 2014

Other

Other: 14th SIAM International Conference on Data Mining, SDM 2014
Country: United States
City: Philadelphia
Period: 24/4/14 to 26/4/14

Fingerprint

Metadata
Labeling
Digital libraries
Glossaries
Learning systems
Experiments

Keywords

  • Conditional random fields
  • Feature labeling
  • Metadata extraction

ASJC Scopus subject areas

  • Computer Science Applications
  • Software

Cite this

Das Gollapalli, S., Qi, Y., Mitra, P., & Giles, C. L. (2014). Extracting researcher metadata with labeled features. In SIAM International Conference on Data Mining 2014, SDM 2014 (Vol. 2, pp. 740-748). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611973440.85

Extracting researcher metadata with labeled features. / Das Gollapalli, Sujatha; Qi, Yanjun; Mitra, Prasenjit; Giles, C. Lee.

SIAM International Conference on Data Mining 2014, SDM 2014. Vol. 2 Society for Industrial and Applied Mathematics Publications, 2014. p. 740-748.

Das Gollapalli, S, Qi, Y, Mitra, P & Giles, CL 2014, Extracting researcher metadata with labeled features. in SIAM International Conference on Data Mining 2014, SDM 2014. vol. 2, Society for Industrial and Applied Mathematics Publications, pp. 740-748, 14th SIAM International Conference on Data Mining, SDM 2014, Philadelphia, United States, 24/4/14. https://doi.org/10.1137/1.9781611973440.85
Das Gollapalli S, Qi Y, Mitra P, Giles CL. Extracting researcher metadata with labeled features. In SIAM International Conference on Data Mining 2014, SDM 2014. Vol. 2. Society for Industrial and Applied Mathematics Publications. 2014. p. 740-748 https://doi.org/10.1137/1.9781611973440.85
Das Gollapalli, Sujatha ; Qi, Yanjun ; Mitra, Prasenjit ; Giles, C. Lee. / Extracting researcher metadata with labeled features. SIAM International Conference on Data Mining 2014, SDM 2014. Vol. 2 Society for Industrial and Applied Mathematics Publications, 2014. pp. 740-748
@inproceedings{a2a7fd7de1ea4fb697d05efc81da28ff,
title = "Extracting researcher metadata with labeled features",
abstract = "Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage and expertise search. Due to inherent diversity in values for certain metadata fields (e.g., affiliation), supervised algorithms require a large number of labeled examples for accurately identifying values for these fields. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with a few fully labeled examples. We study two types of labeled features: (1) dictionary features, which provide unigram hints related to specific metadata fields, and (2) proximity features, which capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach along with labeled features provides significant improvements in the tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45{\%} relative increase in the F1 value for the affiliation field, while the overall F1 improves by 9{\%}.",
keywords = "Conditional random fields, Feature labeling, Metadata extraction",
author = "{Das Gollapalli}, Sujatha and Yanjun Qi and Prasenjit Mitra and Giles, {C. Lee}",
year = "2014",
doi = "10.1137/1.9781611973440.85",
language = "English",
isbn = "9781510811515",
volume = "2",
pages = "740--748",
booktitle = "SIAM International Conference on Data Mining 2014, SDM 2014",
publisher = "Society for Industrial and Applied Mathematics Publications",

}

TY  - GEN
T1  - Extracting researcher metadata with labeled features
AU  - Das Gollapalli, Sujatha
AU  - Qi, Yanjun
AU  - Mitra, Prasenjit
AU  - Giles, C. Lee
PY  - 2014
Y1  - 2014
N2  - Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage and expertise search. Due to inherent diversity in values for certain metadata fields (e.g., affiliation), supervised algorithms require a large number of labeled examples for accurately identifying values for these fields. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with a few fully labeled examples. We study two types of labeled features: (1) dictionary features, which provide unigram hints related to specific metadata fields, and (2) proximity features, which capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach along with labeled features provides significant improvements in the tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improves by 9%.
AB  - Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage and expertise search. Due to inherent diversity in values for certain metadata fields (e.g., affiliation), supervised algorithms require a large number of labeled examples for accurately identifying values for these fields. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with a few fully labeled examples. We study two types of labeled features: (1) dictionary features, which provide unigram hints related to specific metadata fields, and (2) proximity features, which capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach along with labeled features provides significant improvements in the tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improves by 9%.
KW  - Conditional random fields
KW  - Feature labeling
KW  - Metadata extraction
UR  - http://www.scopus.com/inward/record.url?scp=84959872994&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84959872994&partnerID=8YFLogxK
U2  - 10.1137/1.9781611973440.85
DO  - 10.1137/1.9781611973440.85
M3  - Conference contribution
SN  - 9781510811515
VL  - 2
SP  - 740
EP  - 748
BT  - SIAM International Conference on Data Mining 2014, SDM 2014
PB  - Society for Industrial and Applied Mathematics Publications
ER  -