A generalized topic modeling approach for automatic document annotation

Suppawong Tuarob, Line C. Pouchard, Prasenjit Mitra, C. Lee Giles

Research output: Contribution to journal › Article

17 Citations (Scopus)

Abstract

Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify their hypotheses. Over time, such data have increased not only in amount but also in the diversity and heterogeneity of their sources, which are spread throughout the world. This heterogeneity poses a major challenge for scientists who have to search manually for the data they need. ONEMercury has recently been implemented as part of the DataONE project to alleviate such problems and to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata records from multiple archives and repositories and makes them searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which can impede effective retrieval. We propose a methodology that learns annotations from well-annotated collections of metadata records and uses them to automatically annotate poorly annotated ones. The problem is first transformed into a tag recommendation problem with a controlled tag library. Then, two variants of an algorithm for automatic tag recommendation are presented. Experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the nature of the different datasets. We also discuss related topics such as using topical coherence to fine-tune parameters and experiments on cross-archive annotation.
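
As a rough illustration of the reformulation described in the abstract, the sketch below recommends tags for a poorly annotated record by borrowing tags from similar well-annotated records. It is not the authors' algorithm: it substitutes plain bag-of-words cosine similarity for the topic model, and the function names, the `top_k` parameter, and the sample records are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_tags(query_text, annotated_records, top_k=3):
    """Rank tags from the controlled library for a poorly annotated record.

    annotated_records is a list of (text, tags) pairs from a well-annotated
    collection; the union of their tags plays the role of the tag library.
    A tag's score is the sum of the similarities between the query and
    every record that carries the tag.
    """
    query_vec = Counter(tokenize(query_text))
    scores = Counter()
    for text, tags in annotated_records:
        sim = cosine(query_vec, Counter(tokenize(text)))
        for tag in tags:
            scores[tag] += sim
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, score in ranked[:top_k] if score > 0]


# Hypothetical well-annotated records (assumption, not data from the paper).
records = [
    ("soil moisture measurements from forest plots", ["soil", "forest"]),
    ("river discharge and water temperature time series", ["hydrology", "water"]),
    ("forest canopy carbon flux observations", ["forest", "carbon"]),
]

print(recommend_tags("carbon flux in forest soil", records, top_k=2))
```

With these sample records the query shares the most vocabulary with the third record, so its tags rank first. A topic-model variant would compare records in topic space instead of raw term space, while the similarity-weighted voting step could stay the same.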

Original language: English
Pages (from-to): 111-128
Number of pages: 18
Journal: International Journal on Digital Libraries
Volume: 16
Issue number: 2
DOI: 10.1007/s00799-015-0146-2
Publication status: Published - 7 Mar 2015

Keywords

  • Metadata annotation
  • Tag recommendation
  • Topic model

ASJC Scopus subject areas

  • Library and Information Sciences

Cite this

A generalized topic modeling approach for automatic document annotation. / Tuarob, Suppawong; Pouchard, Line C.; Mitra, Prasenjit; Giles, C. Lee.

In: International Journal on Digital Libraries, Vol. 16, No. 2, 07.03.2015, p. 111-128.

@article{9d7d8e17237b45fdbb033bed53ebe427,
title = "A generalized topic modeling approach for automatic document annotation",
keywords = "Metadata annotation, Tag recommendation, Topic model",
author = "Suppawong Tuarob and Pouchard, {Line C.} and Prasenjit Mitra and Giles, {C. Lee}",
year = "2015",
month = "3",
day = "7",
doi = "10.1007/s00799-015-0146-2",
language = "English",
volume = "16",
pages = "111--128",
journal = "International Journal on Digital Libraries",
issn = "1432-5012",
publisher = "Springer Verlag",
number = "2"
}

TY - JOUR
T1 - A generalized topic modeling approach for automatic document annotation
AU - Tuarob, Suppawong
AU - Pouchard, Line C.
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2015/3/7
Y1 - 2015/3/7
KW - Metadata annotation
KW - Tag recommendation
KW - Topic model
UR - http://www.scopus.com/inward/record.url?scp=84929943959&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84929943959&partnerID=8YFLogxK
U2 - 10.1007/s00799-015-0146-2
DO - 10.1007/s00799-015-0146-2
M3 - Article
AN - SCOPUS:84929943959
VL - 16
SP - 111
EP - 128
JO - International Journal on Digital Libraries
JF - International Journal on Digital Libraries
SN - 1432-5012
IS - 2
ER -