On identifying academic homepages for digital libraries

Sujatha Das Gollapalli, C. Lee Giles, Prasenjit Mitra, Cornelia Caragea

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

15 Citations (Scopus)

Abstract

Academic homepages are rich sources of information on scientific research and researchers. Most researchers provide information about themselves and links to their research publications on their homepages. In this study, we address the following questions related to academic homepages: (1) How many academic homepages are there on the Web? (2) Can we accurately discriminate between academic homepages and other webpages? and (3) What information can be extracted about researchers from their homepages? To address the first question, we use mark-recapture techniques commonly employed in biometrics to estimate animal population sizes. Our results indicate that academic homepages comprise a small fraction of the Web, making automatic methods for discriminating them crucial. We study the performance of content-based features for classifying webpages. We propose the use of topic models for identifying content-based features for classification and show that a small set of LDA-based features outperforms term features selected using traditional techniques such as aggregate term frequencies or mutual information. Finally, we deal with the extraction of name and research-interest information from an academic homepage. Term-topic associations obtained from topic models are used to design a novel, unsupervised technique to identify short segments corresponding to the research interests of the researchers specified in academic homepages. We show the efficacy of our proposed methods on all three tasks by experimentally evaluating them on multiple publicly available datasets.
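The mark-recapture idea mentioned in the abstract can be illustrated with the classical Lincoln-Petersen estimator and Chapman's bias-corrected variant. This is a minimal sketch of the general technique only: the abstract does not say which estimator the paper uses, and the function names and sample counts below are hypothetical.

```python
def lincoln_petersen(first_sample: int, second_sample: int, recaptured: int) -> float:
    """Estimate population size N as (n1 * n2) / m.

    first_sample  (n1): items "marked" in the first sample
    second_sample (n2): items drawn in the second, independent sample
    recaptured    (m):  items that appear in both samples
    """
    if recaptured <= 0:
        raise ValueError("the estimator is undefined when no items are recaptured")
    return first_sample * second_sample / recaptured


def chapman(first_sample: int, second_sample: int, recaptured: int) -> float:
    """Chapman's variant, which is less biased for small samples
    and stays defined when recaptured == 0."""
    return (first_sample + 1) * (second_sample + 1) / (recaptured + 1) - 1


if __name__ == "__main__":
    # Hypothetical counts: two independent samples of homepages
    # with 20 pages found in both.
    n1, n2, m = 100, 80, 20
    print(lincoln_petersen(n1, n2, m))
    print(round(chapman(n1, n2, m), 2))
```

Both estimators assume a closed population and two independent samples in which every item is equally likely to be drawn; samples of webpages may not satisfy these assumptions.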

Original language: English
Title of host publication: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
Pages: 123-132
Number of pages: 10
DOI: https://doi.org/10.1145/1998076.1998099
Publication status: Published - 2011
Externally published: Yes
Event: 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL'11 - Ottawa, ON
Duration: 13 Jun 2011 to 17 Jun 2011

Fingerprint

Digital libraries
Biometrics
Animals

Keywords

  • latent dirichlet allocation
  • mark-recapture techniques
  • topic mixtures
  • webpage classification

ASJC Scopus subject areas

  • Engineering (all)

Cite this

Gollapalli, S. D., Giles, C. L., Mitra, P., & Caragea, C. (2011). On identifying academic homepages for digital libraries. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (pp. 123-132). https://doi.org/10.1145/1998076.1998099

On identifying academic homepages for digital libraries. / Gollapalli, Sujatha Das; Giles, C. Lee; Mitra, Prasenjit; Caragea, Cornelia.

Proceedings of the ACM/IEEE Joint Conference on Digital Libraries. 2011. p. 123-132.

Gollapalli, SD, Giles, CL, Mitra, P & Caragea, C 2011, On identifying academic homepages for digital libraries. in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries. pp. 123-132, 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL'11, Ottawa, ON, 13/6/11. https://doi.org/10.1145/1998076.1998099
Gollapalli SD, Giles CL, Mitra P, Caragea C. On identifying academic homepages for digital libraries. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries. 2011. p. 123-132. https://doi.org/10.1145/1998076.1998099
Gollapalli, Sujatha Das ; Giles, C. Lee ; Mitra, Prasenjit ; Caragea, Cornelia. / On identifying academic homepages for digital libraries. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries. 2011. pp. 123-132
@inproceedings{382a5ff3f86645aeaee4b9d5c358a4a7,
title = "On identifying academic homepages for digital libraries",
keywords = "latent dirichlet allocation, mark-recapture techniques, topic mixtures, webpage classification",
author = "Gollapalli, {Sujatha Das} and Giles, {C. Lee} and Mitra, Prasenjit and Caragea, Cornelia",
year = "2011",
doi = "10.1145/1998076.1998099",
language = "English",
isbn = "9781450307444",
pages = "123--132",
booktitle = "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries",
}

TY - GEN

T1 - On identifying academic homepages for digital libraries

AU - Gollapalli, Sujatha Das

AU - Giles, C. Lee

AU - Mitra, Prasenjit

AU - Caragea, Cornelia

PY - 2011

Y1 - 2011

KW - latent dirichlet allocation

KW - mark-recapture techniques

KW - topic mixtures

KW - webpage classification

UR - http://www.scopus.com/inward/record.url?scp=79960522068&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79960522068&partnerID=8YFLogxK

U2 - 10.1145/1998076.1998099

DO - 10.1145/1998076.1998099

M3 - Conference contribution

AN - SCOPUS:79960522068

SN - 9781450307444

SP - 123

EP - 132

BT - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries

ER -