Improving researcher homepage classification with unlabeled data

Sujatha Das Gollapalli, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changing content on the Web? We investigate this question in the context of identifying researcher homepages. We show experimentally that classifiers trained on existing datasets of academic homepages underperform on "non-homepages" present on current-day academic websites. As an alternative to obtaining labeled datasets to retrain classifiers for the new content, in this article we ask the following question: "How can we effectively use the unlabeled data readily available from academic websites to improve researcher homepage classification?" We design novel URL-based features and use them in conjunction with content-based features for representing homepages. Within the co-training framework, these sets of features can be treated as complementary views enabling us to effectively use unlabeled data and obtain remarkable improvements in homepage identification on the current-day academic websites. We also propose a novel technique for "learning a conforming pair of classifiers" that mimics co-training. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We argue that this loss formulation provides insights for understanding co-training and can be used even in the absence of a validation dataset. Our next set of findings pertains to the evaluation of other state-of-the-art techniques for classifying homepages. First, we apply feature selection (FS) and feature hashing (FH) techniques independently and in conjunction with co-training to academic homepages. FS is a well-known technique for removing redundant and unnecessary features from the data representation, whereas FH is a technique that uses hash functions for efficient encoding of features. We show that FS can be effectively combined with co-training to obtain further improvements in identifying homepages. However, using hashed feature representations, a performance degradation is observed, possibly due to feature collisions. Finally, we evaluate other semisupervised algorithms for homepage classification. We show that although several algorithms are effective in using information from the unlabeled instances, co-training that explicitly harnesses the feature split in the underlying instances outperforms approaches that combine content and URL features into a single view.
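The co-training scheme described in the abstract — two classifiers, one per feature view (URL tokens vs. page content), each passing its most confident predictions on unlabeled pages to the other's training set — can be sketched as follows. The tiny Naive Bayes models, token features, and toy data below are illustrative assumptions only, not the authors' implementation:

```python
# Minimal co-training sketch (toy data; not the paper's actual code).
import math
from collections import Counter

def train_nb(examples):
    """Fit per-class token counts for a tiny multinomial Naive Bayes."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter()
    for tokens, label in examples:
        counts[label].update(tokens)
        priors[label] += 1
    return counts, priors

def predict_nb(model, tokens):
    """Return (label, confidence margin) under add-one smoothing."""
    counts, priors = model
    total = sum(priors.values())
    vocab = set(counts[0]) | set(counts[1])
    scores = {}
    for label in (0, 1):
        n = sum(counts[label].values())
        s = math.log((priors[label] + 1) / (total + 2))
        for t in tokens:
            s += math.log((counts[label][t] + 1) / (n + len(vocab) + 1))
        scores[label] = s
    best = max(scores, key=scores.get)
    return best, abs(scores[0] - scores[1])

def co_train(labeled, unlabeled, rounds=3):
    """Each view's classifier pseudo-labels pages for the other view."""
    # labeled: list of ((url_tokens, content_tokens), label) pairs
    lab_url = [(u, y) for (u, c), y in labeled]
    lab_txt = [(c, y) for (u, c), y in labeled]
    pool = list(unlabeled)  # list of (url_tokens, content_tokens)
    for _ in range(rounds):
        if not pool:
            break
        m_url, m_txt = train_nb(lab_url), train_nb(lab_txt)
        # The URL view's most confident page is labeled for the content view.
        page = max(pool, key=lambda p: predict_nb(m_url, p[0])[1])
        lab_txt.append((page[1], predict_nb(m_url, page[0])[0]))
        pool.remove(page)
        if not pool:
            break
        # ...and vice versa for the content view.
        page = max(pool, key=lambda p: predict_nb(m_txt, p[1])[1])
        lab_url.append((page[0], predict_nb(m_txt, page[1])[0]))
        pool.remove(page)
    return train_nb(lab_url), train_nb(lab_txt)

labeled = [((["~smith", "edu"], ["professor", "publications"]), 1),
           ((["courses", "cs101"], ["syllabus", "homework"]), 0)]
unlabeled = [(["~jones", "edu"], ["professor", "research"]),
             (["courses", "cs202"], ["syllabus", "grading"])]
m_url, m_txt = co_train(labeled, unlabeled)
```

The key property the paper relies on is visible here: pseudo-labels cross between views, so each classifier is trained on labels the other view could justify, which is what lets unlabeled pages inject information the original labeled set lacked.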
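The "conforming pair of classifiers" idea — minimizing an objective that combines supervised loss per view with a penalty on the two views' disagreement over unlabeled data — might be caricatured with linear scorers and gradient descent. Everything below (the squared losses, the data shapes, the hyperparameters) is a hedged sketch of the general idea, not the paper's actual formulation:

```python
# Caricature of a "conforming pair" objective: supervised squared loss on
# each view plus a penalty on the views' disagreement over unlabeled points.
# Synthetic data; illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
Xu1 = rng.normal(size=(50, 3))   # unlabeled pages, view 1 (e.g., URL features)
Xu2 = rng.normal(size=(50, 2))   # unlabeled pages, view 2 (e.g., content)
Xl1 = rng.normal(size=(10, 3))   # labeled pages, view 1
Xl2 = rng.normal(size=(10, 2))   # labeled pages, view 2
y = np.array([1., 0., 1., 1., 0., 0., 1., 0., 1., 0.])

w1, w2 = np.zeros(3), np.zeros(2)
lr, lam = 0.05, 0.5              # step size and disagreement weight
for _ in range(300):
    d = Xu1 @ w1 - Xu2 @ w2      # disagreement between the two views
    g1 = Xl1.T @ (Xl1 @ w1 - y) / 10 + lam * Xu1.T @ d / 50
    g2 = Xl2.T @ (Xl2 @ w2 - y) / 10 - lam * Xu2.T @ d / 50
    w1 -= lr * g1
    w2 -= lr * g2
```

Because the disagreement term needs no labels, an objective of this shape can be monitored on unlabeled data alone — consistent with the abstract's claim that the formulation is usable even without a validation set.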
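The feature-collision effect the abstract offers as a possible cause of FH's degradation is easy to demonstrate: with fewer hash buckets than distinct tokens, at least two features must share a bucket. The deterministic CRC32 hash and token names below are purely illustrative:

```python
# The hashing trick: count tokens into a fixed number of buckets. With more
# distinct tokens than buckets, collisions are unavoidable (pigeonhole),
# which is one plausible source of the degradation reported above.
import zlib

def hash_features(tokens, n_buckets=8):
    """Count tokens into n_buckets slots via a deterministic CRC32 hash."""
    vec = [0] * n_buckets
    for t in tokens:
        vec[zlib.crc32(t.encode()) % n_buckets] += 1
    return vec

# Nine distinct tokens into eight buckets: some bucket must receive two,
# so two unrelated features become indistinguishable to the classifier.
vec = hash_features([f"feat{i}" for i in range(9)], n_buckets=8)
```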

Original language: English
Article number: 17
Journal: ACM Transactions on the Web
ISSN: 1559-1131
Publisher: Association for Computing Machinery (ACM)
Volume: 9
Issue number: 4
DOI: 10.1145/2767135
Publication status: Published - 1 Sep 2015
Externally published: Yes


Keywords

  • Co-training
  • Conforming classifiers
  • Researcher homepage classification
  • Unlabeled data

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

Das Gollapalli, S., Caragea, C., Mitra, P., & Giles, C. L. (2015). Improving researcher homepage classification with unlabeled data. ACM Transactions on the Web, 9(4), [17]. https://doi.org/10.1145/2767135

