Researcher homepage classification using unlabeled data

Sujatha Das G, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

20 Citations (Scopus)

Abstract

A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changed content on the Web? We investigate this question in the context of researcher homepage crawling. We show experimentally that classifiers trained on existing datasets for homepage identification underperform while classifying "irrelevant" pages on current-day academic websites. As an alternative to obtaining datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain remarkable improvements in accurately identifying homepages from the current-day university websites. In addition, we propose a novel technique for "learning a conforming pair of classifiers" using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make "similar" predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in absence of a validation set. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
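
The idea of tuning two view-specific classifiers so that they agree on unlabeled pages can be illustrated with a small sketch. The abstract does not give the loss or the model, so everything below is an assumption made for illustration: each view is a logistic-regression scorer, the loss is the mean squared difference between the two predicted probabilities, and the names sigmoid, predict, disagreement and conform are hypothetical. It is not the authors' algorithm.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    # Probability that a page is a homepage under one view (assumed linear model).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def disagreement(w_url, w_content, batch):
    # Assumed loss: mean squared difference between the two views' predictions.
    return sum((predict(w_url, xu) - predict(w_content, xc)) ** 2
               for xu, xc in batch) / len(batch)


def conform(w_url, w_content, unlabeled, lr=0.5, batch_size=2, epochs=50, seed=0):
    # Mini-batch gradient descent on the disagreement loss over unlabeled pairs
    # (url_features, content_features). Returns updated copies of both weight vectors.
    rng = random.Random(seed)
    w_url, w_content = list(w_url), list(w_content)
    data = list(unlabeled)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g_url = [0.0] * len(w_url)
            g_content = [0.0] * len(w_content)
            for xu, xc in batch:
                pu, pc = predict(w_url, xu), predict(w_content, xc)
                diff = pu - pc
                # d/dw of (pu - pc)^2, using dp/dw = p * (1 - p) * x.
                for j, v in enumerate(xu):
                    g_url[j] += 2 * diff * pu * (1 - pu) * v / len(batch)
                for j, v in enumerate(xc):
                    g_content[j] -= 2 * diff * pc * (1 - pc) * v / len(batch)
            w_url = [w - lr * g for w, g in zip(w_url, g_url)]
            w_content = [w - lr * g for w, g in zip(w_content, g_content)]
    return w_url, w_content


if __name__ == "__main__":
    # Toy unlabeled data: each item pairs a URL-view vector with a content-view vector.
    pages = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0]),
             ([1.0, 1.0], [1.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]
    start_url, start_content = [1.0, -1.0], [-1.0, 1.0]
    print("before:", disagreement(start_url, start_content, pages))
    tuned_url, tuned_content = conform(start_url, start_content, pages)
    print("after: ", disagreement(tuned_url, tuned_content, pages))
```

Note that a pure disagreement loss like this one is also minimized by two classifiers that output the same constant, so in this sketch the starting weights would normally come from classifiers already trained on labeled data; how the paper handles this is not stated in the abstract.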

Original language: English
Title of host publication: WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web
Pages: 471-481
Number of pages: 11
Publication status: Published - 2013
Externally published: Yes
Event: 22nd International Conference on World Wide Web, WWW 2013 - Rio de Janeiro, Brazil
Duration: 13 May 2013 to 17 May 2013

Other

Other: 22nd International Conference on World Wide Web, WWW 2013
Country: Brazil
City: Rio de Janeiro
Period: 13/5/13 to 17/5/13

Fingerprint

Classifiers
Websites
World Wide Web
Tuning

Keywords

  • Co-training
  • Consensus maximization
  • Gradient descent

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

Das G, S., Caragea, C., Mitra, P., & Giles, C. L. (2013). Researcher homepage classification using unlabeled data. In WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web (pp. 471-481)

Researcher homepage classification using unlabeled data. / Das G, Sujatha; Caragea, Cornelia; Mitra, Prasenjit; Giles, C. Lee.

WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web. 2013. p. 471-481.

Das G, S, Caragea, C, Mitra, P & Giles, CL 2013, Researcher homepage classification using unlabeled data. in WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web. pp. 471-481, 22nd International Conference on World Wide Web, WWW 2013, Rio de Janeiro, Brazil, 13/5/13.
Das G S, Caragea C, Mitra P, Giles CL. Researcher homepage classification using unlabeled data. In WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web. 2013. p. 471-481
Das G, Sujatha ; Caragea, Cornelia ; Mitra, Prasenjit ; Giles, C. Lee. / Researcher homepage classification using unlabeled data. WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web. 2013. pp. 471-481
@inproceedings{0ac2d2d8a30f4155a9e720db37430f2c,
title = "Researcher homepage classification using unlabeled data",
abstract = "A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changed content on the Web? We investigate this question in the context of researcher homepage crawling. We show experimentally that classifiers trained on existing datasets for homepage identification underperform while classifying {"}irrelevant{"} pages on current-day academic websites. As an alternative to obtaining datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain remarkable improvements in accurately identifying homepages from the current-day university websites. In addition, we propose a novel technique for {"}learning a conforming pair of classifiers{"} using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make {"}similar{"} predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in absence of a validation set. Copyright is held by the International World Wide Web Conference Committee (IW3C2).",
keywords = "Co-training, Consensus maximization, Gradient descent",
author = "{Das G}, Sujatha and Cornelia Caragea and Prasenjit Mitra and Giles, {C. Lee}",
year = "2013",
language = "English",
isbn = "9781450320351",
pages = "471--481",
booktitle = "WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web",

}

TY - GEN

T1 - Researcher homepage classification using unlabeled data

AU - Das G, Sujatha

AU - Caragea, Cornelia

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2013

Y1 - 2013

AB - A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changed content on the Web? We investigate this question in the context of researcher homepage crawling. We show experimentally that classifiers trained on existing datasets for homepage identification underperform while classifying "irrelevant" pages on current-day academic websites. As an alternative to obtaining datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain remarkable improvements in accurately identifying homepages from the current-day university websites. In addition, we propose a novel technique for "learning a conforming pair of classifiers" using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make "similar" predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in absence of a validation set. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

KW - Co-training

KW - Consensus maximization

KW - Gradient descent

UR - http://www.scopus.com/inward/record.url?scp=84893147440&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84893147440&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781450320351

SP - 471

EP - 481

BT - WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web

ER -