Know your neighbors: Web spam detection using the web topology

Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, Fabrizio Silvestri

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

199 Citations (Scopus)

Abstract

Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.
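
Two of the three methods named in the abstract can be illustrated with a minimal sketch: (i) relabeling every host in a cluster by majority vote, and the feature-extraction step of (iii), where the predicted labels of neighboring hosts become a new input feature. This is not the authors' implementation. The function names, the "spam"/"nonspam" label strings, the toy graph, and the choice of "fraction of spam neighbors" as the derived feature are all assumptions made for illustration, and the base classifier and the clustering step are taken as given.

```python
from collections import Counter, defaultdict


def smooth_by_cluster_majority(predicted, cluster_of):
    """Method (i), sketched: give every host the majority predicted label of its cluster."""
    votes = defaultdict(Counter)
    for host, label in predicted.items():
        votes[cluster_of[host]][label] += 1
    # On a tie, the label counted first in that cluster wins.
    majority = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return {host: majority[cluster_of[host]] for host in predicted}


def neighbor_spam_fraction(predicted, neighbors):
    """Method (iii), feature step only: fraction of a host's neighbors predicted as spam."""
    feature = {}
    for host in predicted:
        # Ignore neighbors for which the base classifier produced no prediction.
        linked = [n for n in neighbors.get(host, ()) if n in predicted]
        spam = sum(1 for n in linked if predicted[n] == "spam")
        feature[host] = spam / len(linked) if linked else 0.0
    return feature


# Hypothetical toy data: base-classifier output, cluster ids, and host-graph links.
predicted = {"a": "spam", "b": "spam", "c": "nonspam", "d": "nonspam"}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"], "d": []}

smoothed = smooth_by_cluster_majority(predicted, cluster_of)
feature = neighbor_spam_fraction(predicted, neighbors)
```

In this toy example host "c" sits in a cluster dominated by spam predictions, so the majority vote flips it to "spam". For method (iii), the values in `feature` would be appended to each host's original feature vector before the classifier is retrained.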

Original language: English
Title of host publication: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
Pages: 423-430
Number of pages: 8
DOI: https://doi.org/10.1145/1277741.1277814
Publication status: Published - 30 Nov 2007
Externally published: Yes
Event: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 - Amsterdam, Netherlands
Duration: 23 Jul 2007 to 27 Jul 2007

Keywords

  • Content spam
  • Link spam
  • Web spam

ASJC Scopus subject areas

  • Information Systems
  • Software
  • Applied Mathematics

Cite this

Castillo, C., Donato, D., Gionis, A., Murdock, V., & Silvestri, F. (2007). Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 (pp. 423-430) https://doi.org/10.1145/1277741.1277814

@inproceedings{ad4efd6967a741569caac6e69a59a455,
title = "Know your neighbors: Web spam detection using the web topology",
abstract = "Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.",
keywords = "Content spam, Link spam, Web spam",
author = "Carlos Castillo and Debora Donato and Aristides Gionis and Vanessa Murdock and Fabrizio Silvestri",
year = "2007",
month = "11",
day = "30",
doi = "10.1145/1277741.1277814",
language = "English",
isbn = "1595935975",
pages = "423--430",
booktitle = "Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07",

}

TY - GEN

T1 - Know your neighbors

T2 - Web spam detection using the web topology

AU - Castillo, Carlos

AU - Donato, Debora

AU - Gionis, Aristides

AU - Murdock, Vanessa

AU - Silvestri, Fabrizio

PY - 2007/11/30

Y1 - 2007/11/30

AB - Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.

KW - Content spam

KW - Link spam

KW - Web spam

UR - http://www.scopus.com/inward/record.url?scp=36448992581&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=36448992581&partnerID=8YFLogxK

U2 - 10.1145/1277741.1277814

DO - 10.1145/1277741.1277814

M3 - Conference contribution

AN - SCOPUS:36448992581

SN - 1595935975

SN - 9781595935977

SP - 423

EP - 430

BT - Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07

ER -