Link-based characterization and detection of web spam

Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, Ricardo Baeza-Yates

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

87 Citations (Scopus)

Abstract

We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Using this approach we are able to detect 80.4% of the Web spam in our sample, with only 1.1% of false positives.
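
The two headline figures in the abstract can be illustrated with a minimal sketch of how a detection rate and a false-positive rate might be computed from labeled predictions. The function name and the toy data below are hypothetical and are not taken from the paper; the sketch also assumes the false-positive figure is measured as a fraction of non-spam pages, which the abstract does not specify.

```python
def detection_and_false_positive_rates(labels, predictions):
    """Return (detection_rate, false_positive_rate).

    labels[i] is True when page i is known spam; predictions[i] is True
    when a (hypothetical) classifier flags page i as spam.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    # Spam pages that the classifier flagged.
    true_pos = sum(1 for y, p in zip(labels, predictions) if y and p)
    # Non-spam pages that the classifier wrongly flagged.
    false_pos = sum(1 for y, p in zip(labels, predictions) if not y and p)

    n_spam = sum(1 for y in labels if y)
    n_non_spam = len(labels) - n_spam
    if n_spam == 0 or n_non_spam == 0:
        raise ValueError("need at least one spam and one non-spam example")

    return true_pos / n_spam, false_pos / n_non_spam


# Toy data (illustrative only): 4 spam pages and 6 non-spam pages.
labels = [True, True, True, True, False, False, False, False, False, False]
predictions = [True, True, True, False, False, False, True, False, False, False]

detection_rate, false_positive_rate = detection_and_false_positive_rates(
    labels, predictions
)
print(f"detection rate: {detection_rate:.1%}")
print(f"false-positive rate: {false_positive_rate:.1%}")
```

Reporting both numbers together matters because a classifier can trivially raise its detection rate by flagging more pages, at the cost of more false positives.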

Original language: English
Title of host publication: Proceedings of the 2nd Int. Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2006 - 29th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR 2006
Pages: 1-8
Number of pages: 8
Publication status: Published - 1 Dec 2006
Externally published: Yes
Event: 2nd International Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2006 - 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2006 - Seattle, WA, United States
Duration: 10 Aug 2006

Fingerprint

Classifiers
Websites
Statistical methods

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems

Cite this

Becchetti, L., Castillo, C., Donato, D., Leonardi, S., & Baeza-Yates, R. (2006). Link-based characterization and detection of web spam. In Proceedings of the 2nd Int. Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2006 - 29th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR 2006 (pp. 1-8).

@inproceedings{3f72afa3b096496f9244e16d1b81f55e,
title = "Link-based characterization and detection of web spam",
abstract = "We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Using this approach we are able to detect 80.4{\%} of the Web spam in our sample, with only 1.1{\%} of false positives.",
author = "Luca Becchetti and Carlos Castillo and Debora Donato and Stefano Leonardi and Ricardo Baeza-Yates",
year = "2006",
month = "12",
day = "1",
language = "English",
pages = "1--8",
booktitle = "Proceedings of the 2nd Int. Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2006 - 29th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR 2006",

}

TY - GEN

T1 - Link-based characterization and detection of web spam

AU - Becchetti, Luca

AU - Castillo, Carlos

AU - Donato, Debora

AU - Leonardi, Stefano

AU - Baeza-Yates, Ricardo

PY - 2006/12/1

Y1 - 2006/12/1

N2 - We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Using this approach we are able to detect 80.4% of the Web spam in our sample, with only 1.1% of false positives.

AB - We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Using this approach we are able to detect 80.4% of the Web spam in our sample, with only 1.1% of false positives.

UR - http://www.scopus.com/inward/record.url?scp=84876833271&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84876833271&partnerID=8YFLogxK

M3 - Conference contribution

SP - 1

EP - 8

BT - Proceedings of the 2nd Int. Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2006 - 29th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR 2006

ER -