Link analysis for Web spam detection

Luca Becchetti, Carlos Castillo, Debora Donato, Ricardo Baeza-Yates, Stefano Leonardi

Research output: Contribution to journal (Article)

72 Citations (Scopus)

Abstract

We propose link-based techniques for automatic detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The use of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means that, in practice, many algorithms are infeasible. We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. These statistical features are used to build Web spam classifiers which only consider the link structure of the Web, regardless of page contents. We then present a study of the performance of each of the classifiers alone, as well as their combined performance, by testing them over a large collection of Web link spam. After tenfold cross-validation, our best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, but are orthogonal to content-based methods.
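The abstract names probabilistic counting over the Web graph but gives no details of it. As a rough illustration only, the sketch below estimates how many pages can reach a given page within a few link hops by propagating small bit sketches along in-links. It assumes a Flajolet-Martin style estimator, which may differ from the estimator used in the paper, and the names `supporter_sketches`, `estimate_count`, and `in_links` are hypothetical.

```python
import random


def supporter_sketches(in_links, distance, bits=64, seed=0):
    """Propagate bit sketches along in-links for `distance` rounds.

    in_links maps each page to the list of pages that link to it.
    Returns (initial, sketch): the per-page starting bitmask and the
    bitmask after propagation. Assumed layout, not the paper's.
    """
    rng = random.Random(seed)
    nodes = set(in_links) | {u for srcs in in_links.values() for u in srcs}

    def random_bit():
        # Bit i is chosen with probability 2^-(i+1), as in Flajolet-Martin.
        i = 0
        while i < bits - 1 and rng.random() < 0.5:
            i += 1
        return 1 << i

    # Sorted order keeps the assignment reproducible for a fixed seed.
    initial = {v: random_bit() for v in sorted(nodes)}
    sketch = dict(initial)
    for _ in range(distance):
        nxt = {}
        for v in nodes:
            mask = sketch[v]
            for u in in_links.get(v, ()):
                mask |= sketch[u]  # union of the sets behind both sketches
            nxt[v] = mask
        sketch = nxt
    return initial, sketch


def estimate_count(mask):
    """Flajolet-Martin estimate from the position of the lowest zero bit."""
    r = 0
    while mask & (1 << r):
        r += 1
    return (2 ** r) / 0.77351


if __name__ == "__main__":
    # Toy graph: a -> b -> c and d -> c.
    toy = {"b": ["a"], "c": ["b", "d"]}
    _, sketches = supporter_sketches(toy, distance=2)
    for page in sorted(sketches):
        print(page, estimate_count(sketches[page]))
```

A single sketch per page has high variance, so a practical version would average many independent sketches per page. The appeal of the approach is that each round is one sequential pass over the edge list with a fixed amount of state per page, which is what makes it feasible on large graphs.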

Original language: English
Article number: 2
Journal: ACM Transactions on the Web
Volume: 2
Issue number: 1
DOIs: https://doi.org/10.1145/1326561.1326563
Publication status: Published - 1 Feb 2008
Externally published: Yes

Fingerprint

Classifiers
Websites
Search engines
World Wide Web
Statistical methods
Statistics
Testing

Keywords

  • Adversarial information retrieval
  • Link analysis

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., & Leonardi, S. (2008). Link analysis for Web spam detection. ACM Transactions on the Web, 2(1), [2]. https://doi.org/10.1145/1326561.1326563

Link analysis for Web spam detection. / Becchetti, Luca; Castillo, Carlos; Donato, Debora; Baeza-Yates, Ricardo; Leonardi, Stefano.

In: ACM Transactions on the Web, Vol. 2, No. 1, 2, 01.02.2008.


Becchetti, L, Castillo, C, Donato, D, Baeza-Yates, R & Leonardi, S 2008, 'Link analysis for Web spam detection', ACM Transactions on the Web, vol. 2, no. 1, 2. https://doi.org/10.1145/1326561.1326563

Becchetti L, Castillo C, Donato D, Baeza-Yates R, Leonardi S. Link analysis for Web spam detection. ACM Transactions on the Web. 2008 Feb 1;2(1). 2. https://doi.org/10.1145/1326561.1326563

Becchetti, Luca ; Castillo, Carlos ; Donato, Debora ; Baeza-Yates, Ricardo ; Leonardi, Stefano. / Link analysis for Web spam detection. In: ACM Transactions on the Web. 2008 ; Vol. 2, No. 1.

@article{921ae56810b7466d95d4671cffbc6667,
title = "Link analysis for Web spam detection",
abstract = "We propose link-based techniques for automatic detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The use of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means that, in practice, many algorithms are infeasible. We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. These statistical features are used to build Web spam classifiers which only consider the link structure of the Web, regardless of page contents. We then present a study of the performance of each of the classifiers alone, as well as their combined performance, by testing them over a large collection of Web link spam. After tenfold cross-validation, our best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, but are orthogonal to content-based methods.",
keywords = "Adversarial information retrieval, Link analysis",
author = "Luca Becchetti and Carlos Castillo and Debora Donato and Ricardo Baeza-Yates and Stefano Leonardi",
year = "2008",
month = "2",
day = "1",
doi = "10.1145/1326561.1326563",
language = "English",
volume = "2",
journal = "ACM Transactions on the Web",
issn = "1559-1131",
publisher = "Association for Computing Machinery (ACM)",
number = "1",

}

TY  - JOUR
T1  - Link analysis for Web spam detection
AU  - Becchetti, Luca
AU  - Castillo, Carlos
AU  - Donato, Debora
AU  - Baeza-Yates, Ricardo
AU  - Leonardi, Stefano
PY  - 2008/2/1
Y1  - 2008/2/1
N2  - We propose link-based techniques for automatic detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The use of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means that, in practice, many algorithms are infeasible. We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. These statistical features are used to build Web spam classifiers which only consider the link structure of the Web, regardless of page contents. We then present a study of the performance of each of the classifiers alone, as well as their combined performance, by testing them over a large collection of Web link spam. After tenfold cross-validation, our best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, but are orthogonal to content-based methods.
KW  - Adversarial information retrieval
KW  - Link analysis
UR  - http://www.scopus.com/inward/record.url?scp=40949116672&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=40949116672&partnerID=8YFLogxK
U2  - 10.1145/1326561.1326563
DO  - 10.1145/1326561.1326563
M3  - Article
AN  - SCOPUS:40949116672
VL  - 2
JO  - ACM Transactions on the Web
JF  - ACM Transactions on the Web
SN  - 1559-1131
IS  - 1
M1  - 2
ER  -