Real datasets for file-sharing peer-to-peer systems

Shen Tat Goh, Panos Kalnis, Spiridon Bakiras, Kian Lee Tan

Research output: Contribution to journal › Article

11 Citations (Scopus)

Abstract

The fundamental drawback of unstructured peer-to-peer (P2P) networks is the flooding-based query processing protocol, which seriously limits their scalability. As a result, a significant amount of research has focused on designing efficient search protocols that reduce the overall communication cost. What is lacking, however, is real data on the exact content of users' libraries and the queries that these users ask. Trace-driven simulations would generate more meaningful results and better illustrate the efficiency of a generic query processing protocol under a real-life scenario. Motivated by this fact, we developed a Gnutella-style probe and collected detailed data over a period of two months. The data involve around 4,500 users and contain the exact files shared by each user, together with any available metadata (e.g., artist for songs) and information about the nodes (e.g., connection speed). We also collected the queries initiated by these users. After filtering, the data were organized in XML format and are available to researchers. Here, we analyze this dataset and present its statistical characteristics. Additionally, as a case study, we employ it to evaluate two recently proposed P2P searching techniques.
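
To illustrate the scalability concern raised in the abstract, the following is a minimal sketch of TTL-limited query flooding on a toy overlay. It is not code from the paper: the function name `flood_query`, the `overlay` adjacency map, and the rule that a peer forwards a query only the first time it sees it are assumptions made for the example.

```python
from collections import deque


def flood_query(adjacency, source, ttl):
    """Flood a query from `source` for at most `ttl` hops.

    Returns (peers_reached, messages_sent). Every forwarded copy counts
    as one message, including copies that arrive at a peer which has
    already seen the query.
    """
    seen = {source}
    messages = 0
    # Each entry: (peer, hops it may still forward, peer it received from)
    frontier = deque([(source, ttl, None)])
    while frontier:
        node, hops_left, sender = frontier.popleft()
        if hops_left == 0:
            continue
        for neighbor in adjacency[node]:
            if neighbor == sender:
                continue  # never send the query straight back
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops_left - 1, node))
    return len(seen) - 1, messages


# Hypothetical four-peer overlay, used only for illustration.
overlay = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
reached, sent = flood_query(overlay, source=0, ttl=2)
print(reached, sent)  # 3 6
```

In this toy overlay, six messages are needed to reach three peers, so half of the traffic is redundant. Replaying recorded user libraries and queries through a simulator of this kind is what the trace-driven evaluation in the abstract refers to.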

Original language: English
Pages (from-to): 201-213
Number of pages: 13
Journal: Lecture Notes in Computer Science
Volume: 3453
Publication status: Published - 2005
Externally published: Yes

Fingerprint

Peer-to-peer Systems
Sharing
Query processing
Network protocols
Peer to peer networks
Metadata
XML
Scalability
Query
Availability
Peer-to-peer (P2P)
Peer-to-peer Networks
P2P Network
Communication Cost
Flooding
Communication
Probe
Filtering
Trace

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Real datasets for file-sharing peer-to-peer systems. / Goh, Shen Tat; Kalnis, Panos; Bakiras, Spiridon; Tan, Kian Lee.

In: Lecture Notes in Computer Science, Vol. 3453, 2005, p. 201-213.

@article{c56c94ff68db4beb830f78882e1be98b,
title = "Real datasets for file-sharing peer-to-peer systems",
author = "Goh, {Shen Tat} and Panos Kalnis and Spiridon Bakiras and Tan, {Kian Lee}",
year = "2005",
language = "English",
volume = "3453",
pages = "201--213",
journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
issn = "0302-9743",
publisher = "Springer Verlag",
}

TY - JOUR

T1 - Real datasets for file-sharing peer-to-peer systems

AU - Goh, Shen Tat

AU - Kalnis, Panos

AU - Bakiras, Spiridon

AU - Tan, Kian Lee

PY - 2005

Y1 - 2005

UR - http://www.scopus.com/inward/record.url?scp=24644440658&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=24644440658&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:24644440658

VL - 3453

SP - 201

EP - 213

JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SN - 0302-9743

ER -