Automated gathering of Web information: An in-depth examination of agents interacting with search engines

Bernard Jansen, Tracy Mullen, Amanda Spink, Jan Pedersen

Research output: Contribution to journal › Article

12 Citations (Scopus)

Abstract

The Web has become a worldwide repository of information which individuals, companies, and organizations utilize to solve or address various information problems. Many of these Web users utilize automated agents to gather this information for them. Some assume that this approach represents a more sophisticated method of searching. However, there is little research investigating how Web agents search for online information. In this research, we first provide a classification for information agents using stages of information gathering, gathering approaches, and agent architecture. We then examine an implementation of one of the resulting classifications in detail, investigating how agents search for information on Web search engines, including the session, query, term, duration and frequency of interactions. For this temporal study, we analyzed three data sets of queries and page views from agents interacting with the Excite and AltaVista search engines from 1997 to 2002, examining approximately 900,000 queries submitted by over 3,000 agents. Findings include: (1) agent sessions are extremely interactive, with sometimes hundreds of interactions per second, (2) agent queries are comparable to those of human searchers, with little use of query operators, (3) Web agents are searching for a relatively limited variety of information, wherein only 18% of the terms used are unique, and (4) the duration of agent-Web search engine interaction typically spans several hours. We discuss the implications for Web information agents and search engines.

Original language: English
Pages (from-to): 442-464
Number of pages: 23
Journal: ACM Transactions on Internet Technology
Volume: 6
Issue number: 4
DOI: 10.1145/1183463.1183468
Publication status: Published - 2006
Externally published: Yes

Keywords

  • Agent searching
  • Search engines
  • Web searching

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Automated gathering of Web information: An in-depth examination of agents interacting with search engines. / Jansen, Bernard; Mullen, Tracy; Spink, Amanda; Pedersen, Jan.

In: ACM Transactions on Internet Technology, Vol. 6, No. 4, 2006, p. 442-464.

@article{40755c52432748bfa5d90764bba425fc,
title = "Automated gathering of Web information: An in-depth examination of agents interacting with search engines",
abstract = "The Web has become a worldwide repository of information which individuals, companies, and organizations utilize to solve or address various information problems. Many of these Web users utilize automated agents to gather this information for them. Some assume that this approach represents a more sophisticated method of searching. However, there is little research investigating how Web agents search for online information. In this research, we first provide a classification for information agents using stages of information gathering, gathering approaches, and agent architecture. We then examine an implementation of one of the resulting classifications in detail, investigating how agents search for information on Web search engines, including the session, query, term, duration and frequency of interactions. For this temporal study, we analyzed three data sets of queries and page views from agents interacting with the Excite and AltaVista search engines from 1997 to 2002, examining approximately 900,000 queries submitted by over 3,000 agents. Findings include: (1) agent sessions are extremely interactive, with sometimes hundreds of interactions per second, (2) agent queries are comparable to those of human searchers, with little use of query operators, (3) Web agents are searching for a relatively limited variety of information, wherein only 18{\%} of the terms used are unique, and (4) the duration of agent-Web search engine interaction typically spans several hours. We discuss the implications for Web information agents and search engines.",
keywords = "Agent searching, Search engines, Web searching",
author = "Bernard Jansen and Tracy Mullen and Amanda Spink and Jan Pedersen",
year = "2006",
doi = "10.1145/1183463.1183468",
language = "English",
volume = "6",
pages = "442--464",
journal = "ACM Transactions on Internet Technology",
issn = "1533-5399",
publisher = "Association for Computing Machinery (ACM)",
number = "4",

}

TY - JOUR

T1 - Automated gathering of Web information

T2 - An in-depth examination of agents interacting with search engines

AU - Jansen, Bernard

AU - Mullen, Tracy

AU - Spink, Amanda

AU - Pedersen, Jan

PY - 2006

Y1 - 2006

N2 - The Web has become a worldwide repository of information which individuals, companies, and organizations utilize to solve or address various information problems. Many of these Web users utilize automated agents to gather this information for them. Some assume that this approach represents a more sophisticated method of searching. However, there is little research investigating how Web agents search for online information. In this research, we first provide a classification for information agents using stages of information gathering, gathering approaches, and agent architecture. We then examine an implementation of one of the resulting classifications in detail, investigating how agents search for information on Web search engines, including the session, query, term, duration and frequency of interactions. For this temporal study, we analyzed three data sets of queries and page views from agents interacting with the Excite and AltaVista search engines from 1997 to 2002, examining approximately 900,000 queries submitted by over 3,000 agents. Findings include: (1) agent sessions are extremely interactive, with sometimes hundreds of interactions per second, (2) agent queries are comparable to those of human searchers, with little use of query operators, (3) Web agents are searching for a relatively limited variety of information, wherein only 18% of the terms used are unique, and (4) the duration of agent-Web search engine interaction typically spans several hours. We discuss the implications for Web information agents and search engines.

KW - Agent searching

KW - Search engines

KW - Web searching

UR - http://www.scopus.com/inward/record.url?scp=33845463355&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33845463355&partnerID=8YFLogxK

U2 - 10.1145/1183463.1183468

DO - 10.1145/1183463.1183468

M3 - Article

AN - SCOPUS:33845463355

VL - 6

SP - 442

EP - 464

JO - ACM Transactions on Internet Technology

JF - ACM Transactions on Internet Technology

SN - 1533-5399

IS - 4

ER -