The Web is a massive and interlinked collection of documents, built using a decentralized design to encourage the participation of many authors who publish information through a huge number of Web sites. Its characteristics are the result of the interaction between many organizations and individuals, and those interactions generate a large amount of diversity. This diversity means that several different topics are represented on the Web, and at the same time that the overall quality of pages and Web sites is very variable. The Web is very dynamic and is growing at a very fast pace, and even when some of its properties have been studied, there are several characteristics of it that are still not fully known. This article reports the results of an in-depth study over a large collection of Web pSSages. On September and October 2004 we downloaded more than 16 million Web pages from about 300,000 Web sites from the Web of Spain. We show the characteristics of this collection at three different granularity levels: Web pages, sites and domains. For each level, we analyze contents, links, and technologies, and present statistics and models. We found that some of the characteristics of this collection resemble the ones of the Web at large, while others are specific to the Web of Spain, or have not been studied in the past.
|Publication status||Published - 8 Dec 2005|
ASJC Scopus subject areas
- Library and Information Sciences