From data collection to analysis - Exploring regional linguistic variation in route directions by spatially-stratified web sampling

Sen Xu, Anuj Jaiswal, Xiao Zhang, Alexander Klippel, Prasenjit Mitra, Alan Maceachren

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

How does spatial language vary regionally? This study investigates the possibility of exploring regional linguistic variation in spatial language by collecting and analyzing a Spatially-strAtified Route Direction Corpus (SARD Corpus) from volunteered spatial language text on the Web. Because the World Wide Web makes content sharing fast, it has quickly become a hotbed for volunteered spatial language text, such as directions on hotels' websites. These route directions can serve as a representation of everyday spatial language usage on the WWW. The spatial coverage and abundance of this data source are appealing, but collecting and analyzing large quantities of spatially distributed data remains challenging. Through automated crawling, classification, and geo-referencing of web documents containing route directions, the SARD Corpus has been built to cover the U.S., the U.K., and Australia. We implement a semantic categorical analysis scheme to explore regional variation in the use of cardinal versus relative directions. Preliminary results show both similarities and differences at the national level and geographic patterns at the regional level. The design and implementation of a geo-referenced, large-scale corpus built from Web documents offer a methodological contribution to corpus linguistics, spatial cognition, and the GISciences.
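The cardinal-versus-relative comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis scheme: the term lists `CARDINAL` and `RELATIVE`, the function names, and the plain token matching are all assumptions made for illustration, and the record does not describe how the SARD Corpus categories are actually defined.

```python
import re
from collections import Counter, defaultdict

# Hypothetical term lists (assumptions for illustration only; the actual
# semantic categories used for the SARD Corpus are not given in this record).
CARDINAL = {"north", "south", "east", "west",
            "northeast", "northwest", "southeast", "southwest"}
RELATIVE = {"left", "right", "straight", "ahead", "behind"}


def direction_counts(text):
    """Count cardinal and relative direction terms in one route description.

    Plain token matching is a simplification: it cannot tell a spatial
    "right" ("turn right") from a non-spatial one ("right after the bridge").
    """
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in CARDINAL:
            counts["cardinal"] += 1
        elif token in RELATIVE:
            counts["relative"] += 1
    return counts


def cardinal_share_by_region(docs):
    """Return, per region, the share of direction terms that are cardinal.

    docs is an iterable of (region, text) pairs. Regions with no direction
    terms at all are left out of the result.
    """
    totals = defaultdict(Counter)
    for region, text in docs:
        totals[region].update(direction_counts(text))

    shares = {}
    for region, counts in totals.items():
        n = counts["cardinal"] + counts["relative"]
        if n:
            shares[region] = counts["cardinal"] / n
    return shares
```

Given geo-referenced documents as `(region, text)` pairs, `cardinal_share_by_region` yields one ratio per region, which could then be compared across countries or mapped at a finer regional level.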

Original language: English
Title of host publication: CEUR Workshop Proceedings
Pages: 49-52
Number of pages: 4
Volume: 620
Publication status: Published - 2010
Externally published: Yes
Event: Workshop on Computational Models of Spatial Language Interpretation at Spatial Cognition 2010, COSLI 2010 - Portland, OR, United States
Duration: 15 Aug 2010 - 15 Aug 2010

Fingerprint

Linguistics
World Wide Web
Sampling
Hotels
Websites
Semantics

Keywords

  • Cardinal directions
  • Geo-referenced web sampling
  • Regional linguistic variation
  • Spatial language analysis
  • Volunteered spatial information

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Xu, S., Jaiswal, A., Zhang, X., Klippel, A., Mitra, P., & Maceachren, A. (2010). From data collection to analysis - Exploring regional linguistic variation in route directions by spatially-stratified web sampling. In CEUR Workshop Proceedings (Vol. 620, pp. 49-52)

@inproceedings{3c3946ddb6254f6bae64f4b0292b7ab9,
title = "From data collection to analysis - Exploring regional linguistic variation in route directions by spatially-stratified web sampling",
abstract = "How does spatial language vary regionally? This study investigates the possibility of exploring regional linguistic variation in spatial language by collecting and analyzing a Spatially-strAtified Route Direction Corpus (SARD Corpus) from volunteered spatial language text on the Web. Because the World Wide Web makes content sharing fast, it has quickly become a hotbed for volunteered spatial language text, such as directions on hotels' websites. These route directions can serve as a representation of everyday spatial language usage on the WWW. The spatial coverage and abundance of this data source are appealing, but collecting and analyzing large quantities of spatially distributed data remains challenging. Through automated crawling, classification, and geo-referencing of web documents containing route directions, the SARD Corpus has been built to cover the U.S., the U.K., and Australia. We implement a semantic categorical analysis scheme to explore regional variation in the use of cardinal versus relative directions. Preliminary results show both similarities and differences at the national level and geographic patterns at the regional level. The design and implementation of a geo-referenced, large-scale corpus built from Web documents offer a methodological contribution to corpus linguistics, spatial cognition, and the GISciences.",
keywords = "Cardinal directions, Geo-referenced web sampling, Regional linguistic variation, Spatial language analysis, Volunteered spatial information",
author = "Sen Xu and Anuj Jaiswal and Xiao Zhang and Alexander Klippel and Prasenjit Mitra and Alan Maceachren",
year = "2010",
language = "English",
volume = "620",
pages = "49--52",
booktitle = "CEUR Workshop Proceedings",

}

TY - GEN

T1 - From data collection to analysis - Exploring regional linguistic variation in route directions by spatially-stratified web sampling

AU - Xu, Sen

AU - Jaiswal, Anuj

AU - Zhang, Xiao

AU - Klippel, Alexander

AU - Mitra, Prasenjit

AU - Maceachren, Alan

PY - 2010

Y1 - 2010

AB - How does spatial language vary regionally? This study investigates the possibility of exploring regional linguistic variation in spatial language by collecting and analyzing a Spatially-strAtified Route Direction Corpus (SARD Corpus) from volunteered spatial language text on the Web. Because the World Wide Web makes content sharing fast, it has quickly become a hotbed for volunteered spatial language text, such as directions on hotels' websites. These route directions can serve as a representation of everyday spatial language usage on the WWW. The spatial coverage and abundance of this data source are appealing, but collecting and analyzing large quantities of spatially distributed data remains challenging. Through automated crawling, classification, and geo-referencing of web documents containing route directions, the SARD Corpus has been built to cover the U.S., the U.K., and Australia. We implement a semantic categorical analysis scheme to explore regional variation in the use of cardinal versus relative directions. Preliminary results show both similarities and differences at the national level and geographic patterns at the regional level. The design and implementation of a geo-referenced, large-scale corpus built from Web documents offer a methodological contribution to corpus linguistics, spatial cognition, and the GISciences.

KW - Cardinal directions

KW - Geo-referenced web sampling

KW - Regional linguistic variation

KW - Spatial language analysis

KW - Volunteered spatial information

UR - http://www.scopus.com/inward/record.url?scp=84889004553&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84889004553&partnerID=8YFLogxK

M3 - Conference contribution

VL - 620

SP - 49

EP - 52

BT - CEUR Workshop Proceedings

ER -