Language variation as a context for information retrieval

Ahmed Abdelali, Jim Cowie, Hamdy S. Soliman

Research output: Contribution to journal › Conference article

Abstract

Speakers of widespread languages may encounter problems in information retrieval and document understanding when they access documents in the same language from another country. The work described here focuses on the development of resources to support improved document retrieval and understanding by users of Modern Standard Arabic (MSA). The lexicon of an Egyptian Arabic speaker and the lexicon of an Algerian Arabic speaker overlap, but there are many lexical tokens which are not shared, or which mean different things to the two speakers. These differences give us a context for information retrieval which can improve retrieval performance and also enhance document understanding after retrieval. The availability of a suitable corpus is key to much objective research. In this paper we present the results of experiments in building a corpus for MSA using data available on the World Wide Web. We selected samples of newspapers published online in different Arab countries. We demonstrate the completeness and the representativeness of this corpus using standard metrics and show its suitability for language engineering experiments. The results of the experiments show that it is possible to link an Arabic document to a specific region based on information induced from its vocabulary.

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 151
Publication status: Published - 1 Dec 2005
Event: Workshop on Context-Based Information Retrieval, CIR 2005, in Conjunction with CONTEXT 2005 - Paris, France
Duration: 5 Jul 2005 → 5 Jul 2005

ASJC Scopus subject areas

  • Computer Science (all)
