A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles

Alberto Barrón-Cedeño, Monica Lestari Paramita, Paul Clough, Paolo Rosso

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

8 Citations (Scopus)

Abstract

Wikipedia has been used as a source of comparable texts for a range of tasks, such as Statistical Machine Translation and Cross-Language Information Retrieval. Articles written in different languages on the same topic are often connected through inter-language-links. However, the extent to which these articles are similar is highly variable and this may impact on the use of Wikipedia as a comparable resource. In this paper we compare various language-independent methods for measuring cross-lingual similarity: character n-grams, cognateness, word count ratio, and an approach based on outlinks. These approaches are compared against a baseline utilising MT resources. Measures are also compared to human judgements of similarity using a manually created resource containing 700 pairs of Wikipedia articles (in 7 language pairs). Results indicate that a combination of language-independent models (char-n-grams, outlinks and word-count ratio) is highly effective for identifying cross-lingual similarity and performs comparably to language-dependent models (translation and monolingual analysis).
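As a rough illustration of three of the language-independent signals named in the abstract (character n-grams, word-count ratio, and outlinks), the following minimal Python sketch may help. It is not the authors' implementation: the trigram size, the cosine and Jaccard formulations, the function names, and the unweighted mean used to combine the scores are all assumptions made for this example. The outlink measure further assumes that both link sets have already been mapped to a shared identifier space (for instance via inter-language links).

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and collapsing non-word characters."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_count_ratio(text_a, text_b):
    """Length of the shorter text divided by the longer, in whitespace tokens."""
    len_a, len_b = len(text_a.split()), len(text_b.split())
    longer = max(len_a, len_b)
    return min(len_a, len_b) / longer if longer else 0.0


def outlink_overlap(links_a, links_b):
    """Jaccard overlap of two outlink sets in a shared identifier space (assumed)."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_score(text_a, text_b, links_a, links_b, n=3):
    """Unweighted mean of the three scores; the weighting is an illustrative choice."""
    scores = [
        cosine(char_ngrams(text_a, n), char_ngrams(text_b, n)),
        word_count_ratio(text_a, text_b),
        outlink_overlap(links_a, links_b),
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical article snippets and link identifiers, for demonstration only.
    english = "Amsterdam is the capital of the Netherlands."
    spanish = "Amsterdam es la capital de los Países Bajos."
    print(combined_score(english, spanish, {"Netherlands", "Capital"}, {"Netherlands", "Europe"}))
```

Character n-grams of this kind only carry a signal across languages that share a script and some vocabulary, which is one reason a combination with length-based and link-based evidence is plausible.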

Original language: English
Title of host publication: Advances in Information Retrieval - 36th European Conference on IR Research, ECIR 2014, Proceedings
Publisher: Springer Verlag
Pages: 424-429
Number of pages: 6
ISBN (Print): 9783319060279
DOIs: https://doi.org/10.1007/978-3-319-06028-6_36
Publication status: Published - 1 Jan 2014
Event: 36th European Conference on Information Retrieval, ECIR 2014 - Amsterdam, Netherlands
Duration: 13 Apr 2014 - 16 Apr 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8416 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 36th European Conference on Information Retrieval, ECIR 2014
Country: Netherlands
City: Amsterdam
Period: 13/4/14 - 16/4/14

Keywords

  • Cross-Lingual Similarity
  • Wikipedia

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Barrón-Cedeño, A., Paramita, M. L., Clough, P., & Rosso, P. (2014). A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles. In Advances in Information Retrieval - 36th European Conference on IR Research, ECIR 2014, Proceedings (pp. 424-429). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8416 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-06028-6_36