A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles

Alberto Barron, Monica Lestari Paramita, Paul Clough, Paolo Rosso

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

8 Citations (Scopus)

Abstract

Wikipedia has been used as a source of comparable texts for a range of tasks, such as Statistical Machine Translation and Cross-Language Information Retrieval. Articles written in different languages on the same topic are often connected through inter-language links. However, the extent to which these articles are similar is highly variable, and this may affect the use of Wikipedia as a comparable resource. In this paper we compare several language-independent methods for measuring cross-lingual similarity: character n-grams, cognateness, word count ratio, and an approach based on outlinks. These approaches are compared against a baseline that uses machine translation (MT) resources. The measures are also compared to human judgements of similarity using a manually created resource containing 700 pairs of Wikipedia articles (in 7 language pairs). Results indicate that a combination of language-independent models (character n-grams, outlinks and word count ratio) is highly effective for identifying cross-lingual similarity and performs comparably to language-dependent models (translation and monolingual analysis).
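
As a rough illustration of two of the language-independent measures named in the abstract, the sketch below computes a character n-gram cosine similarity and a word count ratio for a pair of texts. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact formulations, so the n-gram length (3), the lowercasing and whitespace normalisation, the use of raw counts with cosine similarity, and the min/max form of the ratio are all assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (assumed n=3) after
    lowercasing and collapsing runs of whitespace."""
    normalised = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # at least one text is shorter than n characters
    return dot / (norm_a * norm_b)


def word_count_ratio(text_a: str, text_b: str) -> float:
    """Shorter length over longer length, in whitespace-separated tokens
    (an assumed formulation; 1.0 means equal length)."""
    len_a, len_b = len(text_a.split()), len(text_b.split())
    if max(len_a, len_b) == 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)


if __name__ == "__main__":
    english = "international information retrieval"
    spanish = "recuperación de información internacional"
    print(cosine(char_ngrams(english), char_ngrams(spanish)))
    print(word_count_ratio(english, spanish))
```

Character n-grams can pick up shared substrings between related languages without any translation step, which is what makes the measure language-independent; how the individual scores are combined is not described in the abstract and is left out here.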

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Verlag
Pages: 424-429
Number of pages: 6
Volume: 8416 LNCS
ISBN (Print): 9783319060279
DOI: https://doi.org/10.1007/978-3-319-06028-6_36
Publication status: Published - 2014
Externally published: Yes
Event: 36th European Conference on Information Retrieval, ECIR 2014 - Amsterdam
Duration: 13 Apr 2014 to 16 Apr 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8416 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Keywords

  • Cross-Lingual Similarity
  • Wikipedia

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Barron, A., Paramita, M. L., Clough, P., & Rosso, P. (2014). A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8416 LNCS, pp. 424-429). Springer Verlag. https://doi.org/10.1007/978-3-319-06028-6_36

@inproceedings{2170913abc3545d599eaa9f879c7b4dc,
title = "A comparison of approaches for measuring cross-lingual similarity of wikipedia articles",
keywords = "Cross-Lingual Similarity, Wikipedia",
author = "Alberto Barron and Paramita, {Monica Lestari} and Paul Clough and Paolo Rosso",
year = "2014",
doi = "10.1007/978-3-319-06028-6_36",
language = "English",
isbn = "9783319060279",
volume = "8416 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "424--429",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}
