Unsupervised extraction of false friends from parallel bi-texts using the Web as a corpus

Svetlin Nakov, Preslav Nakov, Elena Paskaleva

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

False friends are pairs of words in two languages that are perceived as similar, but have different meanings, e.g., Gift in German means poison in English. In this paper, we present several unsupervised algorithms for acquiring such pairs from a sentence-aligned bi-text. First, we try different ways of exploiting simple statistics about monolingual word occurrences and cross-lingual word co-occurrences in the bi-text. Second, using methods from statistical machine translation, we induce word alignments in an unsupervised way, from which we estimate lexical translation probabilities, which we use to measure cross-lingual semantic similarity. Third, we experiment with a semantic similarity measure that uses the Web as a corpus to extract local contexts from text snippets returned by a search engine, and a bilingual glossary of known word translation pairs, used as "bridges". Finally, all measures are combined and applied to the task of identifying likely false friends. The evaluation for Russian and Bulgarian shows a significant improvement over previously-proposed algorithms.
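
The first family of measures in the abstract relies on counts of how often words occur and co-occur across aligned sentence pairs. As a minimal sketch of that idea (not the authors' formula, which the abstract does not give), the Python below scores a candidate pair with a Dice coefficient over sentence-level counts. The function name `cooccurrence_similarity`, the choice of Dice, and the toy German-English bi-text are all assumptions made for illustration.

```python
def cooccurrence_similarity(bitext, src_word, tgt_word):
    """Dice coefficient over sentence-level counts in a sentence-aligned bi-text.

    bitext: list of (source_tokens, target_tokens) pairs.
    Illustrative stand-in only; the paper's own measures may differ.
    """
    n_src = n_tgt = n_both = 0
    for src_tokens, tgt_tokens in bitext:
        in_src = src_word in src_tokens
        in_tgt = tgt_word in tgt_tokens
        n_src += in_src            # sentence pairs whose source side has src_word
        n_tgt += in_tgt            # sentence pairs whose target side has tgt_word
        n_both += in_src and in_tgt  # sentence pairs containing both
    if n_src + n_tgt == 0:
        return 0.0
    return 2.0 * n_both / (n_src + n_tgt)


# Invented toy bi-text (lowercased, tokenized), built around the abstract's example.
bitext = [
    (["das", "gift", "wirkt"], ["the", "poison", "works"]),
    (["ein", "gift"], ["a", "poison"]),
    (["ein", "geschenk"], ["a", "gift"]),
]

translation_score = cooccurrence_similarity(bitext, "gift", "poison")
lookalike_score = cooccurrence_similarity(bitext, "gift", "gift")
```

Under this reading, a pair that looks alike orthographically but rarely co-occurs in aligned sentences (here German "gift" versus English "gift") would be flagged as a likely false friend, while a high score suggests the two words are mutual translations.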

Original language: English
Title of host publication: International Conference Recent Advances in Natural Language Processing, RANLP
Pages: 292-298
Number of pages: 7
Publication status: Published - 2009
Externally published: Yes
Event: International Conference on Recent Advances in Natural Language Processing, RANLP-2009 - Borovets, Bulgaria
Duration: 14 Sep 2009 to 16 Sep 2009

Other

Other: International Conference on Recent Advances in Natural Language Processing, RANLP-2009
Country: Bulgaria
City: Borovets
Period: 14/9/09 to 16/9/09

Fingerprint

Semantics
Glossaries
Search engines
Statistics
Experiments

Keywords

  • Cognates
  • Cross-lingual semantic similarity
  • False friends
  • Statistical machine translation
  • Web as a corpus

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Software
  • Electrical and Electronic Engineering

Cite this

Nakov, S., Nakov, P., & Paskaleva, E. (2009). Unsupervised extraction of false friends from parallel bi-texts using the Web as a corpus. In International Conference Recent Advances in Natural Language Processing, RANLP (pp. 292-298)

@inproceedings{d6bf01f4759b42a1b4dc99affdaea03e,
title = "Unsupervised extraction of false friends from parallel bi-texts using the Web as a corpus",
abstract = "False friends are pairs of words in two languages that are perceived as similar, but have different meanings, e.g., Gift in German means poison in English. In this paper, we present several unsupervised algorithms for acquiring such pairs from a sentence-aligned bi-text. First, we try different ways of exploiting simple statistics about monolingual word occurrences and cross-lingual word co-occurrences in the bi-text. Second, using methods from statistical machine translation, we induce word alignments in an unsupervised way, from which we estimate lexical translation probabilities, which we use to measure cross-lingual semantic similarity. Third, we experiment with a semantic similarity measure that uses the Web as a corpus to extract local contexts from text snippets returned by a search engine, and a bilingual glossary of known word translation pairs, used as {"}bridges{"}. Finally, all measures are combined and applied to the task of identifying likely false friends. The evaluation for Russian and Bulgarian shows a significant improvement over previously-proposed algorithms.",
keywords = "Cognates, Cross-lingual semantic similarity, False friends, Statistical machine translation, Web as a corpus",
author = "Svetlin Nakov and Preslav Nakov and Elena Paskaleva",
year = "2009",
language = "English",
pages = "292--298",
booktitle = "International Conference Recent Advances in Natural Language Processing, RANLP",

}

TY - GEN

T1 - Unsupervised extraction of false friends from parallel bi-texts using the Web as a corpus

AU - Nakov, Svetlin

AU - Nakov, Preslav

AU - Paskaleva, Elena

PY - 2009

Y1 - 2009

AB - False friends are pairs of words in two languages that are perceived as similar, but have different meanings, e.g., Gift in German means poison in English. In this paper, we present several unsupervised algorithms for acquiring such pairs from a sentence-aligned bi-text. First, we try different ways of exploiting simple statistics about monolingual word occurrences and cross-lingual word co-occurrences in the bi-text. Second, using methods from statistical machine translation, we induce word alignments in an unsupervised way, from which we estimate lexical translation probabilities, which we use to measure cross-lingual semantic similarity. Third, we experiment with a semantic similarity measure that uses the Web as a corpus to extract local contexts from text snippets returned by a search engine, and a bilingual glossary of known word translation pairs, used as "bridges". Finally, all measures are combined and applied to the task of identifying likely false friends. The evaluation for Russian and Bulgarian shows a significant improvement over previously-proposed algorithms.

KW - Cognates

KW - Cross-lingual semantic similarity

KW - False friends

KW - Statistical machine translation

KW - Web as a corpus

UR - http://www.scopus.com/inward/record.url?scp=84866883076&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84866883076&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84866883076

SP - 292

EP - 298

BT - International Conference Recent Advances in Natural Language Processing, RANLP

ER -