An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification

Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón, Josef Van Genabith

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present paper is twofold. First, we systematically study the neural machine translation (NMT) context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 = 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9%.
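The extrinsic task described in the abstract can be sketched roughly as follows: pool the per-token encoder context vectors into a fixed-size sentence vector, then flag a source/target pair as parallel when the vectors are close in cosine space. The mean-pooling, the 0.8 threshold, and all function names here are illustrative assumptions for a minimal sketch, not the paper's actual model or decision rule (the paper additionally combines context vectors with other similarity measures).

```python
import numpy as np

def sentence_vector(context_vectors: np.ndarray) -> np.ndarray:
    # Mean-pool the per-token encoder outputs (shape T x d) into one
    # fixed-size sentence vector; one simple interlingua-style embedding.
    return context_vectors.mean(axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_parallel(src_ctx: np.ndarray, tgt_ctx: np.ndarray,
                threshold: float = 0.8) -> bool:
    # Flag a sentence pair as parallel when the pooled encoder
    # representations of the two sides are sufficiently similar.
    return cosine(sentence_vector(src_ctx), sentence_vector(tgt_ctx)) >= threshold
```

In practice the threshold would be tuned on held-out labeled pairs, since the useful similarity range depends on the encoder and the language pair.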

Original language: English
Article number: 8070942
Pages (from-to): 1340-1350
Number of pages: 11
Journal: IEEE Journal on Selected Topics in Signal Processing
Volume: 11
Issue number: 8
DOI: 10.1109/JSTSP.2017.2764273
Publication status: Published - 1 Dec 2017


Keywords

  • Learning
  • natural language processing
  • neural networks

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification. / España-Bonet, Cristina; Varga, Ádám Csaba; Barrón, Alberto; Van Genabith, Josef.

In: IEEE Journal on Selected Topics in Signal Processing, Vol. 11, No. 8, 8070942, 01.12.2017, p. 1340-1350.

Research output: Contribution to journal › Article

@article{a97b10c074ae440cbf51a2171a06e5df,
title = "An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification",
abstract = "End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present paper is twofold. First, we systematically study the neural machine translation (NMT) context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 = 98.2{\%} on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9{\%}.",
keywords = "Learning, natural language processing, neural networks",
author = "Cristina Espa{\~n}a-Bonet and Varga, {{\'A}d{\'a}m Csaba} and Alberto Barr{\'o}n and {Van Genabith}, Josef",
year = "2017",
month = "12",
day = "1",
doi = "10.1109/JSTSP.2017.2764273",
language = "English",
volume = "11",
pages = "1340--1350",
journal = "IEEE Journal on Selected Topics in Signal Processing",
issn = "1932-4553",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "8",

}

TY - JOUR

T1 - An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification

AU - España-Bonet, Cristina

AU - Varga, Ádám Csaba

AU - Barrón, Alberto

AU - Van Genabith, Josef

PY - 2017/12/1

Y1 - 2017/12/1

N2 - End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present paper is twofold. First, we systematically study the neural machine translation (NMT) context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 = 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9%.

AB - End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present paper is twofold. First, we systematically study the neural machine translation (NMT) context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 = 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9%.

KW - Learning

KW - natural language processing

KW - neural networks

UR - http://www.scopus.com/inward/record.url?scp=85032280575&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85032280575&partnerID=8YFLogxK

U2 - 10.1109/JSTSP.2017.2764273

DO - 10.1109/JSTSP.2017.2764273

M3 - Article

AN - SCOPUS:85032280575

VL - 11

SP - 1340

EP - 1350

JO - IEEE Journal on Selected Topics in Signal Processing

JF - IEEE Journal on Selected Topics in Signal Processing

SN - 1932-4553

IS - 8

M1 - 8070942

ER -