Word segmentation and pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment

Felix Stahlberg, Tim Schlippe, Stephan Vogel, Tanja Schultz

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

In this paper, we study methods to discover words and extract their pronunciations from audio data for non-written and under-resourced languages. We examine the potential and the challenges of pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment. In our scenario a human translator produces utterances in the (non-written) target language from prompts in a resource-rich source language. We add the resource-rich source language prompts to help the word discovery and pronunciation extraction process. By aligning the source language words to the target language phonemes, we segment the phoneme sequences into word-like chunks. The resulting chunks are interpreted as putative word pronunciations but are very prone to alignment and phoneme recognition errors. Thus we suggest our alignment model Model 3P that is particularly designed for cross-lingual word-to-phoneme alignment. We present two different methods (source word dependent and independent clustering) that extract word pronunciations from word-to-phoneme alignments and compare them. We show that both methods compensate for phoneme recognition and alignment errors. We also extract a parallel corpus consisting of 15 different translations in 10 languages from the Christian Bible to evaluate our alignment model and error recovery methods. For example, based on noisy target language phoneme sequences with 45.1% errors, we build a dictionary for an English Bible with a Spanish Bible translation with 4.5% OOV rate, where 64% of the extracted pronunciations contain no more than one wrong phoneme. Finally, we use the extracted pronunciations in an automatic speech recognition system for the target language and report promising word error rates - given that pronunciation dictionary and language model are learned completely unsupervised and no written form for the target language is required for our approach.
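The core pipeline described above — align source-language words to target-language phonemes, segment the phoneme stream into word-like chunks, then cluster chunks per source word to recover pronunciations despite noise — can be illustrated with a deliberately simplified sketch. This is not the paper's Model 3P (an alignment model in its own right); the alignment here is assumed to be given as a per-phoneme word index, and the "source word dependent clustering" is reduced to a majority vote. All names and the toy Spanish/phoneme data are illustrative assumptions.

```python
from collections import Counter, defaultdict

def segment_phonemes(phonemes, alignment):
    """Group target-language phonemes into word-like chunks.

    `alignment` maps each phoneme position to the index of the source
    word it is aligned to -- a given, many-phonemes-per-word alignment,
    a strong simplification of the paper's Model 3P.
    """
    chunks = defaultdict(list)
    for pos, word_idx in enumerate(alignment):
        chunks[word_idx].append(phonemes[pos])
    return {w: tuple(p) for w, p in chunks.items()}

def extract_pronunciations(segmented_utterances):
    """Toy source-word-dependent clustering: for each source word,
    take the most frequent chunk across utterances as its pronunciation,
    smoothing over phoneme recognition and alignment errors."""
    votes = defaultdict(Counter)
    for chunks in segmented_utterances:
        for word_idx, chunk in chunks.items():
            votes[word_idx][chunk] += 1
    return {w: c.most_common(1)[0][0] for w, c in votes.items()}

# Toy data: three noisy phoneme decodings of the same two-word prompt;
# the second utterance misrecognizes /x/ as /h/.
utt1 = segment_phonemes(["k","a","s","a","r","o","x","a"],
                        [0, 0, 0, 0, 1, 1, 1, 1])
utt2 = segment_phonemes(["k","a","s","a","r","o","h","a"],
                        [0, 0, 0, 0, 1, 1, 1, 1])
utt3 = segment_phonemes(["k","a","s","a","r","o","x","a"],
                        [0, 0, 0, 0, 1, 1, 1, 1])
lexicon = extract_pronunciations([utt1, utt2, utt3])
# The majority vote discards the noisy /h/ variant for word 1.
```

The vote is only a stand-in for the paper's clustering methods, but it shows the error-recovery idea: individual chunks are unreliable, while agreement across repeated alignments of the same source word is much more robust.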

Original language: English
Pages (from-to): 234-261
Number of pages: 28
Journal: Computer Speech and Language
Volume: 35
DOI: 10.1016/j.csl.2014.10.001
Publication status: Published - 1 Jan 2016

Keywords

  • Lexical language discovery
  • Non-written languages
  • Pronunciation dictionary
  • Speech-to-speech translation
  • Under-resourced languages
  • Word segmentation

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Theoretical Computer Science

Cite this

Word segmentation and pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment. / Stahlberg, Felix; Schlippe, Tim; Vogel, Stephan; Schultz, Tanja.

In: Computer Speech and Language, Vol. 35, 01.01.2016, p. 234-261.
@article{2f981e2579f94110b1c51247be3cbc84,
title = "Word segmentation and pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment",
author = "Felix Stahlberg and Tim Schlippe and Stephan Vogel and Tanja Schultz",
year = "2016",
month = "1",
day = "1",
doi = "10.1016/j.csl.2014.10.001",
language = "English",
volume = "35",
pages = "234--261",
journal = "Computer Speech and Language",
issn = "0885-2308",
publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Word segmentation and pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment

AU - Stahlberg, Felix

AU - Schlippe, Tim

AU - Vogel, Stephan

AU - Schultz, Tanja

PY - 2016/1/1

Y1 - 2016/1/1



UR - http://www.scopus.com/inward/record.url?scp=84942532197&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84942532197&partnerID=8YFLogxK

U2 - 10.1016/j.csl.2014.10.001

DO - 10.1016/j.csl.2014.10.001

M3 - Article

AN - SCOPUS:84942532197

VL - 35

SP - 234

EP - 261

JO - Computer Speech and Language

JF - Computer Speech and Language

SN - 0885-2308

ER -