Analyzing phonetic and graphemic representations in end-to-end automatic speech recognition

Yonatan Belinkov, Ahmed Ali, James Glass

Research output: Contribution to journal › Conference article

Abstract

End-to-end neural network systems for automatic speech recognition (ASR) are trained from acoustic features to text transcriptions. In contrast to modular ASR systems, which contain separately-trained components for acoustic modeling, pronunciation lexicon, and language modeling, the end-to-end paradigm is both conceptually simpler and has the potential benefit of training the entire system on the end task. However, such neural network models are more opaque: it is not clear how to interpret the role of different parts of the network and what information it learns during training. In this paper, we analyze the learned internal representations in an end-to-end ASR model. We evaluate the representation quality in terms of several classification tasks, comparing phonemes and graphemes, as well as different articulatory features. We study two languages (English and Arabic) and three datasets, finding remarkable consistency in how different properties are represented in different layers of the deep neural network.
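The probing methodology the abstract outlines — training a lightweight classifier on frozen internal representations and reading its accuracy as a measure of what a given layer encodes — can be sketched as follows. This is a synthetic illustration only, not the paper's implementation: the `probe_accuracy` function, the data, and the "layer" names are invented for demonstration, and the real setup would use frame-level hidden states from a trained ASR encoder with held-out evaluation.

```python
# Hypothetical sketch of layer-wise probing: fit a linear (softmax)
# classifier on frozen hidden activations to predict frame-level
# phoneme labels. All data below is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)

def probe_accuracy(activations, labels, n_classes, epochs=200, lr=0.5):
    """Fit a multinomial logistic probe by gradient descent; return accuracy."""
    n, d = activations.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = activations @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # cross-entropy gradient
        W -= lr * activations.T @ grad
        b -= lr * grad.sum(axis=0)
    preds = (activations @ W + b).argmax(axis=1)
    return float((preds == labels).mean())

# Synthetic stand-ins for hidden states from two encoder layers: one
# carries the class signal, the other is pure noise, so the probe's
# accuracy separates them — the comparison the paper runs across layers.
labels = rng.integers(0, 4, size=400)
informative_layer = np.eye(4)[labels] + 0.3 * rng.standard_normal((400, 4))
noise_layer = rng.standard_normal((400, 4))

acc_informative = probe_accuracy(informative_layer, labels, n_classes=4)
acc_noise = probe_accuracy(noise_layer, labels, n_classes=4)
print(f"informative layer: {acc_informative:.2f}, noise layer: {acc_noise:.2f}")
```

A higher probe accuracy on one layer's activations is read as evidence that the probed property (here, a phoneme-like label) is more linearly accessible in that layer.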

Original language: English
Pages (from-to): 81-85
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2019-September
DOI: 10.21437/Interspeech.2019-2599
Publication status: Published - 1 Jan 2019
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: 15 Sep 2019 – 19 Sep 2019

Keywords

  • Analysis
  • End-to-end
  • Graphemes
  • Interpretability
  • Phonemes
  • Speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Analyzing phonetic and graphemic representations in end-to-end automatic speech recognition. / Belinkov, Yonatan; Ali, Ahmed; Glass, James.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2019-September, 01.01.2019, p. 81-85.

Research output: Contribution to journal › Conference article

@article{10c7fac1978e478bab54c4df67b70f83,
title = "Analyzing phonetic and graphemic representations in end-to-end automatic speech recognition",
abstract = "End-to-end neural network systems for automatic speech recognition (ASR) are trained from acoustic features to text transcriptions. In contrast to modular ASR systems, which contain separately-trained components for acoustic modeling, pronunciation lexicon, and language modeling, the end-to-end paradigm is both conceptually simpler and has the potential benefit of training the entire system on the end task. However, such neural network models are more opaque: it is not clear how to interpret the role of different parts of the network and what information it learns during training. In this paper, we analyze the learned internal representations in an end-to-end ASR model. We evaluate the representation quality in terms of several classification tasks, comparing phonemes and graphemes, as well as different articulatory features. We study two languages (English and Arabic) and three datasets, finding remarkable consistency in how different properties are represented in different layers of the deep neural network.",
keywords = "Analysis, End-to-end, Graphemes, Interpretability, Phonemes, Speech recognition",
author = "Yonatan Belinkov and Ahmed Ali and James Glass",
year = "2019",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2019-2599",
language = "English",
volume = "2019-September",
pages = "81--85",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}

TY - JOUR

T1 - Analyzing phonetic and graphemic representations in end-to-end automatic speech recognition

AU - Belinkov, Yonatan

AU - Ali, Ahmed

AU - Glass, James

PY - 2019/1/1

Y1 - 2019/1/1

N2 - End-to-end neural network systems for automatic speech recognition (ASR) are trained from acoustic features to text transcriptions. In contrast to modular ASR systems, which contain separately-trained components for acoustic modeling, pronunciation lexicon, and language modeling, the end-to-end paradigm is both conceptually simpler and has the potential benefit of training the entire system on the end task. However, such neural network models are more opaque: it is not clear how to interpret the role of different parts of the network and what information it learns during training. In this paper, we analyze the learned internal representations in an end-to-end ASR model. We evaluate the representation quality in terms of several classification tasks, comparing phonemes and graphemes, as well as different articulatory features. We study two languages (English and Arabic) and three datasets, finding remarkable consistency in how different properties are represented in different layers of the deep neural network.

AB - End-to-end neural network systems for automatic speech recognition (ASR) are trained from acoustic features to text transcriptions. In contrast to modular ASR systems, which contain separately-trained components for acoustic modeling, pronunciation lexicon, and language modeling, the end-to-end paradigm is both conceptually simpler and has the potential benefit of training the entire system on the end task. However, such neural network models are more opaque: it is not clear how to interpret the role of different parts of the network and what information it learns during training. In this paper, we analyze the learned internal representations in an end-to-end ASR model. We evaluate the representation quality in terms of several classification tasks, comparing phonemes and graphemes, as well as different articulatory features. We study two languages (English and Arabic) and three datasets, finding remarkable consistency in how different properties are represented in different layers of the deep neural network.

KW - Analysis

KW - End-to-end

KW - Graphemes

KW - Interpretability

KW - Phonemes

KW - Speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85074733942&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85074733942&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2019-2599

DO - 10.21437/Interspeech.2019-2599

M3 - Conference article

AN - SCOPUS:85074733942

VL - 2019-September

SP - 81

EP - 85

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -