Abstract
End-to-end neural network systems for automatic speech recognition (ASR) are trained directly from acoustic features to text transcriptions. In contrast to modular ASR systems, which contain separately trained components for acoustic modeling, the pronunciation lexicon, and language modeling, the end-to-end paradigm is conceptually simpler and has the potential benefit of training the entire system on the end task. However, such neural network models are more opaque: it is not clear how to interpret the role of different parts of the network or what information they learn during training. In this paper, we analyze the learned internal representations in an end-to-end ASR model. We evaluate the representation quality in terms of several classification tasks, comparing phonemes and graphemes, as well as different articulatory features. We study two languages (English and Arabic) and three datasets, finding remarkable consistency in how different properties are represented in different layers of the deep neural network.
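The "classification tasks" mentioned above refer to the common probing-classifier approach: frame-level activations from an internal layer are used as features to predict a linguistic property (e.g. phoneme identity), and the probe's accuracy measures how strongly that layer encodes the property. A minimal sketch, using synthetic stand-in activations rather than the paper's actual model features or setup:

```python
# Probing-classifier sketch: train a simple classifier to predict phoneme
# labels from layer activations. Higher probe accuracy suggests the layer
# encodes more phonetic information. All data below is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_frames, dim, n_phonemes = 2000, 64, 10

# Hypothetical stand-in for hidden-layer activations: each phoneme class
# is assigned a mean vector, so the probe has recoverable signal.
labels = rng.integers(0, n_phonemes, size=n_frames)
means = rng.normal(size=(n_phonemes, dim))
reps = means[labels] + 0.5 * rng.normal(size=(n_frames, dim))

X_train, X_test, y_train, y_test = train_test_split(
    reps, labels, test_size=0.2, random_state=0)

# A linear probe keeps the classifier weak on purpose, so accuracy
# reflects what the representation encodes, not probe capacity.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = probe.score(X_test, y_test)
print(f"probe accuracy: {acc:.2f}")
```

Repeating this per layer and per property (phonemes, graphemes, articulatory features) yields the kind of layer-wise comparison the abstract describes.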
Original language | English |
---|---|
Pages (from-to) | 81-85 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2019-September |
DOIs | 10.21437/Interspeech.2019-2599 |
Publication status | Published - 1 Jan 2019 |
Event | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019, Graz, Austria, 15 Sep 2019 → 19 Sep 2019 |
Keywords
- Analysis
- End-to-end
- Graphemes
- Interpretability
- Phonemes
- Speech recognition
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modelling and Simulation
Cite this
Analyzing phonetic and graphemic representations in end-to-end automatic speech recognition. / Belinkov, Yonatan; Ali, Ahmed; Glass, James.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2019-September, 01.01.2019, p. 81-85. Research output: Contribution to journal › Conference article
TY - JOUR
T1 - Analyzing phonetic and graphemic representations in end-to-end automatic speech recognition
AU - Belinkov, Yonatan
AU - Ali, Ahmed
AU - Glass, James
PY - 2019/1/1
Y1 - 2019/1/1
KW - Analysis
KW - End-to-end
KW - Graphemes
KW - Interpretability
KW - Phonemes
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85074733942&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074733942&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2019-2599
DO - 10.21437/Interspeech.2019-2599
M3 - Conference article
AN - SCOPUS:85074733942
VL - 2019-September
SP - 81
EP - 85
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
ER -