Crowdsource a little to label a lot

Labeling a speech corpus of dialectal Arabic

Samantha Wray, Ahmed Ali

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Citations (Scopus)

Abstract

Arabic is a language with great dialectal variety, with Modern Standard Arabic (MSA) being the only standardized dialect. Spoken Arabic is characterized by frequent code-switching between MSA and Dialectal Arabic (DA). DA varieties are typically differentiated by region, but despite their widespread usage, they are under-resourced and lack the viable corpora and tools necessary for speech recognition and natural language processing. Existing DA speech corpora are limited in scope, consisting mainly of telephone conversations and scripted speech. In this paper, we describe our use of crowdsourcing to create a labeled multi-dialectal speech corpus. We obtained utterance-level dialect labels for 57 hours of high-quality audio from Al Jazeera consisting of four major varieties of DA: Egyptian, Levantine, Gulf, and North African. Using speaker linking to identify utterances spoken by the same speaker, and measures of the likelihood that a label is accurate based on annotator behavior, we automatically labeled an additional 94 hours. The complete corpus contains 850 hours with approximately 18% DA speech.
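
The abstract does not spell out how speaker linking and annotator-behavior measures were combined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each utterance has already been assigned a speaker ID by some speaker-linking step, that each annotator has a reliability score between 0 and 1, and that a crowd label is propagated to a speaker's unlabeled utterances only when the reliability-weighted vote for one dialect is strong enough. All names (`propagate_labels`, `speaker_of`, `annotator_accuracy`, `min_confidence`) and the 0.8 threshold are hypothetical.

```python
from collections import defaultdict


def propagate_labels(utterances, speaker_of, crowd_labels,
                     annotator_accuracy, min_confidence=0.8):
    """Propagate crowd dialect labels to unlabeled utterances of the same speaker.

    utterances: iterable of utterance IDs
    speaker_of: dict mapping utterance ID -> speaker ID (from speaker linking)
    crowd_labels: dict mapping utterance ID -> list of (annotator, dialect)
    annotator_accuracy: dict mapping annotator -> reliability weight in [0, 1]
    min_confidence: hypothetical threshold on the winning dialect's vote share
    """
    # Accumulate reliability-weighted votes per speaker and dialect.
    votes = defaultdict(lambda: defaultdict(float))
    for utt_id, judgments in crowd_labels.items():
        speaker = speaker_of[utt_id]
        for annotator, dialect in judgments:
            # Unknown annotators get a neutral weight of 0.5 (an assumption).
            votes[speaker][dialect] += annotator_accuracy.get(annotator, 0.5)

    propagated = {}
    for utt_id in utterances:
        if utt_id in crowd_labels:
            continue  # already labeled by the crowd
        speaker_votes = votes.get(speaker_of[utt_id])
        if not speaker_votes:
            continue  # no labeled utterance from this speaker
        best, weight = max(speaker_votes.items(), key=lambda kv: kv[1])
        confidence = weight / sum(speaker_votes.values())
        if confidence >= min_confidence:
            propagated[utt_id] = best
    return propagated
```

In this sketch, a speaker whose labeled utterances all received the same dialect from the crowd passes the label on to the rest of that speaker's utterances, while a speaker with conflicting judgments is left unlabeled. This is one simple way a small amount of crowdsourced labeling could extend to a larger amount of audio.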

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech and Communication Association
Pages: 2824-2828
Number of pages: 5
Volume: 2015-January
Publication status: Published - 2015
Event: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration: 6 Sep 2015 – 10 Sep 2015

Other

Other: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015
Country: Germany
City: Dresden
Period: 6/9/15 – 10/9/15

Keywords

  • Arabic
  • Corpora creation
  • Crowdsourcing
  • Dialect classification
  • Human computation
  • Speech corpora

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Wray, S., & Ali, A. (2015). Crowdsource a little to label a lot: Labeling a speech corpus of dialectal Arabic. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Vol. 2015-January, pp. 2824-2828). International Speech and Communication Association.

Crowdsource a little to label a lot: Labeling a speech corpus of dialectal Arabic. / Wray, Samantha; Ali, Ahmed.

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Vol. 2015-January. International Speech and Communication Association, 2015. p. 2824-2828.

Wray, S & Ali, A 2015, Crowdsource a little to label a lot: Labeling a speech corpus of dialectal Arabic. in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. vol. 2015-January, International Speech and Communication Association, pp. 2824-2828, 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015, Dresden, Germany, 6 Sep 2015.
Wray S, Ali A. Crowdsource a little to label a lot: Labeling a speech corpus of dialectal Arabic. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Vol. 2015-January. International Speech and Communication Association. 2015. p. 2824-2828
Wray, Samantha; Ali, Ahmed. / Crowdsource a little to label a lot: Labeling a speech corpus of dialectal Arabic. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Vol. 2015-January. International Speech and Communication Association, 2015. pp. 2824-2828
@inproceedings{1f6a11aa4b1c4332824e872a5f284a11,
title = "Crowdsource a little to label a lot: Labeling a speech corpus of dialectal Arabic",
abstract = "Arabic is a language with great dialectal variety, with Modern Standard Arabic (MSA) being the only standardized dialect. Spoken Arabic is characterized by frequent code-switching between MSA and Dialectal Arabic (DA). DA varieties are typically differentiated by region, but despite their wide-spread usage, they are under-resourced and lack viable corpora and tools necessary for speech recognition and natural language processing. Existing DA speech corpora are limited in scope, consisting of mainly telephone conversations and scripted speech. In this paper we describe our efforts for using crowdsourcing to create a labeled multi-dialectal speech corpus. We obtained utterance-level dialect labels for 57 hours of high-quality audio from Al Jazeera consisting of four major varieties of DA: Egyptian, Levantine, Gulf, and North African. Using speaker linking to identify utterances spoken by the same speaker, and measures of label accuracy likelihood based on annotator behavior, we automatically labeled an additional 94 hours. The complete corpus contains 850 hours with approximately 18{\%} DA speech.",
keywords = "Arabic, Corpora creation, Crowdsourcing, Dialect classification, Human computation, Speech corpora",
author = "Samantha Wray and Ahmed Ali",
year = "2015",
language = "English",
volume = "2015-January",
pages = "2824--2828",
booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
publisher = "International Speech and Communication Association",

}

TY  - GEN
T1  - Crowdsource a little to label a lot
T2  - Labeling a speech corpus of dialectal Arabic
AU  - Wray, Samantha
AU  - Ali, Ahmed
PY  - 2015
Y1  - 2015
AB  - Arabic is a language with great dialectal variety, with Modern Standard Arabic (MSA) being the only standardized dialect. Spoken Arabic is characterized by frequent code-switching between MSA and Dialectal Arabic (DA). DA varieties are typically differentiated by region, but despite their wide-spread usage, they are under-resourced and lack viable corpora and tools necessary for speech recognition and natural language processing. Existing DA speech corpora are limited in scope, consisting of mainly telephone conversations and scripted speech. In this paper we describe our efforts for using crowdsourcing to create a labeled multi-dialectal speech corpus. We obtained utterance-level dialect labels for 57 hours of high-quality audio from Al Jazeera consisting of four major varieties of DA: Egyptian, Levantine, Gulf, and North African. Using speaker linking to identify utterances spoken by the same speaker, and measures of label accuracy likelihood based on annotator behavior, we automatically labeled an additional 94 hours. The complete corpus contains 850 hours with approximately 18% DA speech.
KW  - Arabic
KW  - Corpora creation
KW  - Crowdsourcing
KW  - Dialect classification
KW  - Human computation
KW  - Speech corpora
UR  - http://www.scopus.com/inward/record.url?scp=84959102833&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84959102833&partnerID=8YFLogxK
M3  - Conference contribution
VL  - 2015-January
SP  - 2824
EP  - 2828
BT  - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
PB  - International Speech and Communication Association
ER  - 