Crowdsourcing speech and language data for resource-poor languages

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we present the benefits of using crowdsourcing to build speech and language resources for different annotation tasks, using dialectal Arabic as an example of a resource-poor language. We give recommendations for job design and quality control that allow us to build high-quality data for a variety of tasks. Most of these recommendations are language-independent and can be applied to other languages as well. We summarize lessons learned from experiments in data acquisition tasks, such as image annotation (transcription of Arabic historical documents), machine translation (translation from English to Hindi), speech annotation (transcription of dialectal Arabic audio files), text annotation (conversion from dialectal Arabic to Modern Standard Arabic (MSA)), and text classification (annotation of offensive language on Arabic social media, and classification of questions on Arabic medical web forums).

Original language: English
Title of host publication: Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2017
Publisher: Springer Verlag
Pages: 440-447
Number of pages: 8
ISBN (Print): 9783319648606
DOIs: https://doi.org/10.1007/978-3-319-64861-3_41
Publication status: Published - 1 Jan 2018
Event: 3rd International Conference on Advanced Intelligent Systems and Informatics, AISI 2017 - Cairo, Egypt
Duration: 9 Sep 2017 to 11 Sep 2017

Publication series

Name: Advances in Intelligent Systems and Computing
Volume: 639
ISSN (Print): 2194-5357

Other

Other: 3rd International Conference on Advanced Intelligent Systems and Informatics, AISI 2017
Country: Egypt
City: Cairo
Period: 9/9/17 to 11/9/17

Fingerprint

Transcription
Quality control
Data acquisition
Experiments

Keywords

  • Crowdsourcing
  • Dialectal Arabic
  • Low-resource languages

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Computer Science (all)

Cite this

Mubarak, H. (2018). Crowdsourcing speech and language data for resource-poor languages. In Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2017 (pp. 440-447). (Advances in Intelligent Systems and Computing; Vol. 639). Springer Verlag. https://doi.org/10.1007/978-3-319-64861-3_41

@inproceedings{5d6a5aadb16746199fd502820aa2e51c,
title = "Crowdsourcing speech and language data for resource-poor languages",
abstract = "In this paper, we present the benefits of using crowdsourcing to build speech and language resources for different annotation tasks, using dialectal Arabic as an example of a resource-poor language. We give recommendations for job design and quality control that allow us to build high-quality data for a variety of tasks. Most of these recommendations are language-independent and can be applied to other languages as well. We summarize lessons learned from experiments in data acquisition tasks, such as image annotation (transcription of Arabic historical documents), machine translation (translation from English to Hindi), speech annotation (transcription of dialectal Arabic audio files), text annotation (conversion from dialectal Arabic to Modern Standard Arabic (MSA)), and text classification (annotation of offensive language on Arabic social media, and classification of questions on Arabic medical web forums).",
keywords = "Crowdsourcing, Dialectal Arabic, Low-resource languages",
author = "Hamdy Mubarak",
year = "2018",
month = "1",
day = "1",
doi = "10.1007/978-3-319-64861-3_41",
language = "English",
isbn = "9783319648606",
series = "Advances in Intelligent Systems and Computing",
publisher = "Springer Verlag",
pages = "440--447",
booktitle = "Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2017",

}

TY - GEN

T1 - Crowdsourcing speech and language data for resource-poor languages

AU - Mubarak, Hamdy

PY - 2018/1/1

Y1 - 2018/1/1

AB - In this paper, we present the benefits of using crowdsourcing to build speech and language resources for different annotation tasks, using dialectal Arabic as an example of a resource-poor language. We give recommendations for job design and quality control that allow us to build high-quality data for a variety of tasks. Most of these recommendations are language-independent and can be applied to other languages as well. We summarize lessons learned from experiments in data acquisition tasks, such as image annotation (transcription of Arabic historical documents), machine translation (translation from English to Hindi), speech annotation (transcription of dialectal Arabic audio files), text annotation (conversion from dialectal Arabic to Modern Standard Arabic (MSA)), and text classification (annotation of offensive language on Arabic social media, and classification of questions on Arabic medical web forums).

KW - Crowdsourcing

KW - Dialectal Arabic

KW - Low-resource languages

UR - http://www.scopus.com/inward/record.url?scp=85029469138&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85029469138&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-64861-3_41

DO - 10.1007/978-3-319-64861-3_41

M3 - Conference contribution

SN - 9783319648606

T3 - Advances in Intelligent Systems and Computing

SP - 440

EP - 447

BT - Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2017

PB - Springer Verlag

ER -