Altruistic Crowdsourcing for Arabic Speech Corpus Annotation

Soumia Bougrine, Hadda Cherroun, Ahmed Abdelali

Research output: Contribution to journal › Conference article

1 Citation (Scopus)

Abstract

Crowdsourcing is an emerging collaborative approach that can be used for the effective annotation of linguistic resources. There are several crowdsourcing genres: paid-for, games with a purpose, and altruistic (volunteer-based) approaches. In this paper, we investigate the use of altruistic crowdsourcing for speech corpus annotation by narrating our experience of validating a semi-automatic task for the dialect annotation of Kalam'DZ, a corpus dedicated to Algerian Arabic dialectal varieties. We start by describing the whole process of designing an altruistic crowdsourcing project. Using the unpaid Crowdcrafting platform, we performed experiments on a sample covering 10% of the Kalam'DZ corpus, totaling more than 10 hours of speech from 1,012 speakers. The crowdsourcing job is evaluated by comparison with a gold-standard annotation produced by experts, which confirms a high inter-annotator agreement of 81%. Our results confirm that altruistic crowdsourcing is an effective approach for speech dialect annotation. In addition, we present a set of best practices for altruistic crowdsourcing for corpus annotation.
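
The abstract reports that crowd annotations were evaluated against an expert gold standard, reaching 81% agreement. As an illustration only, below is a minimal Python sketch of how such an agreement score could be computed. The dialect labels and data are hypothetical, and the paper does not state which agreement measure it uses, so plain percent agreement and Cohen's kappa are shown as two plausible choices.

from collections import Counter

def percent_agreement(crowd, gold):
    """Fraction of segments where the crowd label matches the expert label."""
    assert len(crowd) == len(gold) and gold
    return sum(c == g for c, g in zip(crowd, gold)) / len(gold)

def cohens_kappa(crowd, gold):
    """Chance-corrected agreement between the crowd and gold label sequences."""
    n = len(gold)
    p_o = percent_agreement(crowd, gold)  # observed agreement
    crowd_counts, gold_counts = Counter(crowd), Counter(gold)
    labels = set(crowd) | set(gold)
    # Expected agreement if both sides labeled at random with their own marginals.
    p_e = sum((crowd_counts[l] / n) * (gold_counts[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-segment dialect labels; not data from the paper.
gold  = ["Algiers", "Oran", "Algiers", "Constantine", "Oran", "Algiers"]
crowd = ["Algiers", "Oran", "Oran",    "Constantine", "Oran", "Algiers"]

print(f"percent agreement: {percent_agreement(crowd, gold):.2f}")  # 0.83
print(f"Cohen's kappa:     {cohens_kappa(crowd, gold):.2f}")       # 0.74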

Original language: English
Pages (from-to): 137-144
Number of pages: 8
Journal: Procedia Computer Science
Volume: 117
DOI: 10.1016/j.procs.2017.10.102
ISSN: 1877-0509
Publisher: Elsevier BV
Publication status: Published - 1 Jan 2017

Keywords

  • Altruistic Crowdsourcing
  • Arabic Algerian Dialects
  • Best Practices
  • Crowdcrafting
  • Dialect Annotation

ASJC Scopus subject areas

  • Computer Science (all)
